Proactive management of software aging


by V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, W. P. Zeggert


Software failures are now known to be a dominant source of system outages. Several studies and much anecdotal evidence point to “software aging” as a common phenomenon, in which the state of a software system degrades with time. Exhaustion of system resources, data corruption, and numerical error accumulation are the primary symptoms of this degradation, which may eventually lead to performance degradation of the software, crash/hang failure, or other undesirable effects. “Software rejuvenation” is a proactive technique intended to reduce the probability of future unplanned outages due to aging. The basic idea is to pause or halt the running software, refresh its internal state, and resume or restart it. Software rejuvenation can be performed by relying on a variety of indicators of aging, or on the time elapsed since the last rejuvenation. In response to the strong desire of customers to be provided with advance notice of unplanned outages, our group has developed techniques that detect the occurrence of software aging due to resource exhaustion, estimate the time remaining until the exhaustion reaches a critical level, and automatically perform proactive software rejuvenation of an application, process group, or entire operating system, depending on the pervasiveness of the resource exhaustion and our ability to pinpoint the source. This technology has been incorporated into the IBM Director for xSeries servers. To quantitatively evaluate the impact of different rejuvenation policies on the availability of cluster systems, we have developed analytical models based on stochastic reward nets (SRNs). For time-based rejuvenation policies, we determined the optimal rejuvenation interval based on system availability and cost. We also analyzed a rejuvenation policy based on prediction, and showed that it can further increase system availability and reduce downtime cost. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore.

1. Introduction

Software aging

Unplanned computer system outages are more likely to be the result of software failures than of hardware failures [1, 2]. Moreover, software often exhibits an increasing failure rate over time, typically because of increasing and unbounded resource consumption, data corruption, and numerical error accumulation. This constitutes a phenomenon called software aging [3], and may be caused by errors in the application, middleware, or operating system. Under aging conditions, the state of the software degrades gradually with time, inevitably resulting in undesirable consequences. Some typical causes of this degradation are memory bloating and leaking, unterminated threads, unreleased file-locks, data corruption, storage-space fragmentation, and accumulation of round-off errors. This phenomenon has been reported by Huang et al. [3] in telecommunications billing applications, where over time the application experiences a crash or a hang failure. Avritzer and Weyuker discuss aging in telecommunication switching software, in which the effect manifests itself as gradual performance degradation [4]. Software aging has been observed not only in specialized software, but also in widely used software, where rebooting to clear a problem is a common practice.

©Copyright 2001 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

0018-8646/01/$5.00 © 2001 IBM

IBM J. RES. & DEV. VOL. 45 NO. 2 MARCH 2001 V. CASTELLI ET AL.

Aging occurs because software is extremely complex and never wholly free of errors. It is almost impossible to fully test and verify that a piece of software is bug-free. This situation is further exacerbated by the fact that software development tends to be extremely time-to-market-driven, which results in applications that meet short-term market needs, yet do not account very well for long-term ramifications such as reliability. Hence, residual faults have to be tolerated in the operational phase. These residual faults can take various forms, but the ones that we are concerned with cause long-term depletion of system resources such as memory, threads, and kernel tables. The essentially economic problem of developing and producing bug-free code is not the problem at hand; instead, we address one of the problems that arise from the prevailing approach to developing software, and one way of attacking that problem is software rejuvenation.

Software rejuvenation

To counteract software aging, a proactive technique called software rejuvenation has been devised [3]. It involves stopping the running software occasionally, “cleaning” its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures) and restarting it. An extreme but well-known example of rejuvenation is a system reboot. There are numerous examples in real-life systems where software rejuvenation is being used. For example, it has been implemented in the real-time system collecting billing data for most telephone exchanges in the United States [5]. Software capacity restoration, a technique similar to rejuvenation, has been used by Avritzer and Weyuker in a large telecommunications-switching software application [4]. In this case, the switching computer is rebooted occasionally, which restores its service rate to the peak value. Grey [6] proposed performing operations solely for fault management in Strategic Defense Initiative (SDI) software which are invoked whether or not the fault exists, and called it operational redundancy. Tai et al. [7] have proposed and analyzed the use of onboard preventive maintenance for maximizing the probability of successful mission completion for spacecraft with very long mission times. The necessity of performing preventive maintenance in a safety-critical environment is evident from the example of aging in Patriot missile software [8]. The failure, which resulted in loss of human lives, might have been prevented had the operators heeded the advice that the system had to be restarted after every eight hours of running time. The Apache Web Server¹ (from The Apache Software Foundation) provides a means to prevent itself from becoming too much of a resource burden on a system. Apache has a controlling process and a handler process. The controlling process watches the handler process to ensure that it is running up to standard. The handler process, on the other hand, handles requests from the clients. When the handler process is deemed to be in a bad state, the controlling process stops it and starts another process.

Most current fault-tolerance techniques are reactive in nature. Proactive fault management, on the other hand, takes suitable corrective action to prevent a failure before the system experiences a fault. Although this technique has long been used on an ad hoc basis in physical systems, it has only recently gained recognition and importance for computer systems. Software rejuvenation is a specific form of proactive fault management which can be performed at suitable times, such as when there is no load on the system, and thus typically results in less downtime and cost than the reactive approach. Since proactive fault management incurs some overhead, an important research issue is to determine the optimal times to invoke it in operational software systems. Proactive fault management can be greatly enhanced by the ability to predict the fault far enough in advance that one can take action to avoid or mitigate its effects. Resource exhaustion by its very nature offers clues that failure is imminent, in the form of parameters that can be monitored, extrapolated, and compared to thresholds via suitable algorithms.

¹ The Apache Group, Apache http Server Project, http://www.apache.org.

Software failure prediction and rejuvenation in a cluster environment

A cluster is a collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself [9]. Clustering has proven to be an effective method of scaling to larger systems for added performance, more users, or other attributes [10], as well as providing higher levels of availability and lower management costs. One of the benefits of clustering is its natural redundancy of hardware and software components. A number of single points of failure can be removed by cluster systems. As part of a clustered system, a “failover” process is usually employed, which transfers workload to another portion of the clustered system when a hardware or software failure occurs. An objective of the failover process is for it to occur gracefully, without an end user knowing that a failure has occurred. In practice, the successful achievement of this objective is highly application-dependent. Also, in order to ensure the ability to perform a failover, sufficient spare resources must be available to accommodate the migrated workload. Another advantage of a clustered system is its ability to improve system maintenance. For example, if a specific resource of a cluster is in need of repair and a spare resource is available, it is possible, in a planned manner, to move the load from the resource being repaired, perform a shutdown, and remove and replace the resource, if necessary.

Software rejuvenation technology is a natural fit with clustered systems. Within a clustered environment, rejuvenation can be performed by invoking the cluster’s failover mechanisms, either on a periodic basis based on prior experience of the time to resource exhaustion, or extemporaneously, upon prediction of an impending resource exhaustion. Our analyses (discussed in the following sections) show that combining software rejuvenation with clustering significantly increases system availability. Using the node failover mechanisms in a high-availability cluster, one can maintain operation (though possibly at a degraded level) while rejuvenating one node at a time, assuming that a node rejuvenation takes less time to perform and is far less disruptive than recovering from an unplanned node failure. Because the user’s application has presumably been written to survive node failovers anyway, this environment has the added advantage of allowing rejuvenation to be transparent to the application. Simple time-based rejuvenation policies, in which the nodes are rejuvenated at regular intervals, can be implemented easily, using the existing cluster-management infrastructure. System availability can be further enhanced by taking a proactive approach to detect and predict an impending outage of a specific server in order to initiate planned failover in a more orderly fashion. This approach not only improves the end user’s perception of service provided by the system, but also gives the system administrator additional time to work around any system capacity issues that may arise.

Because of these attractive possibilities, we have implemented the first commercial version of the xSeries* Software Rejuvenation Agent (SRA) within a high-availability clustered environment, which is being managed by the IBM Director system management tool. We are also in the process of analyzing and implementing the use of software rejuvenation within other applications such as large web-server pools, but results are not yet available.

Main contribution and outline

The main contribution of this paper is the development of a methodology for proactive management of software systems which are prone to aging, and specifically to resource exhaustion. The application of software rejuvenation for cluster systems is by itself a novel contribution. We have designed and developed a rejuvenation agent for IBM Director which manages a highly available clustered environment. Several algorithms for prediction of resource exhaustion, which is an important component of this methodology, are discussed. Rejuvenation incurs some overhead in terms of both downtime and cost, and if done more often than necessary will result in higher downtime and/or cost. Hence, stochastic models of the cluster system are developed and analyzed for some rejuvenation policies. For the time-based policies, we demonstrate the effect of the rejuvenation interval (defined as the time between successive rejuvenations) on the steady-state expected downtime and cost. We then obtain the optimal value of this interval, which minimizes the downtime and cost for an assumed set of parameter values. These models are very general and can capture a multitude of cluster system characteristics, failure behavior, and performability measures.

The remainder of this paper is organized as follows. Section 2 discusses previous related work and outlines the novel aspects of our work. The xSeries Software Rejuvenation Agent, the architecture of IBM Director, and how rejuvenation is performed in this framework, are described in Section 3. The various statistical algorithms used for prediction are also discussed in this section. Section 4 deals with stochastic models for cluster systems and analysis of some rejuvenation policies. The models capture several cluster system characteristics, and rejuvenation is performed in a cluster environment. Since rejuvenation incurs an overhead, these models are helpful in comparing several rejuvenation policies and optimizing them for maximum benefit in terms of cost and availability. Section 5 discusses experimental studies of software aging for some common applications. Empirical data on their aging characteristics are obtained and studied. These studies are essential in understanding important parameters to monitor and in developing application-specific rejuvenation strategies. Conclusions and possible extensions to our work are discussed in Section 6.


2. Related work

Garg et al. [11] present a general methodology for detecting and estimating trends and times to exhaustion of operating system resources due to software aging. In their work, data on system activity and resource usage was collected at regular intervals using a Simple Network Management Protocol (SNMP)-based tool. Whereas in [11] only time-based trend detection and estimation of resource exhaustion are considered, [12] takes the system workload into account for building a model. Other work in measurement-based dependability evaluation is based on measurements made either at failure times [13–15] or at error observation times [16–18]. In [11] and [12], the system parameters are monitored continuously, since the evaluation is interested in trend estimation and not in interfailure times or identifying error patterns. Some of the above papers deal with hardware failures, while [11] and [12] are concerned solely with software failures, in particular failures due to resource exhaustion.

Iyer and Rossetti [14] study the effect of system workload on failures through a measurement-based analysis and report significant correlations between permanent failures and increased system activity. A methodology for recognizing the symptoms of persistent problems is proposed in [16], which identifies and statistically validates recurring patterns among error records produced in the system. In [19], system parameters are constantly monitored to detect anomalies automatically. This procedure is based on the premise that an anomaly or a deviation from “normal” behavior is usually a symptom of error. This study focuses on network failures, and smoothing techniques are applied to build a normal behavior profile.

Several analytical studies have assessed the effectiveness of software rejuvenation. In [3, 7, 20–22] only failures causing unavailability of the software are considered, while in [23] only a gradually decreasing service rate of a software application which serves transactions is assumed. In [24], however, both these effects of aging are considered together in a single model. Models proposed in [3, 20, 21] are restricted to hypoexponentially distributed time to failure. Our analysis is very similar in that it uses hypoexponentially distributed time to failure for the software; however, in [20], Markov regenerative theory is used to solve the model, whereas we use an Erlang approximation. The models proposed in [7, 22, 23] can accommodate general distributions, but only for the specific aging effect they capture. Generally distributed time to failure, and the service rate being an arbitrary function of time, are allowed in [24]. It has been noted [2] that transient failures are partly caused by overload conditions, but only the model presented in [24] captures the effect of load on aging. In [25], an availability model of a two-node cluster is described. Different failure scenarios are modeled, and the availability is analyzed. While that paper concentrates on hardware failures, and software rejuvenation is not considered, this paper considers rejuvenation with software failures only. Existing models also differ in the measures being evaluated and the assumptions underlying the analysis. For example, in [7, 22], software with a finite mission time is considered, whereas we consider continuously running systems. In [3, 20, 21, 24], measures of interest in transaction-based software intended to run forever are evaluated; we analyze availability and cost. All previous models except [7] and [22] are just special cases of the model presented in [24].

In [25], an availability model of a two-node management scheduling and control system (MSCS) is described. Different failure scenarios are modeled, and availability is analyzed. A cluster system modeled with working, failed, and intermediate states is described in [26]. Reliability analysis is carried out for a telecommunication system application for a range of failure-rate and fault-coverage values. In [27], hardware, operating system, and application software reliability techniques are discussed for the modeled cluster systems. Reliability levels in terms of fault detection, fault recovery, volatile data consistency, and persistent data consistency are described. While the above papers model hardware and software failures, and software rejuvenation is not considered, our modeling and analysis consider rejuvenation with software failures only.

3. The xSeries Software Rejuvenation Agent

The xSeries Software Rejuvenation Agent (SRA) was designed to monitor consumable resources, estimate the time to exhaustion of those resources, and generate alerts to the management infrastructure when the time to exhaustion is less than a user-defined notification horizon. The management infrastructure provides a graphical user interface for the user to configure the SRA, and accepts and acts upon the alerts as described below.
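The alerting rule in the paragraph above can be illustrated with a minimal sketch. It assumes a simple two-point linear extrapolation, which is a stand-in and not the SRA's own estimator; the names `Sample`, `hours_to_exhaustion`, and `should_alert` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    t_hours: float  # observation time, in hours
    value: float    # observed resource level, e.g. free memory in MB


def hours_to_exhaustion(first: Sample, last: Sample, limit: float) -> Optional[float]:
    """Linearly extrapolate from two samples to the exhaustion limit.

    Returns None when the resource is not trending toward the limit.
    The two-point linear fit is an assumption made for illustration.
    """
    dt = last.t_hours - first.t_hours
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    rate = (last.value - first.value) / dt  # change per hour
    gap = limit - last.value                # distance still to travel
    if gap == 0:
        return 0.0
    if rate == 0 or (gap > 0) != (rate > 0):
        return None                         # flat, or moving away from the limit
    return gap / rate


def should_alert(time_to_exhaustion: Optional[float], horizon_hours: float) -> bool:
    """Alert only when exhaustion is predicted inside the notification horizon."""
    return time_to_exhaustion is not None and time_to_exhaustion < horizon_hours


# Free memory fell from 1000 MB to 760 MB in 24 hours; 40 MB is treated as exhausted.
tte = hours_to_exhaustion(Sample(0, 1000), Sample(24, 760), limit=40)
alert_4_days = should_alert(tte, horizon_hours=96)
alert_2_days = should_alert(tte, horizon_hours=48)
```

With these sample values the loss rate is 10 MB per hour, so a four-day horizon triggers an alert while a two-day horizon does not.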

The SRA was designed according to a number of ground rules, with the prime objective of maximizing flexibility, portability, and customer acceptance. For maximum generality and user acceptance, no modification to the application is allowed, to support either failure prediction or rejuvenation. Similarly, no access to the kernel is allowed, in order to facilitate error containment, cross-OS portability, and customer acceptance. The agent must use published and architected interfaces for data acquisition, alerting, and rejuvenation in order to minimize sensitivity to gratuitous interface churn and provide a product that is relatively stable across multiple generations of operating systems and applications. The agent must be relatively portable across operating systems to allow us to economically attack the different markets of commercial interest to IBM. A simple user interface is required which contains a minimum number of tunable parameters and is easy to set up and understand. Finally, because in many cases we do not know in advance which resources will be exhausted in the myriad environments in which we will be using SRA, the agent must be able to adapt to monitor new exhaustible parameters and execute new algorithms to predict exhaustion of those resources. These considerations caused us to partition the design into an OS-dependent data acquisition subsystem, a portable analytical subsystem with highly configurable input parameters, an architected management interface, and a management infrastructure that spans multiple IBM-supported operating systems. The SRA is incorporated as a component of IBM Director, which is discussed next.

Architecture and framework of IBM Director

Tivoli provides various systems management options for large-size as well as medium-size companies. Their technology is leveraged by xSeries for IBM Director and provides the xSeries brand with the framework of a high-quality systems management toolkit, while allowing us the flexibility to develop value-add functionality such as the SRA. IBM Director is a three-tier environment (Figure 1) that supports this flexibility on top of a highly scalable infrastructure. This cross-platform flexibility allows the xSeries brand to meet customer needs, whether they are using a Microsoft** operating system, Novell**, OS/2*, SCO**, or Linux.

The three tiers of IBM Director are the console, server, and agent. The console provides a Java**-based interface for accessing the functionality (via a set of tasks) of the IBM Director environment. The server controls access to the function, data, and agents for a given task, and the agent is the interface to a managed object, which in our case is a server or cluster of xSeries servers. Events provide a notification mechanism from an agent into the IBM Director environment. IBM Director is designed to operate in a client/server model and consists of the following components:

● Management console: The IBM Director management console is the graphical user interface (GUI) from which administrative tasks are performed. It is the primary interface with the administrator, and is used to configure the SRA as described below. The management console GUI is Java-based, with all state information stored on the server. It runs as a locally installed Java application in a Java Virtual Machine (JVM**).

● Management server: The management server is the platform used for the central management server, where management databases, the server engine, and management application logic reside.

● IBM Director agents: The agents reside on each managed system (such as an xSeries server) and act as passive, nonintrusive native applications. The SRA task that collects data, predicts resource exhaustion, and generates events runs as an agent.

In addition to the SRA function, IBM Director supports a comprehensive set of tasks for agent nodes. These nodes communicate directly with the IBM Director server, allowing numerous tasks to be performed, of which the following short list is representative:

● Inventory: IBM Director discovers new managed systems, collects the appropriate information about these systems, and stores it in the inventory database. It can then be viewed through either a default or a customized view.

● Resource monitors: Resource monitors (Figure 2) enable the user to view statistics and usage of critical resources on the network. Information can be collected and monitored on attributes such as CPU, disk, memory, and network. SRA is a specialized instance of a resource monitor.

● Event management: Event management (Figure 3) enables the user to view a log of events that have occurred for a managed system or group of systems and to create event action plans to associate an event with a desired action, such as sending an e-mail, starting a program, logging to a file, or invoking rejuvenation. When the SRA has detected an impending resource exhaustion, it generates events that can be viewed using this functionality.

Figure 1. IBM Director framework. (Diagram labels: Netfinity Director management network infrastructure; task interaction, data repository, management element; Console GUI (Java); Server (database); Agent #0 through Agent #N, system interface (logic).)

Figure 2. Resource monitors.
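The event action plans described in the list above associate an event with one or more actions. A minimal sketch of that association follows; the class, the event-type string, and the action functions are hypothetical illustrations and do not reflect the IBM Director API.

```python
from typing import Callable, Dict, List

Action = Callable[[dict], str]


def send_email(event: dict) -> str:
    # Stand-in for a real mail action; only returns a description of what it would do.
    return f"email: {event['type']} on {event['node']}"


def invoke_rejuvenation(event: dict) -> str:
    # Stand-in for triggering rejuvenation of the affected node.
    return f"rejuvenate: {event['node']}"


class EventActionPlan:
    """Maps event types to the ordered list of actions to run for them."""

    def __init__(self) -> None:
        self._actions: Dict[str, List[Action]] = {}

    def associate(self, event_type: str, action: Action) -> None:
        self._actions.setdefault(event_type, []).append(action)

    def handle(self, event: dict) -> List[str]:
        # Events with no associated action are simply ignored.
        return [action(event) for action in self._actions.get(event["type"], [])]


plan = EventActionPlan()
plan.associate("resource_exhaustion_predicted", send_email)
plan.associate("resource_exhaustion_predicted", invoke_rejuvenation)
results = plan.handle({"type": "resource_exhaustion_predicted", "node": "zico"})
```

Keeping the plan as plain data (event type to action list) mirrors the idea that the agent only raises events, while the management side decides what to do with them.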

Software rejuvenation in the IBM Director environment

User interface

Software rejuvenation is presented to the user as a highly stylized means to schedule rejuvenation. The user has at his disposal the building blocks of a schedule and a set of resources. The resources may be scheduled for time-based rejuvenation or configured for exhaustion forecasting.

Figure 3. Event management.

Figure 4. Software rejuvenation calendar.

Figure 5. Scheduling rejuvenation.

The SRA user interface is presented to the user in the form of a calendar (Figure 4) in which the user conceptually can see and manipulate rejuvenations. To schedule a rejuvenation for a certain day, the user drags and drops a node of the cluster from the left-hand side of the interface (e.g., “platini” and “zico” are servers within cluster “copamundial”) onto the day of the week when rejuvenation is desired. A follow-up dialogue (Figure 5) then negotiates whether the user wishes the rejuvenation to occur daily, weekly, monthly, or on some other periodic basis, and the time at which the rejuvenation is to occur. The user can designate certain days of the week as invalid, when no rejuvenation can occur, and multiple rejuvenations are prohibited from being scheduled at the same time. Among other rejuvenation options, the user can engage cluster-specific logic designed to ensure that a properly planned failover can occur. As shown in the left-hand side of Figure 6, the user can ask the rejuvenation logic to confirm that at least one backup node exists which can handle the failover workload prior to rejuvenating (“check for one”), ensure that all backup nodes can handle the workload (“check for all”), or rejuvenate without checking (“skip check”). If one of the first two options is selected and a backup node cannot be found, rejuvenation is postponed until the next opportunity.
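The three failover-check options described above can be sketched as a small decision function. This is an illustrative sketch only: it assumes that "can handle the workload" reduces to comparing a spare-capacity number against the workload, and all names are hypothetical.

```python
from enum import Enum
from typing import Sequence


class FailoverCheck(Enum):
    CHECK_FOR_ONE = "check for one"  # at least one backup node must fit the workload
    CHECK_FOR_ALL = "check for all"  # every backup node must fit the workload
    SKIP_CHECK = "skip check"        # rejuvenate without checking


def may_rejuvenate(policy: FailoverCheck, workload: float,
                   backup_spare_capacity: Sequence[float]) -> bool:
    """Return True if rejuvenation may proceed now, False if it must be postponed."""
    if policy is FailoverCheck.SKIP_CHECK:
        return True
    fits = [capacity >= workload for capacity in backup_spare_capacity]
    if policy is FailoverCheck.CHECK_FOR_ONE:
        return any(fits)
    # CHECK_FOR_ALL: an empty backup list counts as "no backup found".
    return bool(fits) and all(fits)


def next_step(policy: FailoverCheck, workload: float,
              backup_spare_capacity: Sequence[float]) -> str:
    ok = may_rejuvenate(policy, workload, backup_spare_capacity)
    return "rejuvenate" if ok else "postpone until next opportunity"
```

For example, with a workload of 40 units and backups offering 10 and 50 units of spare capacity, "check for one" allows rejuvenation while "check for all" postpones it.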

The proactive option of exhaustion prediction is provided to evoke an advanced and noninteractive scheme for scheduling resources for rejuvenation. The user can set up a stand-alone system or a clustered system for this level of support. In the expert mode of operation for exhaustion prediction, the user can allow an application to schedule rejuvenation automatically without his interaction. The user configures a cluster or a node for prediction capabilities by invoking a simple prediction configuration menu (Figure 7), and selecting the notification horizon and the type of action desired when exhaustion is predicted to occur within that horizon. Very few other parameters are configurable by the user.

A notification mechanism, built on IBM Director’s event mechanism, allows for notification of any impending exhaustion through either a pop-up or ticker tape. All notifications are done with the IBM Director event-driven notification mechanism. These events are used to drive the scheduling of an automatic rejuvenation, of notifications, and of other user-defined actions (such as running a remote program). The status is reported using the IBM Director-provided event log mechanism.

Agent design

The xSeries Software Rejuvenation Agent is responsible for monitoring the exhaustible parameters, predicting their exhaustion, and providing alerts to the management infrastructure. The data acquisition component of the agent is specific to the operating system. For Windows**, it reads the registry performance counters and collects parameters such as available bytes, committed bytes, nonpaged pool, paged pool, handles, and logical disk utilization. (Interestingly, a memory leak in the system call needed to obtain the performance counters forced us to design the agent so that it could itself be rejuvenated.) For Linux, the agent accesses the /proc directory structure and collects equivalent parameters such as memory utilization, file descriptors, inodes, and swap space. All collected parameters are logged on disk. They are also stored in memory in preparation for the curve-fitting and consumption extrapolation.
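As a rough illustration of the Linux data-acquisition path, the sketch below parses text in the `Name: value kB` layout used by /proc/meminfo and derives memory and swap utilization. The function names and the sample values are hypothetical; the source does not say which /proc files or fields the SRA actually reads.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse 'Name:   12345 kB' lines into {name: value}; other lines are skipped."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def utilization(total: int, free: int) -> float:
    """Fraction of the resource in use; 0.0 if the resource is absent."""
    return 0.0 if total == 0 else (total - free) / total


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read the live file on a Linux host (not exercised in the example below)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


# Illustrative sample text, not real measurements.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         2000000 kB
SwapTotal:       4000000 kB
SwapFree:        3000000 kB
"""
info = parse_meminfo(SAMPLE)
memory_used = utilization(info["MemTotal"], info["MemFree"])
swap_used = utilization(info["SwapTotal"], info["SwapFree"])
```

Separating parsing from file access keeps the parser testable on any platform, in the spirit of the OS-dependent acquisition layer feeding a portable analysis layer.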

Prediction algorithms

In the current version of SRA, rejuvenation can be based on elapsed time since the last rejuvenation, or on prediction of impending exhaustion.

When using timed rejuvenation, the user interface of IBM Director is used to schedule and perform rejuvenation at a period specified by the user. A calendar interface allows the user to select when to rejuvenate different nodes of the cluster, and to select “blackout” times during which no rejuvenation is to be allowed.

Although this sounds rather primitive, our analysis (presented below) shows that for typical clusters that undergo aging, system availability can be improved significantly via this technique.

Single-parameter predictive rejuvenation relies on curve-fitting analysis and projection, using recently observed data. The projected data is compared to prespecified upper and lower exhaustion thresholds within a notification time horizon. The user specifies the notification horizon and the desired parameters (some parameters believed to be highly indicative are always monitored by default), and the agent automatically performs the analysis.

The curve-fitting algorithm operates on a sliding window of data spanning a temporal interval which is a fixed fraction (say, 1/3) of the notification horizon. For example, if the user wishes to be informed or have rejuvenation invoked if exhaustion is projected to occur within, say, three days, the data window is set to one day, and the analysis and extrapolation over the three-day horizon are performed using that one day’s worth of data. The sampling interval is selected to provide enough data points within the fitting window to allow the prediction algorithm to adequately smooth the data and select an appropriate prediction function without overfitting the data.

Figure 6 Scheduling options.

Figure 7 Configuring for prediction.
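The relationship between the notification horizon, the fitting window, and the sampling interval can be sketched as follows. The 1/3 fraction is the example value from the text; the number of points per window (288) is purely an assumption for illustration, since the text does not state how many samples the agent collects.

```python
from fractions import Fraction


def fitting_window(horizon_s, window_fraction=Fraction(1, 3), points_in_window=288):
    """Return (window length, sampling interval) in seconds for a given horizon.

    window_fraction is the fixed fraction of the horizon used for fitting;
    points_in_window is a hypothetical target number of samples per window.
    """
    window_s = horizon_s * window_fraction
    sampling_interval_s = window_s / points_in_window
    return float(window_s), float(sampling_interval_s)


three_days_s = 3 * 24 * 3600
window_s, interval_s = fitting_window(three_days_s)
print(window_s, interval_s)  # one day of data, sampled every five minutes
```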

The prediction algorithm fits several types of curves to the data in the fitting window; these curves have been selected for their ability to capture different types of temporal trends. A model-selection criterion is applied to choose the best prediction curve, which is then extrapolated to the user-specified horizon. Several parameters that are indicative of resource exhaustion are monitored and extrapolated independently. If any monitored parameter exceeds the specified minimum or maximum value within the horizon, a request to rejuvenate is sent to the management infrastructure. In most cases, it is also possible to identify the process that is consuming the preponderance of the resource being exhausted, in order to support selective rejuvenation, as described below.
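The threshold test on the extrapolated curve can be sketched as a scan over the notification horizon. This is a simplified illustration, not the agent's code: `first_crossing`, its parameters, and the linear `predict` function are hypothetical, while the 500 I-node lower limit and 3600-second horizon echo the accelerated Linux test described later in the paper.

```python
def first_crossing(predict, start, horizon, step, lower=None, upper=None):
    """Return the first sampled time in (start, start + horizon] at which the
    extrapolated value violates a limit, or None if no limit is reached."""
    steps = int(horizon // step)
    for i in range(1, steps + 1):
        t = start + i * step
        y = predict(t)
        if upper is not None and y >= upper:
            return t
        if lower is not None and y <= lower:
            return t
    return None


def predict(t):
    """Hypothetical selected model: available I-nodes falling by 2 per second."""
    return 4000 - 2 * t


# Current time 1000 s, horizon 3600 s, lower limit 500 I-nodes.
crossing = first_crossing(predict, start=1000, horizon=3600, step=50, lower=500)
print(crossing)
```

A non-None result corresponds to the condition under which a rejuvenation request would be sent to the management infrastructure.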

Details of the curve fitting and model selection are given in Appendix A.

Rejuvenation granularity

When an exhaustion has been predicted, the question arises as to what portion of the environment should be rejuvenated. The simple approach to rejuvenation is to perform a node reboot. A single computer node is a distinct and compartmentalized unit of the user’s resource environment, and the cluster-management framework within which we are operating should handle such node failures quite adequately. The drawback is that an entire node rejuvenation may be considered too broad in scope, and the time to reboot may be significant. Perhaps only one application on that system, only one of that application’s services, or perhaps even a particular subprocess of that service is the cause of the pending exhaustion. The narrower the scope of rejuvenation, the smaller the disruption will be to the user’s business objective. From a functional point of view, the operating system provides various APIs to perform rejuvenation at all of these various levels. Whether the resource to be rejuvenated is at the system level, the application level, or even the service level, the planned downtime (i.e., the rejuvenation) is likely to be less disruptive than an unplanned node outage.

Because of these considerations, the Windows SRA offers two levels of rejuvenation, depending on the scope of the resource exhaustion (i.e., whether an operating-system-level exhaustion or an application-level exhaustion has been identified). The desired level can be selected using an advanced submenu of the user interface described above.

● Level 1 is a service-level rejuvenation. Generally, it can be assumed that applications written as a service are written such that a stoppage of that service will save any necessary data (both user and application) and the corresponding restart of that service will bring the application to a state of usability. In such cases, a graceful rejuvenation of that service can be performed.

● Level 2 is an operating-system-level rejuvenation (i.e., a reboot). The reboot performs a stop for each service on that system and then reboots the operating system. Application failover and recovery in this case are the responsibility of the cluster-management software, which is activated as part of the reboot process.

4. Modeling and analysis

We have developed several availability models to assess the impact of time-based and predictive rejuvenation policies on cluster system availability. These models were developed using stochastic reward nets (SRNs) [28]. SRNs have recently gained much importance and recognition as a useful modeling formalism. They have been used successfully in many applications, since they can easily represent the concurrency, synchronization, sequencing, and multiple resource possession that are characteristics of current computer systems. SRNs also offer a high-level interface for the specification of Markov models. When the state space of Markov models is large, SRNs can be used to generate the underlying Markov chain automatically. SRNs are obtained by specifying reward rates at the net level. We have used SRNs for our modeling and analysis; an informal description of their concepts is given in [29].

System characteristics

We consider an n-node cluster running cluster-management software. Each node has redundant network connections, and the entire cluster contains a redundant storage subsystem. Both hardware and software failures may occur in the cluster, but in this paper we consider only software failures. Failures due to applications, middleware, and operating system are not differentiated, and node failures are assumed to be independent of one another. We also allow common-mode failures, which model the failure of the cluster-management software, and nonunity failure coverage for a node failure. All failure and repair times are assumed to be exponentially distributed. However, the time-based policies use rejuvenation intervals that are deterministic, and these are approximated using an Erlang distribution. Details of these models are given in Appendix B; only the results are presented here.
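To see why an Erlang distribution is a reasonable stand-in for a deterministic interval, recall that a sum of k independent exponential stages, each with rate k/d, has mean d and variance d²/k, so its coefficient of variation 1/√k shrinks as stages are added. The sketch below computes these moments; the stage counts are arbitrary illustrative values, since the number of stages used in the models is not stated in this section.

```python
import math


def erlang_moments(interval, stages):
    """Mean, variance, and coefficient of variation of an Erlang distribution
    with `stages` exponential stages whose total mean equals `interval`."""
    rate = stages / interval          # rate of each exponential stage
    mean = stages / rate              # equals `interval`
    variance = stages / rate ** 2     # equals interval**2 / stages
    return mean, variance, math.sqrt(variance) / mean


for k in (1, 4, 16, 64):              # hypothetical stage counts
    mean, variance, cv = erlang_moments(100.0, k)
    print(k, round(mean, 6), round(cv, 6))
```

More stages give a closer approximation to a fixed interval, at the price of a larger underlying Markov chain.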


Analysis and results

Parameter values such as failure and repair rates used for the models are shown in Table 1 and explained in Appendix B. Although they fall within the range of existing systems, the values used for this paper should be considered for illustration purposes only. To be able to detect clearly the effect of the rejuvenation, we set the value of the common-mode failure rate (transition \(T_{cmode}\)) very close to 0 and the value of the node failure coverage (\(c_1\)) very close to 1. Future analyses will explore the effect of rejuvenation policies on availability when these parameters are nonideal. All of the models were solved using the stochastic Petri net package (SPNP) tool [30, 31]. Our SRN model computes the expected unavailability and the expected cost incurred per unit of time. The expected unavailability is computed as the probability that the maximum tolerable number of nodes are inoperative, either because all have failed or because some are undergoing rejuvenation while others fail in the meantime. Cost can be incurred because of a node failure, a node rejuvenation, or a system failure (all nodes down). For each case, the cost per unit of time is multiplied by the expected number of tokens in the corresponding places, and the total cost is computed. It is assumed that the cost of rejuvenation is much less than the cost of a node or system failure, since rejuvenation can be done at predetermined or scheduled times (for example, when the system load is low).

We define \(p_{unav}\) and \(c_{fail}\) to be the steady-state expected unavailability and the expected cost incurred per unit of time, respectively. Over a given time interval \(T\), the expected downtime can be computed as \(T \times p_{unav}\) and the expected cost incurred during that interval as \(T \times c_{fail}\). For concreteness, we rather arbitrarily fix the value of \(T\) at 1000 hours. By using these measures, optimal rejuvenation intervals can be obtained for different policies and configurations. The models also help us compare one rejuvenation policy with another. For example, the rejuvenation interval that minimizes unavailability may differ from the one that minimizes cost. Therefore, there is a tradeoff involved, and it is up to the user/operator to decide which measure is more important. One could also optimize the rejuvenation interval based on other performability (combined performance and availability) measures, such as the mean number of nodes operational at a given instant.

All of the costs are formulated as costs per unit of time. In our analysis, we fix the value for \(cost_{nodefail}\) at $5000/hr and vary the ratio \(cost_{sysfail}/cost_{rejuv}\). The value of \(cost_{sysfail}\) is computed as the number of nodes, \(n\), times \(cost_{nodefail}\) (although a total system failure can potentially cost much more than a node failure). Unless mentioned otherwise, we fixed the \(cost_{sysfail}/cost_{rejuv}\) ratio to be 20.

Finally, we analyze the rejuvenation policies for various cluster configurations. A cluster configuration is denoted by \(n/m\), which means that the cluster has a total of \(n\) nodes and can tolerate at most \(m\) node failures; i.e., the cluster is considered unavailable if more than \(m\) nodes are unavailable.

Results for time-based rejuvenation

Figure 8 shows the plots for an 8/1 cluster (i.e., an eight-node cluster that can tolerate at most one node failure) employing simple time-based rejuvenation. The upper and lower plots show the expected cost incurred and the expected downtime (in hours), respectively, in a given time interval, versus rejuvenation interval in hours. When evaluating rejuvenation, we typically see J-shaped curves, with a region of high unavailability when the rejuvenation interval is very small (the system is always rejuvenating, so any other node failure while rejuvenating results in a system outage) and another region of high unavailability when the rejuvenation interval is very large (the system essentially never rejuvenates, and the effects of aging have their full impact). Between these extrema lies a rejuvenation interval that results in the minimum downtime and cost. For the 8/1 configuration, the minimum for the cost occurs at 100 hours, and the minimum for the downtime occurs at 180 hours. Thus, if one wishes to invoke timed rejuvenation, a compromise has to be made regarding the rejuvenation interval based on whether the expected downtime or the expected cost incurred is more important.

Figure 8 Simple time-based rejuvenation for 8/1 configuration. This plot shows the expected cost and downtime incurred, over a period of 1000 hours, for an eight-node cluster that can tolerate at most one simultaneous node failure. [Upper panel: expected cost (\(\times 10^{-4}\)) versus rejuvenation interval (hr). Lower panel: expected downtime versus rejuvenation interval (hr).]

Table 1 Parameter values.

Transition       Rate
\(\lambda_1\)    1/240 /hr
\(\lambda_2\)    1/720 /hr
\(\lambda_3\)    1/30 /min
\(\lambda_4\)    1/4 /hr
\(\lambda_5\)    1/10 /min

Figure 9 shows similar plots for an 8/2 configuration. In this case, a shallow minimum for the cost occurs at 80 hours, and a sharp minimum for the downtime occurs at one hour. Here we take advantage of the fact that the system can tolerate up to two node failures at the same time, and therefore we can afford to rejuvenate more frequently. Hence, the optimal cost and downtime are also substantially reduced.

A significant figure of merit for a rejuvenated system is the unavailability improvement due to rejuvenation, which we define to be the ratio between the expected steady-state unavailability with rejuvenation and the unavailability without rejuvenation. A plot of unavailability improvement versus rejuvenation interval is shown for different configurations in Figure 10. In general, there is a much larger improvement in unavailability for the X/2 configurations (that is, X-node clusters that can tolerate two failures before becoming unavailable) than that for X/1 configurations. For the 8/2 and 16/2 configurations, the optimal rejuvenation interval occurs very close to zero. But as the rejuvenation interval increases, they deteriorate rapidly, and the improvement becomes less than that for the X/1 configuration.

Next, we studied the effect of the \(cost_{sysfail}/cost_{rejuv}\) ratio on the rejuvenation interval. This was done for the 8/1 configuration and is shown in Figure 11. As the cost ratio is increased (i.e., rejuvenation becomes cheaper and cheaper relative to a system outage), the optimum value of the rejuvenation interval \(\delta\) decreases and the overall expected cost also decreases. As \(\delta \to \infty\), there is no rejuvenation, so the total cost ceases to be a function of the rejuvenation cost. In this case, all of the different cases approach the same value.

Figure 9 Simple time-based rejuvenation for 8/2 configuration. [Upper panel: expected cost (\(\times 10^{-5}\)) versus rejuvenation interval (hr). Lower panel: expected downtime (\(\times 10^{3}\)) versus rejuvenation interval (hr).]

Figure 10 Unavailability improvement for various configurations employing simple time-based rejuvenation. [Axes: UA with rejuv / UA without rejuv versus rejuvenation interval (hr). Curves: 2/1, 8/1, 8/2, 16/1, 16/2.]

Figure 11 Effect of node failure/node rejuvenation cost ratio for 8/1 configuration employing simple time-based rejuvenation. [Axes: expected cost (\(\times 10^{-4}\)) versus rejuvenation interval (hr). Curves: cost ratio 5, 10, 20, 50, 100.]


Results for prediction-based rejuvenation

In our model of prediction-based rejuvenation, a node is rejuvenated as soon as it enters a state in which it has a high failure rate, instead of after a specified inter-rejuvenation time interval. To model imperfect prediction, our analysis includes a “prediction coverage” factor, which is the probability that the failure predictor successfully detects that a node is likely to fail, given that it has just entered the high-failure-rate state. (We do not model the case in which the failure predictor emits a false alarm and triggers an unneeded rejuvenation, although this is straightforward. We think that this phenomenon has a minor effect on unavailability for reasonable false-alarm probabilities.) The plot in Figure 12 shows the transient values of expected downtime over a 5000-hour operational interval, for the 8/1 configuration. The steady-state values, achieved after approximately 2000 hours of operation, are found toward the right of the chart. The graphs show the values for different values of prediction coverage. As we would expect, the higher the prediction coverage, the lower the expected downtime. At 100% prediction coverage, the data lies along the x-axis and is practically invisible. We compare these steady-state values with the time-based policy in the following section.

Summary of analytical results

The data in Table 2 shows the ratio of downtime with rejuvenation to downtime without rejuvenation, for time-based and predictive rejuvenation. For time-based rejuvenation, a rejuvenation interval is used which minimizes downtime, which was approximately 100 hours for the one-spare clusters, and approximately one hour for the two-spare clusters. For predictive rejuvenation, it is assumed that aging can be successfully predicted 90% of the time. If we assume that aging can be predicted 100% of the time, downtime drops to practically zero, but we do not feel that perfect predictive coverage is achievable in practice.

The analysis results lead to some interesting conclusions. First, for systems that have one spare, and regardless of the number of nodes in the cluster, time-based rejuvenation can reduce downtime by about 25% compared with no rejuvenation. Predictive rejuvenation does somewhat better, reducing downtime by about 60% compared with no rejuvenation. However, when the system can tolerate more than one failure at a time, downtime is reduced by 98% to 95% via frequent time-based rejuvenation, compared to a mere 85% for predictive rejuvenation. We believe that this is because, with high-frequency time-based rejuvenation, a node spends very little time in the “failure-prone” state before it is rejuvenated back to the low-failure-rate state. For predictive rejuvenation with only 90% coverage, a node whose aging has gone undetected remains in the “failure-prone” state until it actually fails; as we currently model it, there is no second chance to be rejuvenated. Therefore, there is a higher probability that one node can have failed (or is being rejuvenated), a second is being rejuvenated (or has failed), and a third node fails, having escaped aging detection. We point out that, at the high levels of availability predicted by our model for the two-spare clusters, it is likely that nonideal effects such as single points of hardware and software failure and nonunity failover coverage would dominate the effects of rejuvenation we are predicting here. However, this is probably not the case for the one-spare configurations.
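The percentage reductions quoted above follow directly from the downtime ratios in Table 2: a ratio r corresponds to a reduction of (1 − r) × 100%. The short sketch below performs that conversion on the table's values; the dictionary layout and function name are just conveniences for this example.

```python
# Downtime ratios (with rejuvenation / without rejuvenation) from Table 2.
TABLE_2 = {
    "time-based": {"2/1": 0.74, "8/1": 0.74, "16/1": 0.76, "8/2": 0.02, "16/2": 0.05},
    "prediction-based": {"2/1": 0.38, "8/1": 0.38, "16/1": 0.39, "8/2": 0.15, "16/2": 0.15},
}


def reduction_percent(ratio):
    """Downtime reduction, in whole percent, implied by a downtime ratio."""
    return round((1.0 - ratio) * 100)


for policy, ratios in TABLE_2.items():
    reductions = {config: reduction_percent(r) for config, r in ratios.items()}
    print(policy, reductions)
```

The one-spare time-based entries come out at 24% to 26% and the predictive entries at 61% to 62%, matching the rounded figures of about 25% and about 60% in the text.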

Figure 12 Prediction-based rejuvenation for 8/1 configuration for different prediction coverage values. [Axes: expected downtime (hr) versus time (hr). Curves: prediction coverage 0.7, 0.8, 0.9, 1.0.]

Table 2 Ratio of downtime with rejuvenation to downtime without rejuvenation.

Rejuvenation policy                           Configuration
                                              2/1    8/1    16/1   8/2    16/2
Time-based (optimal rejuvenation interval)    0.74   0.74   0.76   0.02   0.05
Prediction-based (90% prediction coverage)    0.38   0.38   0.39   0.15   0.15


5. Experimental results

Empirical measurement of resource exhaustion

In the process of developing and testing the SRA, it was necessary to confirm the existence of aging, understand which parameters were most likely to age, and learn how to measure and predict resource exhaustion for certain applications of commercial interest to IBM. For this reason, several important applications were put under high workload and their aging characteristics were measured. The chart in Figure 13 shows typical results, in which one such large-database application running on the Windows operating system consumed 100% of the committed bytes (e.g., page-file bytes plus physical memory) over a ten-hour interval, beginning at an initial utilization level of 45%. Other resources were consumed, but the committed bytes resource was finished off first. (In the interest of accelerating the test, the database was presented with a workload that was much higher than would be encountered in a well-run production environment. Anecdotal data indicates that in a real-world environment, exhaustion for this application requires more than a week.) For this particular application, once the committed bytes resource was consumed, the database was unable to proceed, and usually panicked or hung. Once the resource was consumed and the panic had occurred, a full system reboot and database reconstruction was required to bring this particular database back on line. An interesting result of our testing regime was the discovery that of the applications that did show aging, the same kinds of resources were usually consumed. This implies that the ability to predict or detect a relatively small set of failure signatures is probably sufficient to cover an adequately large set of applications, for practical purposes.

“Bad boys”

Large applications are difficult and time-consuming to set up and test. In the interest of accelerating the test process and fine-tuning and testing our algorithms, we also generated several applications that were prodigious consumers of disparate resources. Figure 14 shows how one such “bad boy” application, again running on Windows, consumes committed bytes, starting at 800 seconds into the experiment. (We have other bad boys that consume processes, threads, nonpaged and paged pool memory, semaphores, handles, etc.) For purposes of testing, the notification horizon was set to 7200 seconds, and the sampling rate and resource consumption rates were scaled appropriately. In practice, much larger time frames would be used. This particular figure also shows how the piecewise regression of the SRA predictive agent compensates for the startup transient resource consumption, and predicts the time to exhaustion for this scenario. The resource-exhaustion upper limit of about 400 MB is shown as a dashed line at the top of the plot. Figure 15 shows how another bad boy application that continually opens files but forgets to close them consumes I-nodes on the Linux operating system. Again, testing is accelerated, and the notification horizon is set to 3600 seconds; the lower limit is somewhat arbitrarily set to 500 I-nodes.

6. Conclusions and extensions

We have developed, analyzed, and implemented a framework for detection, prediction, and proactive management of software aging. This technology is applicable to a wide range of operating environments, and has been implemented in the xSeries Software Rejuvenation Agent. It has been commercially available on xSeries servers since the end of 2000. Our cost and availability models indicate that rejuvenation significantly improves cluster system availability and reduces downtime cost.

Figure 13 Committed bytes versus time (in hours) for a large database application. [Axes: committed bytes consumed (%) versus elapsed time (hr); the upper limit is marked.]

Figure 14 Committed byte consumption versus time (in seconds) for a leaky application in Windows, with exhaustion prediction. [Axes: committed byte consumption versus elapsed time (s). Legend: sampled data; fitted and predicted data; upper limit.]

We have considered several extensions and other applications of software failure prediction and state rejuvenation. This section outlines a few of our thoughts in this area.

IP dispatching

As part of a web-hosting framework, an IP-dispatching or load-balancing component is often present to provide scalability, availability, and load-balancing capabilities for TCP/IP applications. Early implementations consisted of a domain name server (DNS) that would translate host names into IP addresses for corresponding servers, so that IP requests are routed in a round-robin fashion to a pool of servers. This approach is often called round-robin DNS; an example configuration is shown in Figure 16. Note that IP dispatching can be considered a specialized form of clustering, such that one or more nodes execute the dispatching components and the remaining nodes execute web-serving applications.

More advanced approaches are now available, such as the IBM Secureway Network Dispatcher [32]. This product provides enhanced IP-level load-balancing mechanisms and content-based routing, as well as improved management and availability functions. Figure 17 illustrates the network dispatcher operations for a LAN implementation. As part of load balancing, the dispatcher’s scheduling policy is dynamically based on each server’s load and availability. This is partly accomplished by having each server send periodic utilization information to the dispatcher. This utilization information can easily be augmented by health information in the form of time until resource exhaustion, degree of resource exhaustion or, in its simplest form, time remaining until a timed rejuvenation. This health information can be sent to the dispatcher in order to schedule actions for individual servers. The scheduling of these actions can also take into account aggregate loading of the web host.
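One way a dispatcher might combine utilization reports with health information is sketched below. This is a hypothetical policy written for illustration only; it does not describe the Network Dispatcher's actual interface or scheduling algorithm, and all class, function, and field names are invented here.

```python
from dataclasses import dataclass


@dataclass
class ServerHealth:
    name: str
    load: float                  # reported utilization, 0.0 to 1.0
    time_to_exhaustion_s: float  # predicted seconds until resource exhaustion


def pick_server(servers, min_headroom_s):
    """Route to the least-loaded server that is not close to exhaustion."""
    healthy = [s for s in servers if s.time_to_exhaustion_s > min_headroom_s]
    if not healthy:
        return None
    return min(healthy, key=lambda s: s.load)


def rejuvenation_candidates(servers, min_headroom_s):
    """Names of servers that should be drained and scheduled for rejuvenation."""
    return [s.name for s in servers if s.time_to_exhaustion_s <= min_headroom_s]


pool = [
    ServerHealth("server1", load=0.20, time_to_exhaustion_s=600.0),
    ServerHealth("server2", load=0.55, time_to_exhaustion_s=86400.0),
    ServerHealth("server3", load=0.40, time_to_exhaustion_s=43200.0),
]
print(pick_server(pool, min_headroom_s=3600.0).name)
print(rejuvenation_candidates(pool, min_headroom_s=3600.0))
```

In this sketch the lightly loaded but nearly exhausted server is bypassed for new requests and flagged for rejuvenation, which is the kind of decision the health information is meant to enable.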

Figure 15 I-node consumption versus time (in seconds) for a leaky application in Linux, with exhaustion prediction. [Axes: available I-nodes versus elapsed time (s). Legend: sampled data; fitted and predicted data; lower limit.]

Figure 16 Example configuration of round-robin DNS. [Diagram elements: user, client, DNS, intermediate name servers, and Servers 1 through N.]

Step 1: Address request (URL)
Step 1': Address request reaches DNS
Step 2: (Web-server IP address, TTL) selection
Step 3: Address mapping (URL => address 1)
Step 3': Address mapping (URL => address 1)
Step 4: Document request (address 1)
Step 5: Document response (address 1)

Figure 17 Network dispatcher for a LAN implementation. [Diagram elements: user, client, LAN network dispatcher (IP-SVA), and Servers 1 through N.]

Step 1: Document request (IP-SVA)
Step 2: Web-server selection
Step 3: Packet forwarding
Step 4: Document response (IP-SVA)


Solution center

One of the core pieces of the software rejuvenation technology is a monitoring tool that has the ability to detect and identify misbehaving software components. In addition to this technology’s application in production clustering applications, it also has use as a test tool to help make software more reliable. This type of tool could be useful in a system integration facility such as an IBM solution center, where software of disparate quality is integrated onto a common platform. (We have first-hand experience of this. The ability of the agent to detect and pinpoint resource consumption was useful in developing and testing the agent itself, when some of the system calls used to monitor resource consumption caused resource leaks themselves.)

Discrimination between hardware and software faults

When a system crashes, it can be difficult to determine whether the crash was due to hardware or software, especially if a subsequent system reboot is successful. Frequently, hardware is identified as the culprit when it is associated with a certain number of outages, perhaps because the service technician has to do something that is perceived as solving the customer’s problem. Consequently, nonfaulty hardware is often returned to IBM under a service contract, at a significant cost to the corporation. We think that a variant of the software-monitoring agent we have developed can provide valuable clues to the user or technician as to whether the crash was due to a software problem it is capable of detecting, possibly reducing the number of no-fault-found hardware returns and IBM service costs. It could also be expanded to monitor hardware errors to improve its diagnostic resolution, since it is fairly well known that permanent hardware failures are often preceded by an increasing rate of occurrence of transient hardware failures.

Adaptive and multiparameter predictive capabilities

The current version of SRA was developed on the basis of a preconceived notion of the types of resources that can be exhausted and their exhaustion thresholds, for a given set of operating systems. During development, this hypothesis was supported by a testing and validation effort for several applications of importance to IBM, and was deemed valid. Consequently, the SRA is capable of monitoring and predicting the exhaustion of any single parameter that is on a fairly static initialization list, using a flexible curve-fitting methodology. We think this capability is adequate for a sufficiently large number of applications to make the product useful. However, it is not known in general which parameters may be exhausted or enter critical regions in all scenarios, nor what their exhaustion thresholds are. An outage may occur only when a combination of parameters reaches critical regions, or it may be the case that a given parameter or combination of parameters does not have to be at extrema to constitute a hazardous situation. For example, we noticed during our testing of a web-serving application that just prior to the outage, the committed bytes, nonpaged-pool bytes, and available memory approached known exhaustion limits, as expected. However, other parameters also repeatedly exhibited unusual and indicative behavior just prior to the outage: For example, the variance of the paging rate and the number of nonpaged-pool allocations skyrocketed, although these do not seem to be outage precursors on their own. Therefore, we think that a truly general outage predictor must have the capability to characterize possibly complex multiparametric conditions prevalent just prior to outages that occur in a given application, store this characterization in an analytically tractable form, and develop the ability to predict a system’s subsequent approach to that region in multiparameter space.

Availability modeling and analysis

As part of the future work in modeling and analysis, we could introduce new cluster-system characteristics and failure behavior and improve the models. We could analyze these models for many more different configurations than have been explored in this paper. Other new and interesting performability measures could be introduced for the analyses. To obtain more accurate results using the SRN models, we could use the theory of Markov regenerative processes (MRGP). Another possible solution method for general non-Markovian models is discrete-event simulation. The models could be improved to consider hardware components and failures and discuss their impact on system availability and performance. New rejuvenation policies could be formulated based not just on time, but also on the load of the system (instantaneous or cumulative) [24].

Appendix A: Exhaustion-prediction algorithms

This appendix is devoted to the description of the extrapolation and model-selection algorithms for predictive rejuvenation. The steps of the prediction procedure are as follows:

● Preprocessing the data.
● Fitting several models to the preprocessed data.
● Selecting the best model.
● Forecasting the data behavior with the selected model.

Let \(X_1, \ldots, X_m\) denote the observations in the training (fitting) window, and let \(T_1, \ldots, T_m\) be the corresponding sampling times. Using a median filter, the data is preprocessed to produce \(n\) medians \(Y_1, \ldots, Y_n\), which we associate with “sampling times” \(t_1, \ldots, t_n\). In particular, let the median filter operate on \(k\) samples (say \(k = 5\)); then, assuming \(m = n \times k\), \(Y_1 = \mathrm{median}(X_1, \ldots, X_k)\), \(Y_2 = \mathrm{median}(X_{k+1}, \ldots, X_{2k})\), etc., and the sampling times of the medians are defined as \(t_1 = \mathrm{median}(T_1, \ldots, T_k)\), \(t_2 = \mathrm{median}(T_{k+1}, \ldots, T_{2k})\), etc. Median filtering produces a smoothed version of the signal and is quite robust with respect to the presence of spikes in the time series.
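The block-median preprocessing step can be sketched in a few lines. The function name is hypothetical, and dropping any trailing samples that do not fill a complete block is an assumption of this sketch (the text simply assumes the number of samples is an exact multiple of k).

```python
from statistics import median


def median_filter(samples, times, k=5):
    """Reduce m samples to n = m // k block medians and their median times."""
    n = len(samples) // k  # trailing samples beyond n * k are ignored
    ys = [median(samples[i * k:(i + 1) * k]) for i in range(n)]
    ts = [median(times[i * k:(i + 1) * k]) for i in range(n)]
    return ys, ts


# Ten samples with one spike (100) that the filter suppresses.
X = [1, 2, 100, 3, 4, 5, 6, 7, 8, 9]
T = list(range(10))
Y, t = median_filter(X, T, k=5)
print(Y, t)
```

The spike in the first block has no effect on the first median, which illustrates the robustness property noted above.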

Our goal is to select a parsimonious model that adequately describes the data. To that end, we consider a relatively small number of model classes, pick the best model from each class (i.e., fit the model to the data), and then, from among those, select the “best” overall model.

To pick the best model from a parametric class \(M\) (for example, \(M\) could be the set of linear models \(Y_i = a t_i + b + \epsilon_i\), where the parameters are \(a\) and \(b\)), we select the parameter values that optimize a goodness-of-fit criterion. In particular, we minimize the residual sum of squares

\[ RSS_M = \sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2, \]

where \(\hat{Y}_i\) is the value produced by the model at time \(t_i\). Minimizing the residual sum of squares for the classes of models used in the system is computationally efficient.

To ensure a rich enough selection, we use the following classes of fitting models:

● Simple linear regression: $Y_k = a t_k + b + \varepsilon_k$. The number of parameters is $p = 2$.

● Linear regression with $h$ breakpoints (where we use $h = 1$, as in Figure 18, or $h = 2$, as in Figure 19): Given a set of $h + 2$ time instants $t_1 = c_0, c_1, \ldots, c_h, c_{h+1} = t_m$ (i.e., the $h$ breakpoints and the endpoints of the analysis window), $Y_k = a_j t_k + b_j + \varepsilon_k$ if $c_{j-1} \le t_k \le c_j$, where we constrain the coefficients $a_j$ and $b_j$ so that the overall fitted piecewise-linear curve is continuous. Given the time instants $\{c_i\}$, the fitting problem can be formulated as a linear regression and is easily solved. The problem, however, is not convex in the selection of the breakpoints $c_1, \ldots, c_h$. To keep the model-fitting problem computationally efficient, we constrain the times of the breakpoints to occur at a small number $B$ of possible locations (say, $B = 10$) through the fitting window. To avoid overfitting of the data when a spike or a jump occurs right at the end of the fitting window, making the extrapolation unreliable, we also do not allow breakpoints near the end of the window. The selection algorithm operates by fitting a model to the data for each valid selection of breakpoints, and choosing the model with the minimum RSS. Note that if $h = 1$, the number of parameters $p$ equals 4 (for a given breakpoint, the piecewise-linear curve is defined by three points, so add one more parameter for the location of the breakpoint). Similarly, if $h = 2$, $p = 6$.

● Linear regression on the logarithm of the data: $\log(Y_k) = a t_k + b + \varepsilon_k$. The number of parameters is $p = 2$.

Figure 19
Result of fitting the smoothed data (solid curves connecting the plotted marks) with a piecewise linear regression with two hinge points. The vertical lines denote the limits of the fitting window. The regression fit, within the fitting window, is denoted by a solid curve, and the prediction is its dash–dot right continuation. The original unsmoothed data is also shown as a dash–dot curve. [Plot: “Memory, noname, pool nonpaged bytes” (Column 22). Legend: Signal, End estimation, Start estimation, Smoothed signal, Three-segment regression, Predicted.]

Figure 18
Result of fitting the smoothed data (solid curves connecting the plotted marks) with a piecewise linear regression with a single split. The vertical lines denote the limits of the fitting window. The regression fit, within the fitting window, is denoted by a solid curve, and the prediction is its dash–dot right continuation. The original unsmoothed data is also shown as a dash–dot curve. [Plot: “Memory, noname, committed bytes” (Column 12). Legend: Signal, End estimation, Start estimation, Smoothed signal, Hinge regression, Predicted.]


● Piecewise-linear regression with $h$ breakpoints on the logarithm of the data: $\log(Y_k) = a_j t_k + b_j + \varepsilon_k$ if $c_{j-1} \le t_k \le c_j$. If $h = 1$, as in Figure 20, $p = 4$, and if $h = 2$, $p = 6$.
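The breakpoint models in the list above can be sketched as follows. This is a minimal illustration, not the implementation used in the product: it assumes a hinge basis (1, t, and max(0, t − c) per breakpoint), which makes the fitted curve continuous by construction, and it assumes evenly spaced candidate locations with the final 20% of the window excluded. The names `solve`, `fit_hinge`, `best_breakpoints`, and the `tail` parameter are hypothetical. The log variants would apply the same code to the logarithm of the data.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_hinge(t, y, breaks):
    """Least-squares fit of a continuous piecewise-linear curve.

    Basis: 1, t, and max(0, t - c) for each breakpoint c.
    Returns (coefficients, rss).
    """
    rows = [[1.0, ti] + [max(0.0, ti - c) for c in breaks] for ti in t]
    p = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(AtA, Aty)  # normal equations
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(rows, y))
    return coef, rss

def best_breakpoints(t, y, h, B=10, tail=0.2):
    """Try every choice of h breakpoints among at most B candidates."""
    lo, hi = t[0], t[-1]
    span = hi - lo
    cands = [lo + span * (i + 1) / (B + 1) for i in range(B)]
    # No breakpoints in the last `tail` fraction of the window.
    cands = [c for c in cands if c <= hi - tail * span]
    best = None
    for breaks in combinations(cands, h):
        coef, rss = fit_hinge(t, y, breaks)
        if best is None or rss < best[2]:
            best = (breaks, coef, rss)
    return best

# Hypothetical signal: flat at 100, then growing with slope 2 after t = 20.
t = [float(i) for i in range(45)]
y = [100.0 + 2.0 * max(0.0, ti - 20.0) for ti in t]
breaks, coef, rss = best_breakpoints(t, y, h=1)
print(breaks)  # (20.0,)
```

Solving the normal equations directly is adequate for a handful of parameters; a production implementation would more likely use a numerically stabler least-squares routine.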

In choosing the “best” overall model, we wish to avoid overfitting the data. In classical statistics, model selection methods, such as Akaike’s information criterion (AIC) [33] or Mallows’ $C_p$ index [34], combine a goodness-of-fit measure with a penalty that grows with the order (number of parameters) of the model. Mallows’ $C_p$ method is particularly simple to apply. Compute

$$C_p = \frac{\mathrm{RSS}_M}{\hat{\sigma}^2} - (m - 2p),$$

where $p$ is the number of parameters in the model, $\hat{\sigma}^2$ is the residual mean-square error from the largest model (which is assumed to be a good estimate for $\mathrm{Var}[\varepsilon_i]$), and $m$ is the number of samples in the smoothed signal. The model with the smallest value of $C_p$ is chosen.
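As a worked illustration of the criterion, the sketch below scores three candidate fits. The RSS values are invented for the example, and the residual mean square of the largest model is taken to be RSS/(m − p), which is the usual definition but is an assumption here, since the text does not spell it out. The names `mallows_cp` and `select_by_cp` are hypothetical.

```python
def mallows_cp(rss, p, sigma2, m):
    """Mallows' C_p = RSS / sigma^2 - (m - 2p)."""
    return rss / sigma2 - (m - 2 * p)

def select_by_cp(candidates, m):
    """candidates maps a model name to (rss, p).

    sigma^2 is estimated from the model with the most parameters
    (if several tie, the first one encountered is used).
    """
    rss_big, p_big = max(candidates.values(), key=lambda rp: rp[1])
    sigma2 = rss_big / (m - p_big)  # assumed: residual mean square
    scores = {name: mallows_cp(rss, p, sigma2, m)
              for name, (rss, p) in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical fits on m = 40 smoothed samples.
candidates = {
    "linear": (120.0, 2),
    "one breakpoint": (35.0, 4),
    "two breakpoints": (34.0, 6),
}
chosen, scores = select_by_cp(candidates, m=40)
print(chosen, scores)
# one breakpoint {'linear': 84.0, 'one breakpoint': 3.0, 'two breakpoints': 6.0}
```

In this example the second breakpoint lowers the RSS only from 35 to 34, which is not enough to offset the penalty for two extra parameters, so the one-breakpoint model is selected.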

In addition to Mallows’ $C_p$, we use a similar heuristic approach, which is less dependent on distributional assumptions and has several characteristics tailored to the situation at hand. The idea is that, while models are trained on the entire data in the window, the selection criterion should give greater weight to how well each model fits the most recent data. We therefore define the weighted residual sum of squares as follows:

$$\mathrm{WRSS}_M = \sum_{i=1}^{m} w_i (Y_i - \hat{Y}_i)^2,$$

where $w_i$ is the weight assigned to the fit of sample $i$; for example, $w_i = \rho^{m-i}$ gives decreasing weight to observations further in the past. If $\rho = 0.95$, the error in fitting the most recent observation is roughly three times more important than the error in fitting the 20th most recent observation. Computing the appropriate exact modification to the $C_p$ index is difficult in this situation, so we adopt an intuitive heuristic which does not consider a class of models with an additional parameter unless it reduces the $\mathrm{WRSS}_M$ by at least a certain percentage. Specifically, if $p_0$ is the minimum number of parameters used (with linear regression, two parameters are estimated, hence $p_0 = 2$) and if $\mathrm{WRSS}_M(p)$ is the weighted residual sum of squares, the selection criterion picks the model with the minimum $W_p$, where

$$W_p = \mathrm{WRSS}_M(p) \times (1 + \delta)^{p - p_0}.$$

We heuristically select $\delta$ to be 0.10. Then each added model parameter must decrease $\mathrm{WRSS}_M(p)$ by roughly 10% (more precisely, by a factor of at least $1/(1 + \delta) \approx 0.91$) for the larger model to be preferred.
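The weighted criterion can be sketched as below, using the values 0.95 for the geometric weight and 0.10 for the penalty quoted in the text. The function names and the WRSS figures in the example are hypothetical.

```python
def weighted_rss(y, y_hat, rho=0.95):
    """Weighted RSS with weight rho**(m - i) on sample i = 1..m.

    Python indices run 0..m-1, so the exponent is written m - 1 - idx.
    """
    m = len(y)
    return sum(rho ** (m - 1 - idx) * (yi - fi) ** 2
               for idx, (yi, fi) in enumerate(zip(y, y_hat)))

def penalized_score(wrss, p, p0=2, delta=0.10):
    """W_p = WRSS(p) * (1 + delta) ** (p - p0)."""
    return wrss * (1 + delta) ** (p - p0)

# Relative importance of the most recent sample versus the 20th most recent.
ratio = 1.0 / 0.95 ** 19
print(round(ratio, 2))  # 2.65, i.e. "roughly three times"

# Hypothetical WRSS values: a one-breakpoint model (p = 4) must beat the
# linear model (p = 2) by more than the two-parameter penalty of 1.1**2.
w_linear = penalized_score(100.0, p=2)   # 100.0
w_small_gain = penalized_score(92.0, p=4)  # about 111.3, linear is kept
w_large_gain = penalized_score(80.0, p=4)  # about 96.8, breakpoint model wins
print(w_linear, round(w_small_gain, 2), round(w_large_gain, 2))
```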

Thus, in each window, six possible model families are considered (linear, linear with one breakpoint, linear with two breakpoints, log linear, log linear with one breakpoint, and log linear with two breakpoints), and either the model with minimum $C_p$ or that with minimum $W_p$ is chosen. Experiments show that when no breakpoints are allowed in the most recent part of the fitting window, the $C_p$ and $W_p$ criteria usually yield the same results. Only rarely are different models selected: In particular, the criteria tend to disagree when no trend is clearly discernible, and the data looks very noisy (Figure 21).

Figure 20
Result of fitting the logarithm of the smoothed data (solid curves connecting the plotted marks) with a piecewise linear regression with a single hinge point. The vertical lines denote the limits of the fitting window. The regression fit, within the fitting window, is denoted by a solid curve, and the prediction is its dash–dot right continuation. The original unsmoothed data is also shown as a dash–dot curve. [Plot: “Memory, noname, available bytes” (Column 14). Legend: Signal, End estimation, Start estimation, Smoothed signal, Hinge regression on log, Predicted.]

Figure 21
Instance in which the selection criteria pick different models. The criterion based on Mallows statistics selects a simple linear regression, while the criterion based on multiplicative penalty ($1 + \delta$ per degree of freedom) fits piecewise regression with two breakpoints to the log of the data. [Plot: “Memory, noname, page input/page reads” (Column 18). Legend: Signal, End estimation, Start estimation, Smoothed signal; Mallows: simple linear regression, Predicted; Penalty: log piecewise linear regression, Predicted.]

Prediction of the future behavior of the data is then performed by extrapolating the selected model. Note that each of the six model classes can be cast as a linear regression. Therefore, it is also possible to project approximate confidence limits out into the future. These intervals have the following desirable properties:

● They get wider the further into the future the projection extends.
● Noisier data results in wider intervals.
● The intervals for the piecewise-linear or piecewise-log-linear models become wider the closer the last breakpoint is to the present.
● The predicted curve is monotone into the future.
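The first two properties in the list above can be illustrated for the simple linear model with the textbook prediction-interval formula. This sketch is an assumption about how such limits might be computed, not the procedure used in the product: it uses a normal quantile in place of the Student t quantile, and the name `prediction_band` and the data are hypothetical. The breakpoint-related property is not demonstrated here.

```python
from math import sqrt
from statistics import NormalDist

def prediction_band(t, y, t_future, level=0.95):
    """Approximate prediction limits for a straight-line fit.

    Returns a list of (time, predicted value, half-width). A normal
    quantile is used instead of Student's t (a simplifying assumption).
    """
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    a = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / sxx
    b = y_bar - a * t_bar
    rss = sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, y))
    s = sqrt(rss / (n - 2))  # residual standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)
    band = []
    for tf in t_future:
        half = z * s * sqrt(1 + 1 / n + (tf - t_bar) ** 2 / sxx)
        band.append((tf, a * tf + b, half))
    return band

# Hypothetical data: a trend of slope 2 with alternating noise of +/-1 or +/-3.
t = [float(i) for i in range(10)]
sign = [1.0 if i % 2 == 0 else -1.0 for i in range(10)]
y_quiet = [2.0 * ti + e for ti, e in zip(t, sign)]
y_noisy = [2.0 * ti + 3.0 * e for ti, e in zip(t, sign)]

quiet = prediction_band(t, y_quiet, [10.0, 20.0, 30.0])
noisy = prediction_band(t, y_noisy, [10.0, 20.0, 30.0])
for (tf, pred, half), (_, _, half_noisy) in zip(quiet, noisy):
    print(tf, round(pred, 2), round(half, 2), round(half_noisy, 2))
```

The half-width grows with the distance of the forecast time from the center of the fitting window, and it scales with the residual standard error, which is larger for the noisier series.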

Roughly speaking, the computational requirement per monitored quantity is proportional to $m \times 6 \times B^2/2$, where $m$ is the number of medians, $B$ is the number of allowed breakpoint locations, and the factor 6 comes from the six model classes.

Appendix B: Analysis using stochastic reward nets
This appendix details our approach to calculating cluster system unavailability and cost using stochastic reward nets.

Analysis parameters
Table B1 shows the main parameter values that are used in the models. $\lambda_1$ is the rate at which a node transitions from a nonfailure-prone state to the failure-prone state, and $\lambda_2$ is the rate at which a node fails once it has entered the failure-prone state. $\lambda_3$ is the rate at which a node is repaired when it has suffered an unplanned node failure, $\lambda_4$ is the rate at which a system is repaired after it has suffered an unplanned system outage, and $\lambda_5$ is the rate at which a node is rejuvenated once the need to rejuvenate has been established.

Basic cluster system
Figure 22 shows the basic model of our cluster system. The cluster consists of $n$ nodes. Initially, all of the nodes are in a “robust” working state, indicated by tokens in place $P_{\mathrm{up}}$, in which the probability of node failure is zero. As time progresses, each node eventually transits to a “failure-probable” state (place $P_{\mathrm{fprob}}$) through the transition $T_{\mathrm{fprob}}$. The nodes are still operational in this state but can fail (transit to place $P_{\mathrm{nodefail1}}$ with a nonzero probability). If a node crashes, it can recover with a probability $c$ through the transition $T_{\mathrm{noderepair}}$. In this case, the node goes back to the place $P_{\mathrm{up}}$, which represents the clean state of the node. The node recovery can fail with a probability $(1 - c)$, leading to a system failure (all $n$ nodes are down). Place $P_{\mathrm{sysfail}}$ represents this system-level failure state.

Thus, the time to failure for the node starting in the robust state $P_{\mathrm{up}}$ has a hypoexponential distribution [35]. Since this is an increasing-failure-rate distribution, it models software aging. From a full system outage, the system can be repaired through the transition $T_{\mathrm{sysrepair}}$ and all the $n$ nodes return to place $P_{\mathrm{up}}$.

Many applications may require a minimum number of cluster nodes to be up for the service to be available. In these cases, the system is considered down when there are $a$ ($a \le n$) individual node failures. This is modeled by using the guard function $g_3$, which checks for this condition. In addition to individual node failures, there is also a common-mode failure, which takes place when transition $T_{\mathrm{cmode}}$ fires. The common-mode failure causes all nodes that are operational at that instant to be down. The transition representing this failure is enabled at all times the system is in the up state (guard function $g_2$). When there is a system crash (i.e., there is a token in place $P_{\mathrm{sysfail}}$), all of the remaining tokens in places $P_{\mathrm{up}}$, $P_{\mathrm{fprob}}$, and $P_{\mathrm{nodefail2}}$ are drained. This is accomplished by defining a guard function $g_1$ which enables the immediate transitions $T_{\mathrm{immd4}}$, $T_{\mathrm{immd5}}$, and $T_{\mathrm{immd6}}$ if there is a token in place $P_{\mathrm{sysfail}}$.

Figure 22
Basic cluster system. [Diagram: SRN with places $P_{\mathrm{up}}$, $P_{\mathrm{fprob}}$, $P_{\mathrm{nodefail1}}$, $P_{\mathrm{nodefail2}}$, and $P_{\mathrm{sysfail}}$; timed transitions $T_{\mathrm{fprob}}$, $T_{\mathrm{nodefail}}$, $T_{\mathrm{noderepair}}$, $T_{\mathrm{sysrepair}}$, and $T_{\mathrm{cmode}}$; immediate transitions $T_{\mathrm{immd1}}$ through $T_{\mathrm{immd6}}$; guard functions $g_1$, $g_2$, and $g_3$; branch probabilities $c$ and $(1 - c)$.]

Table B1   Parameter values.

Transition    Rate
$\lambda_1$   1/240 per hr
$\lambda_2$   1/720 per hr
$\lambda_3$   1/30 per min
$\lambda_4$   1/4 per hr
$\lambda_5$   1/10 per min
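The aging behavior of the two-stage node model can be checked numerically. The sketch below evaluates the hazard rate of a two-stage hypoexponential distribution with the rates 1/240 and 1/720 per hour listed in Table B1. It considers a single node in isolation and ignores common-mode failures and repair, which is a simplification made for the example; the function names are hypothetical.

```python
from math import exp

LAMBDA1 = 1 / 240  # per hour: robust -> failure-probable
LAMBDA2 = 1 / 720  # per hour: failure-probable -> failed

def survival(t, l1=LAMBDA1, l2=LAMBDA2):
    """P(time to failure > t) for the sum of Exp(l1) and Exp(l2), l1 != l2."""
    return (l1 * exp(-l2 * t) - l2 * exp(-l1 * t)) / (l1 - l2)

def density(t, l1=LAMBDA1, l2=LAMBDA2):
    """Probability density of the time to failure."""
    return l1 * l2 * (exp(-l2 * t) - exp(-l1 * t)) / (l1 - l2)

def hazard(t):
    """Instantaneous failure rate: density divided by survival."""
    return density(t) / survival(t)

mean_ttf = 1 / LAMBDA1 + 1 / LAMBDA2  # 240 + 720 = 960 hours
hazards = [hazard(t) for t in (0, 100, 500, 1000, 5000)]
print(round(mean_ttf, 6), [round(h, 6) for h in hazards])
```

The hazard starts at zero, because a node cannot fail before leaving the robust state, and rises toward the smaller of the two rates (1/720 per hour here). This increasing failure rate is what makes the distribution a model of aging.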

Simple time-based rejuvenation
The SRN model for a simple time-based rejuvenation policy for a cluster system is shown in Figure 23. In this policy, rejuvenation is done simultaneously for all of the operational nodes in the cluster, at periodic deterministic intervals. The rejuvenation interval is determined by using a clock. Initially, there is a token in the place $P_{\mathrm{clock}}$. The deterministic transition $T_{\mathrm{rejuvinterval}}$ fires every $d$ time units and deposits a token in place $P_{\mathrm{startrejuv}}$. Rejuvenation begins only when there is a token in this place. If there is a token in place $P_{\mathrm{startrejuv}}$ and there are nodes to be rejuvenated (i.e., there are tokens in places $P_{\mathrm{up}}$ or $P_{\mathrm{fprob}}$), immediate transitions $T_{\mathrm{immd8}}$ and $T_{\mathrm{immd9}}$ are respectively enabled. Only one node can be rejuvenated at a time, and nodes are allowed to fail when another node is being rejuvenated. To model this, only one token is transferred from places $P_{\mathrm{up}}$ and $P_{\mathrm{fprob}}$. The probability of selecting a token from them is directly proportional to the number of tokens in each. Weight functions $n_1$ (the number of tokens in $P_{\mathrm{up}}$) and $n_2$ (the number of tokens in $P_{\mathrm{fprob}}$) ensure this. Since only one node can be rejuvenated at a time, a token is deposited either in place $P_{\mathrm{rejuv1}}$ (from place $P_{\mathrm{up}}$) or in place $P_{\mathrm{rejuv2}}$ (from place $P_{\mathrm{fprob}}$). Inhibitor arcs from these places to transitions $T_{\mathrm{immd8}}$ and $T_{\mathrm{immd9}}$ prevent more than one token from coming in. Transitions $T_{\mathrm{rejuv1}}$ and $T_{\mathrm{rejuv2}}$ are the transitions for rejuvenation from places $P_{\mathrm{rejuv1}}$ and $P_{\mathrm{rejuv2}}$, respectively.

After a node has been rejuvenated, it goes back to the “robust” working state, represented by place $P_{\mathrm{rejuved}}$. This is a duplicate place for $P_{\mathrm{up}}$ in order to distinguish the nodes which are waiting to be rejuvenated from the nodes which have already been rejuvenated. A node, after rejuvenation, is then allowed to fail with the same rates as before rejuvenation, even when another node is being rejuvenated. Hence, duplicate temporary places for $P_{\mathrm{up}}$ and $P_{\mathrm{fprob}}$ are needed. The transitions $T_{\mathrm{fprobrejuv}}$ and $T_{\mathrm{nodefailrejuv}}$ have the same rates as transitions $T_{\mathrm{fprob}}$ and $T_{\mathrm{nodefail}}$. Node repair (transition $T_{\mathrm{noderepair}}$) is disabled during rejuvenation. Rejuvenation is complete when the sum of nodes in places $P_{\mathrm{rejuved}}$, $P_{\mathrm{fprobrejuv}}$, and $P_{\mathrm{nodefail2}}$ is equal to the total number of nodes, $n$, in the cluster system.

The immediate transition $T_{\mathrm{immd10}}$ is fired when rejuvenation is complete or when the entire system fails during rejuvenation. When the entire system fails during rejuvenation, tokens are drained from all places in which it is possible to have tokens. Transition $T_{\mathrm{immd10}}$ also fires when, just before rejuvenation starts, there are $a - 1$ tokens in place $P_{\mathrm{nodefail2}}$. Since $a$ individual node failures lead to a system crash, and rejuvenation is considered a “down” state, rejuvenation does not take place if there are $a - 1$ individual node failures in the system. In this case, the clock simply begins the countdown again. The guard function $g_6$ checks for these three conditions. The clock is disabled when the system is undergoing repair (when there is a token in place $P_{\mathrm{sysfail}}$).

Figure 23
Cluster system employing simple time-based rejuvenation. [Diagram: the SRN of Figure 22 extended with places $P_{\mathrm{clock}}$, $P_{\mathrm{startrejuv}}$, $P_{\mathrm{rejuv1}}$, $P_{\mathrm{rejuv2}}$, $P_{\mathrm{rejuved}}$, and $P_{\mathrm{fprobrejuv}}$; transitions $T_{\mathrm{rejuvinterval}}$, $T_{\mathrm{rejuv1}}$, $T_{\mathrm{rejuv2}}$, $T_{\mathrm{fprobrejuv}}$, and $T_{\mathrm{nodefailrejuv}}$; immediate transitions $T_{\mathrm{immd7}}$ through $T_{\mathrm{immd15}}$, with selection weights $n_1$ and $n_2$ on $T_{\mathrm{immd8}}$ and $T_{\mathrm{immd9}}$; guard functions $g_4$ through $g_7$.]

We have approximated the deterministic clock using an $r$-stage Erlang distribution [36]. If the rejuvenation interval is $d$ time units, each of the $r$ Erlang stages is exponentially distributed with mean $d/r$. The mean of this Erlang distribution is thus $d$.
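A short sketch shows why an Erlang distribution is a reasonable stand-in for a deterministic interval: its mean stays at d while its variance, d²/r, shrinks as the number of stages r grows. The interval of 100 time units and the choice of 10 stages are hypothetical values chosen for illustration, as is the function name.

```python
import random
from math import sqrt

def erlang_moments(d, r):
    """Mean, variance, and coefficient of variation of an r-stage Erlang
    distribution whose stages are exponential with mean d / r."""
    stage_mean = d / r
    mean = r * stage_mean
    variance = r * stage_mean ** 2   # equals d**2 / r
    return mean, variance, sqrt(variance) / mean

d, r = 100.0, 10  # hypothetical interval and stage count
mean, variance, cv = erlang_moments(d, r)
print(mean, variance, round(cv, 4))  # 100.0 1000.0 0.3162

# Simulation check: each clock period is a sum of r exponential stages.
random.seed(0)
periods = [sum(random.expovariate(r / d) for _ in range(r))
           for _ in range(20000)]
sample_mean = sum(periods) / len(periods)
sample_sd = sqrt(sum((x - sample_mean) ** 2 for x in periods)
                 / (len(periods) - 1))
print(round(sample_mean, 1), round(sample_sd, 1))
```

The coefficient of variation is 1/√r, so a larger r tracks the deterministic clock more closely at the cost of a larger state space in the underlying Markov chain.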

Prediction-based rejuvenation
The SRN model for a prediction-based rejuvenation policy for a cluster system is shown in Figure 24. The failure-repair characteristics of the system are the same as before. There is no clock in this case, and rejuvenation is attempted only when a node transits into the “degraded” state. In practice, this degraded state could be predicted in advance by means of analyses of observable system parameters as described above and in [11, 12]. Assume that prediction succeeds with probability $c_2$. If a prediction is successful, a token is deposited in place $P_{\mathrm{detect}}$. If no other node is being rejuvenated at that time, the newly detected node can be rejuvenated. Once rejuvenation is completed, the token is put back in place $P_{\mathrm{up}}$. Note that a node is allowed to fail even while waiting for rejuvenation. If the prediction is unsuccessful, a token is deposited in place $P_{\mathrm{detectfail}}$ and then transits to the degraded state unchecked. Transitions $T_{\mathrm{nodefail1}}$ and $T_{\mathrm{nodefail2}}$ have the same rates.

Figure 24
Cluster system employing prediction-based rejuvenation. [Diagram: the SRN of Figure 22 extended with places $P_{\mathrm{detect}}$, $P_{\mathrm{detectfail}}$, and $P_{\mathrm{rejuv}}$; transitions $T_{\mathrm{rejuv}}$, $T_{\mathrm{nodefail1}}$, and $T_{\mathrm{nodefail2}}$; immediate transitions $T_{\mathrm{immd7}}$ through $T_{\mathrm{immd11}}$; guard functions $g_4$ and $g_5$; branch probabilities $c_1$, $(1 - c_1)$, $c_2$, and $(1 - c_2)$.]

Acknowledgments
The authors gratefully acknowledge the constructive comments of David Bradley of IBM and Katerina Goseva-Popstojanova of Duke University. This work was supported by IBM as an enhancement project of the Center for Advanced Computing and Communication, Duke University.

*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Microsoft Corporation, Novell, Inc., The Santa Cruz Operation, Inc., or Sun Microsystems, Inc.

References
1. J. Gray and D. P. Siewiorek, “High-Availability Computer Systems,” IEEE Computer 24, 39–48 (September 1991).
2. M. Sullivan and R. Chillarege, “Software Defects and Their Impact on System Availability: A Study of Field Failures in Operating Systems,” Proceedings of the 21st IEEE International Symposium on Fault-Tolerant Computing, 1991, pp. 2–9.
3. Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software Rejuvenation: Analysis, Module and Applications,” Proceedings of the 25th Symposium on Fault Tolerant Computer Systems, Pasadena, CA, June 1995, pp. 381–390.
4. A. Avritzer and E. J. Weyuker, “Monitoring Smoothly Degrading Systems for Increased Dependability,” Empirical Software Eng. J. 2, No. 1, 59–77 (1997).
5. L. Bernstein, text of seminar delivered by Bernstein, University Learning Center, George Mason University, Fairfax, VA, January 29, 1996.
6. B. O. A. Grey, “Making SDI Software Reliable Through Fault-Tolerant Techniques,” Defense Electron. 19, 77–80, 85–86 (August 1987).
7. A. T. Tai, S. N. Chau, L. Alkalaj, and H. Hecht, “On-Board Preventive Maintenance: Analysis of Effectiveness and Optimal Duty Period,” Proceedings of the 3rd International Workshop on Object Oriented Real-Time Dependable Systems, Newport Beach, CA, February 1997, pp. 40–47.
8. E. Marshall, “Fatal Error: How Patriot Overlooked a Scud,” Science, p. 1347 (March 13, 1992).
9. G. Pfister, In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1998.
10. K. Hwang, Advanced Computer Architecture: Parallelism, Scalability and Programmability, McGraw-Hill Book Co., Inc., New York, 1993.
11. S. Garg, A. van Moorsel, K. Vaidyanathan, and K. Trivedi, “A Methodology for Detection and Estimation of Software Aging,” Proceedings of the 9th International Symposium on Software Reliability Engineering, Paderborn, Germany, November 1998, pp. 282–292.
12. K. Vaidyanathan and K. S. Trivedi, “A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems,” Proceedings of the Tenth IEEE International Symposium on Software Reliability Engineering, Boca Raton, FL, November 1999, pp. 84–93.
13. R. Chillarege, S. Biyani, and J. Rosenthal, “Measurement of Failure Rate in Widely Distributed Software,” Proceedings of the 25th IEEE International Symposium on Fault Tolerant Computing, Pasadena, CA, July 1995, pp. 424–433.
14. R. K. Iyer and D. J. Rossetti, “Effect of System Workload on Operating System Reliability: A Study on IBM 3081,” IEEE Trans. Software Eng. SE-11, No. 12, 1438–1448 (December 1985).
15. I. Lee, “Software Dependability in the Operational Phase,” Ph.D. thesis, Department of Electrical and Computer Engineering, University of Illinois, Urbana–Champaign, IL, 1995.
16. R. K. Iyer, L. T. Young, and P. V. K. Iyer, “Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data,” IEEE Trans. Computers 39, No. 4, 525–537 (April 1990).
17. D. Tang and R. K. Iyer, “Dependability Measurement Modeling of a Multicomputer System,” IEEE Trans. Computers 42, No. 1 (January 1993).
18. A. Thakur and R. K. Iyer, “Analyze-NOW: An Environment for Collection and Analysis of Failures in a Network of Workstations,” Proceedings of the International Symposium on Software Reliability Engineering, White Plains, NY, April 1996, pp. 14–23.
19. R. A. Maxion and F. E. Feather, “A Case Study of Ethernet Anomalies in a Distributed Computing Environment,” IEEE Trans. Reliability 39, No. 4 (1990).
20. S. Garg, A. Puliafito, and K. S. Trivedi, “Analysis of Software Rejuvenation Using Markov Regenerative Stochastic Petri Net,” Proceedings of the Sixth International Symposium on Software Reliability Engineering, Toulouse, France, October 1995, pp. 180–187.
21. S. Garg, Y. Huang, C. Kintala, and K. S. Trivedi, “Time and Load Based Software Rejuvenation: Policy, Evaluation and Optimality,” Proceedings of the First Fault-Tolerant Symposium, Madras, India, December 22–25, 1995.
22. S. Garg, Y. Huang, C. Kintala, and K. S. Trivedi, “Minimizing Completion Time of a Program by Checkpointing and Rejuvenation,” Proceedings of the 1996 ACM SIGMETRICS Conference, Philadelphia, PA, May 1996, pp. 252–261.
23. A. Pfening, S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, “Optimal Rejuvenation for Tolerating Soft Failures,” Perform. Eval. 27 & 28, 491–506 (October 1996).
24. S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, “Analysis of Preventive Maintenance in Transactions Based Software Systems,” IEEE Trans. Computers 47, No. 1, 96–107 (January 1998).
25. S. W. Hunter and W. E. Smith, “Availability Modeling and Analysis of a Two Node Cluster,” Proceedings of the 5th International Conference on Information Systems, Analysis and Synthesis, Orlando, FL, October 1999.
26. V. B. Mendiratta, “Reliability Analysis of Clustered Computing Systems,” Proceedings of the Ninth IEEE International Symposium on Software Reliability Engineering, Paderborn, Germany, November 1998, pp. 268–272.
27. M. R. Lyu and V. B. Mendiratta, “Software Fault Tolerance in a Clustered Architecture: Techniques and Reliability Modeling,” Proceedings of the 1999 IEEE Aerospace Conference, Snowmass, CO, March 1999, pp. 141–150.
28. J. K. Muppala, G. Ciardo, and K. S. Trivedi, “Stochastic Reward Nets for Reliability Prediction,” Commun. RMS, July 1994.
29. K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi, “Analysis of Software Rejuvenation in Cluster Systems,” Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/PERFORMANCE 2001, Cambridge, MA, 2001, in press.
30. G. Ciardo, J. Muppala, and K. S. Trivedi, “SPNP: Stochastic Petri Net Package,” Proceedings of the Third International Workshop on Petri Nets and Performance Models, Kyoto, Japan, 1989, pp. 142–151.
31. C. Hirel, B. Tuffin, and K. S. Trivedi, “SPNP: Stochastic Petri Nets. Version 6.0,” Computer Performance Evaluation: Modeling Tools and Techniques; 11th International Conference, TOOLS 2000, Schaumburg, IL, B. Haverkort, H. Bohenkamp, and C. Smidts, Eds., Lecture Notes in Computer Science, Vol. 1786, Springer Verlag, 2000, pp. 354–357.
32. C. Gage, IBM Secureway Network Dispatcher 2.1; http://www-4.ibm.com/software/network/dispatcher/library/.
33. H. Bozdogan, “Model Selection and Akaike’s Information Criterion (AIC): The General Theory and Its Analytical Extensions,” Psychometrika 53, No. 3, 345–370 (1987).
34. N. R. Draper and H. Smith, Applied Regression Analysis, 2nd ed., John Wiley & Sons, Inc., New York, 1981.
35. K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1982.
36. C. Wang, D. Logothetis, K. S. Trivedi, and I. Viniotis, “Transient Behavior of ATM Networks Under Overloads,” Proceedings of IEEE INFOCOM, Vol. 3, San Francisco, CA, March 24–28, 1996, pp. 978–985.

Received September 8, 2000; accepted for publication March 30, 2001

Vittorio Castelli IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598 ([email protected]). Dr. Castelli received a “Laurea” degree in electrical engineering from the Politecnico di Milano in 1988, with a thesis in bioengineering; he received an M.S. in electrical engineering in 1990, an M.S. in statistics in 1994, and a Ph.D. in electrical engineering in 1995 with a dissertation on information theory and statistical classification, all from Stanford University. In 1995 he joined the IBM Thomas J. Watson Research Center as a Postdoctoral Fellow, and became a Research Staff Member in 1996. From 1996 to 1998 he was co-investigator of NASA/CAN Project No. NCC5-101. Dr. Castelli is currently a member of the Systems Theory and Analysis group, where he works on memory compression, prediction, and performance analysis. His main research interests include information theory, statistics, and classification, and their applications to performance analysis and computer architecture. He is a member of Sigma Xi, the IEEE Information Technology Society, and the American Statistical Association. Dr. Castelli has published papers on computer-assisted instruction, statistical classification, data compression, image processing, multimedia databases, database mining, and multidimensional indexing structures.

Richard E. Harper IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598 ([email protected]). Dr. Harper has been a Research Staff Member at the IBM Thomas J. Watson Research Center since 1998. Previously he was a Senior Technical Consultant at Stratus Computer in Marlboro, Massachusetts, from 1995 to 1998, and a Principal Member of the Technical Staff at the Charles Stark Draper Laboratory from 1987 to 1995. He received his Ph.D. in computer systems engineering from the Massachusetts Institute of Technology in 1987, his Master of Science degree in aeronautical engineering from Mississippi State University in 1978, and his Bachelor of Science degree in physics from Mississippi State University in 1977. His interests are in the design, analysis, and implementation of highly reliable high-performance computing systems.

Philip Heidelberger IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598 ([email protected]). Dr. Heidelberger received a B.A. degree in mathematics from Oberlin College in 1974 and a Ph.D. degree in operations research from Stanford University in 1978. He has been a Research Staff Member at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York, since 1978. His research interests include modeling and analysis of computer performance, probabilistic aspects of discrete-event simulations, and parallel simulation. He has won Best Paper awards at the ACM SIGMETRICS and ACM PADS (Parallel and Distributed Simulation) Conferences, and he was twice awarded the INFORMS College on Simulation’s Outstanding Publication Award. Dr. Heidelberger has recently served as Editor-in-Chief of the ACM’s Transactions on Modeling and Computer Simulation. He is the General Chairman of the ACM SIGMETRICS/Performance 2001 Conference and has served as the Program Chairman of the 1989 Winter Simulation Conference and as the Program Co-Chairman of the ACM SIGMETRICS/Performance ’92 Conference. He is a Fellow of the ACM and the IEEE.


Steven W. Hunter IBM Server Group, 3039 Cornwallis Road, Research Triangle Park, North Carolina 27709 ([email protected]). Dr. Hunter is a Senior Engineer in the xSeries Architecture and Technology Department of the IBM Server Group. In addition to software rejuvenation, he is currently involved with future high-density rack-mount systems, the Summit architecture, clustering, web-hosting and networking technology. Prior to this, he was in the IBM Networking Hardware Division, where he worked on numerous networking products and high-speed networking technology. Dr. Hunter received the B.S. degree in electrical engineering from Auburn University, the M.S. degree in electrical and computer engineering from North Carolina State University, and the Ph.D. degree in electrical and computer engineering from Duke University. His research interests include systems and networking architecture, design, analysis, and implementation. He is also an adjunct professor at North Carolina State University, where he currently teaches a course on high-speed networking.

Kishor S. Trivedi Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina 27708 ([email protected]). Dr. Trivedi received the B.Tech. degree from the Indian Institute of Technology (Bombay), and M.S. and Ph.D. degrees in computer science from the University of Illinois, Urbana–Champaign. He holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, and has a joint appointment in the Department of Computer Science at Duke. He has been on the Duke faculty since 1975. Dr. Trivedi is the Duke Site Director of an NSF industry–university cooperative research center between North Carolina State University and Duke University for carrying out applied research in computing and communications. He has served as a Principal Investigator on various projects funded by AFOSR, ARO, Burroughs, DARPA, the Draper Laboratory, IBM, DEC, Alcatel, Telcordia, Motorola, NASA, NIH, ONR, NSWC, Boeing, Union Switch and Signals, NSF, and SPC, and as a consultant to industry and research laboratories. Dr. Trivedi was an Editor of the IEEE Transactions on Computers from 1983 to 1987. He is a co-designer of the HARP, SAVE, SHARPE, SPNP, and SREPT modeling packages, which have been widely circulated. He is the author of the book Probability and Statistics with Reliability, Queuing and Computer Science Applications (Prentice Hall) and has recently published two books entitled Performance and Reliability Analysis of Computer Systems (Kluwer Academic Publishers) and Queueing Networks and Markov Chains (John Wiley). Dr. Trivedi’s research interests are in reliability and performance assessment of computer and communication systems. He has published more than 250 articles and lectured extensively on these topics, and has supervised 32 Ph.D. dissertations. He is a Fellow of the Institute of Electrical and Electronics Engineers and a Golden Core Member of the IEEE Computer Society.

Kalyanaraman Vaidyanathan Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina 27708 ([email protected]). Mr. Vaidyanathan received his bachelor’s degree in computer science from the University of Madras, India, in 1996 and the master’s degree in electrical and computer engineering from Duke University in 1999. He is currently a Ph.D. student at Duke and received the IBM Graduate Fellowship Award in 2000. His research interests include fault-tolerant systems, software reliability and performance, and dependability evaluation of computer systems.

William P. Zeggert IBM Server Group, 3039 Cornwallis Road, Research Triangle Park, North Carolina 27709 ([email protected]). Mr. Zeggert received his B.S. degree in computer science from Appalachian State University in 1993. His undergraduate research included a statistical library which assisted in proving various statistical characteristics, specifically in an M-Erlang distribution. Mr. Zeggert has experience as a software development consultant in the financial, telecommunication, and network administration industry, developing cross-platform heterogeneous systems for banking and insurance transaction processing, European cellular phone networks and networking/SNMP event/problem tracking frameworks as developmental backbone for large-scale software systems. Mr. Zeggert has also worked as a civil engineering research scientist providing computational expertise within the research engineering community. He researched and developed optimization algorithms and 3D/OpenGL modeling in munitions effects assessment and worked on the development of a Linux parallel computational system to streamline statistical processing of wind dynamics. Mr. Zeggert is now an IBM software engineer providing leadership and development expertise for Intel-hardware-specific system administration utilities including three-tier Java/Swing-based software providing flexible and portable administration tools such as Fuel Gauge, Cluster Systems Management, System Availability, and Software Rejuvenation. He provided the presentation mechanism for software rejuvenation and assisted in the proving and viability of rejuvenation algorithms in the Microsoft OS environment. Mr. Zeggert is currently leading the software quality initiative in the IBM eSeries. His role includes ensuring that the eSeries software, including but not limited to software rejuvenation, is developed so that the customer is provided with the highest quality of software at the lowest possible cost. This cost is measured in development, test, and support dollars.
