Reliability Testing of Applications on Windows NT

Timothy Tsai*
Reliable Software Technologies
21351 Ridgetop Circle, Suite 400
Dulles, VA 20166 USA
[email protected]

Navjot Singh
Bell Labs Research, Lucent Technologies
600 Mountain Ave, Rm. 2B-413
Murray Hill, NJ 07974 USA
[email protected]

* This work was performed while the author was with Bell Labs Research, Lucent Technologies, Murray Hill, NJ, USA.

Abstract

The DTS (Dependability Test Suite) fault injection tool can be used to (1) obtain fault injection-based evaluation of system reliability, (2) compare the reliability of different applications, fault tolerance middleware, and platforms, and (3) provide feedback to improve the reliability of the target applications, fault tolerance middleware, and platforms. This paper describes the architecture of the tool as well as the procedure for using the tool. Data from experiments with the DTS tool used on the Apache web server, the Microsoft IIS web server, and the Microsoft SQL Server, along with the Microsoft Cluster Server (MSCS) and Bell Labs watchd (part of NT-SwiFT) fault tolerance packages is presented to demonstrate the utility of the tool. The observations drawn from the data also illustrate the strengths and weaknesses of the tested applications and fault tolerance packages with respect to their reliability.

1. Introduction

Microsoft Windows NT is becoming a platform of choice for many applications, including services with dependability requirements. While the advantages of NT include decreased cost and leveraging of commercial development, testing, and support, the dependability of NT is a concern, especially compared to Unix systems that have traditionally formed the foundation for many high dependability products. In order to address this concern, several vendors, including Microsoft [10] and Bell Labs [13], have produced high availability software solutions that mostly depend on resource and process monitoring coupled with application restarts to handle error conditions.

The available commercial solutions all claim to increase availability by tolerating a variety of faults. However, these claims are usually not substantiated by rigorous testing but rather are based on a combination of analytical modeling, simulation, component analysis, and experience.

One significant obstacle to the task of systematic testing of system dependability, in terms of either availability or another quantity, is the lack of easy-to-use fault injection tools. Fault injection is a necessity when testing the robustness of a system to unintended or unexpected events, because such events are often difficult to produce through traditional testing methods. This work addresses the need for an easy-to-use fault injection tool that can be used for a variety of software projects based on Windows NT.

The Dependability Test Suite (DTS) is a tool for testing the error and failure detection and recovery functionality of a server application. Most of the code for the tool has been written in Java to produce a simple, yet practical graphical interface and to facilitate portability among different applications.

This paper describes the DTS fault injection tool and illustrates its use with actual applications. The DTS tool is described in Section 3. Section 4 gives the results of experiments to illustrate the use of the DTS tool in (1) comparing the reliability of fault tolerance middleware, (2) comparing the reliability of applications with similar functionality, and (3) providing useful feedback to improve the target system. A summary and ideas for future work are given in Section 5.

2. Related Work

The current state of the art in fault injection includes many fault injection mechanisms and tools. Iyer [6] and Voas [17] provide good summaries of many techniques and tools, as well as background and further references.

DTS depends on a method of fault injection called software-implemented fault injection (SWIFI). Instead of using hardware fault injectors or simulation environments, SWIFI tools use software code to emulate the effects of hardware and software faults on a real system. Such tools include FIAT [1], FERRARI [7], FINE [8], FTAPE [15],
DOCTOR [5], Xception [2], and MAFALDA [12]. These tools have been implemented for a variety of operating systems, including many Unix variants and real-time operating systems. In contrast to DTS, none of these tools were implemented on Windows NT, although the architectures of these tools do not preclude such an implementation. Rather, interest in Windows NT and its reliability has only recently begun to increase. Also, many of these tools focus on the reliability of the operating system or the platform rather than the reliability of applications. Fuzz [11] is one fault injection tool that tested the reliability of Unix utilities, applications, and services.

The basic DTS architecture is not dependent on a particular fault injection mechanism. However, the initial DTS tool implementation is based on the interception of library calls and corruption of library call parameters. This method of fault injection is not unique. The Ballista [9] tool uses a similar technique to test the robustness of operating systems by fault injecting a set of common system calls used to access the file system. The Ballista work was performed on machines running Mach and various flavors of Unix. Ghosh [4] presents a tool for testing the reliability and robustness of Windows NT software applications. It should be noted that this fault injection technique injects faults during the execution of the target programs and therefore is very different from mutation testing [3], which injects faults into source code before compilation.

None of these tools or studies injects faults into high availability systems. Thus, the focus of the testing is mostly on the target applications or the OS, in the case of Ballista. In addition, most of the tools were developed specifically for the types of fault injection performed, rather than being modular to be compatible with a variety of fault models and target programs.

3. DTS

The main goals in designing the DTS fault injection tool were ease of use, automation, extensibility, portability, and most importantly, the ability to produce useful results. These considerations were important in determining the architecture, coding language, and user interface.

The tool is distributed, with the management and user interface software residing on the control machine and the fault injection mechanism, workload generator, and data collector present on a separate target machine. This separation of the control and target machines is necessary if there is a possibility of a machine crash caused by an injected fault. Otherwise, a machine crash would require human intervention to restart the testing process. In addition, a distributed design allows for testing of distributed systems, especially if failover may occur or if correlated faults on multiple machines are to be injected. Nonetheless, although the tool is distributed in nature, it may be used with all components on a single machine if none of the above issues is pertinent.

The majority of the DTS code is written in Java. The Java language includes many features that facilitate fast code development. These features include socket creation and use, thread management, object-oriented software reuse, convenient graphical libraries, and portability. The small portion of the code that could not be implemented in Java uses the Java Native Interface (JNI) and C. The JNI-implemented code is used for process control and other system-dependent tasks such as Windows NT event log access. For portability reasons, Java does not support a notion of process identifiers (PIDs), which is needed to properly terminate processes, especially those that have been fault injected and therefore may not be responding to normal termination messages.

DTS is controlled via a graphical interface and a set of configuration files. One main configuration file is used to specify test parameters such as timeout periods, a fault list file name, and workload parameters. The fault list file contains a list of faults to be injected. Workloads are specified by creating parameter files with names of applications or services to execute or by creating Java classes that are used by the DTS workload generator. More details on the DTS architecture are contained in [14]. The user's manual [16] contains detailed information about the steps needed to configure and to use the tool.

The DTS tool injects faults by corrupting the input parameters to library calls. The resulting errors emulate the effects of several different types of faults, including application design and coding defects and unintended interactions of the application with the environment and non-standard input. For the results in Section 4, the main goals of the experiments are to compare different applications and fault tolerance middleware. Thus, the main considerations for selecting faults are the ability to trigger error detection and recovery, the ability to discover failure coverage holes, and reproducibility. Other experiments that aim to produce a characterization of a single system's reliability (e.g., in terms of a reliability or availability estimate) will require a real-world profile of the faults being modeled.

The workload is the combined system resource usage (e.g., usage of operating system data structures, communication ports, etc.) caused by the execution of the application programs, the fault tolerance middleware, and the operating system. A workload generator is a set of programs that initiates the programs and creates the program inputs and environments in a controlled and reproducible manner to generate a particular workload. DTS assumes that the workload is created by a client-server set of programs. This assumption is valid for many applications of interest because reliability concerns are particularly important for server programs. The server program is also referred to as the “target program” because the focus of the fault injections is to evaluate the reliability of the server program, in the context of the operating system, fault tolerance middleware, and the client program. Note that the client program affects the overall reliability of the client-server system because client-initiated actions, such as client request retries, may be required for correct operation in the presence of faults. Non-client-server workload scenarios are also supported by DTS, including applications with direct user interaction. However, some additional coding of Java classes may be necessary.

The DTS data collector presents results that include the following:

• Outcome: The outcome for each injected fault is one of the following:

  1. Normal success: The server was able to provide correct responses to all requests without any server restarts or request retransmissions.

  2. Server restart with success: After a restart of the server, the server provided a correct response.

  3. Server restart and client request retry with success: After a restart of the server and the retransmission of at least one client request, the server provided a correct response.

  4. Client request retry with success: After at least one client request was retransmitted, the server provided a correct response.

  5. Failure: At least one of the client requests did not succeed, either because no response was received or an incorrect response was received. This means that the server has failed, and the fault tolerance middleware, if present, has not prevented the failure of the server.

• Response time: The total time for the client and server programs to complete.

• Detailed results: The specific response to each individual request.

Most of the results are client-oriented, which means that most of the results can be determined by examining the client program behavior. Usually the client program is a synthetic program that is specifically written for DTS. Some results, such as whether the server program has been restarted, cannot be determined from examining the client program output. The determination of server program restarts is dependent on the middleware used to perform the restart. Some middleware, such as Microsoft Cluster Server [10], writes output to the Windows NT event log. Other middleware, such as NT-SwiFT [13], creates a separate log file.
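The five outcome categories listed above can be expressed as a small decision function. The following Python sketch is illustrative only: the function name, the argument names, and the string labels are assumptions and are not part of DTS. It takes whether all client requests eventually succeeded, whether a server restart was observed (for example, from the middleware log), and how many request retries the client made.

```python
def classify_outcome(all_requests_succeeded: bool,
                     server_restarted: bool,
                     retries: int) -> str:
    """Map one fault injection run to one of the five outcomes (hypothetical labels)."""
    if not all_requests_succeeded:
        # At least one request got no response or an incorrect response.
        return "failure"
    if server_restarted and retries > 0:
        return "restart and retry with success"
    if server_restarted:
        return "restart with success"
    if retries > 0:
        return "retry with success"
    return "normal success"
```

The failure check comes first because a run counts as a failure regardless of how many restarts or retries occurred along the way.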

Figure 1 shows the sequence of actions performed by the DTS tool for an experiment. An experiment consists of a series of workload sets (e.g., all faults for Apache, for IIS, and for SQL Server). Each workload set consists of a set of fault injection runs. A fault injection run includes the actions associated with the injection of a single fault. For each workload (W), a set of faults is injected. The set of faults depends on the set of functions to inject (F), the number of parameters for a particular function (P), the number of iterations to inject per function (I), and the number of fault types (T). This means that for a fault injection run, the workload programs are started, one fault is injected, and the workload programs are terminated. The fault injection run is repeated until all parameters of all functions have been injected with all fault types (actually, some faults are skipped if DTS determines that the fault will probably not be activated).

[Flow chart: nested loops iterate over workloads (w), functions (f), parameters (p), iterations (i), and fault types (t). Each fault injection run performs the following steps: create fault parameter file, prepare workload programs, start server program, wait for server to be up, start client program (fault is injected), workload termination, gather results.]

Figure 1. Experiment flow chart
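The nested iteration in Figure 1 can be sketched as a generator that enumerates one fault injection run per combination of workload, function, parameter, iteration, and fault type. This is a minimal Python sketch under stated assumptions: the loop nesting order, the `FaultSpec` tuple, and the example function table are hypothetical and are not taken from the DTS implementation.

```python
from itertools import product
from typing import Dict, Iterator, List, NamedTuple

class FaultSpec(NamedTuple):
    workload: str
    function: str
    parameter: int   # zero-based index of the corrupted parameter
    iteration: int   # which invocation of the function is injected
    fault_type: str

def enumerate_runs(workloads: List[str],
                   param_counts: Dict[str, int],
                   iterations: int,
                   fault_types: List[str]) -> Iterator[FaultSpec]:
    """Yield one FaultSpec per fault injection run (one fault per run)."""
    for workload in workloads:
        for function, n_params in param_counts.items():
            for parameter, iteration, fault_type in product(
                    range(n_params), range(1, iterations + 1), fault_types):
                yield FaultSpec(workload, function, parameter, iteration, fault_type)

runs = list(enumerate_runs(
    workloads=["Apache1"],
    param_counts={"CreateEventA": 4, "ReadFileEx": 5},  # illustrative table
    iterations=1,
    fault_types=["zero", "ones", "flip"]))
print(len(runs))  # (4 + 5) parameters x 1 iteration x 3 fault types = 27
```

Each yielded tuple corresponds to one complete run: start the workload programs, inject that single fault, and terminate the workload programs.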

4. Experimental results

To demonstrate the utility of the DTS tool, several experiments were performed. The server programs studied were (1) Apache web server version 1.3.3 for Win32, (2) Microsoft Internet Information Server (IIS) version 3.0, and (3) Microsoft SQL Server version 7. Although IIS can serve as an HTTP server, an FTP server, and a gopher server, only the HTTP functionality was tested in these experiments.

The three programs were executed as NT services in three different configurations: (1) as a stand-alone service, (2) with Microsoft Cluster Server (MSCS), and (3) with the watchd component of NT-SwiFT. All experiments were conducted on the same machines. The hardware platform was a 100 MHz Pentium PC with 48 MB of memory running Windows NT Enterprise Server 4.0 with Service Pack 4. Additional experiments were conducted on a faster 400 MHz Pentium II PC with 128 MB of memory running Windows NT Enterprise Server 4.0 with Service Pack 4. Only the results for the slower 100 MHz Pentium machine are presented here because the faster machine was not yet equipped with MSCS in our lab. However, on the faster machine, the results for Apache, IIS, and SQL Server as stand-alone services and with watchd were essentially identical to those on the slower machine.

For each server program, a simple client program was created to send requests to the server program. For the Apache and IIS web servers, the HttpClient program sends two types of requests: (1) an HTTP request for a 115 kB static HTML file and (2) an HTTP request for a 1 kB static HTML file via the Common Gateway Interface (CGI). For the SQL Server, the SqlClient program sends an SQL select request based on a single table. Both HttpClient and SqlClient check the correctness of the server reply. If the reply is incorrect or if the reply is not received within a timeout period (a default of 15 seconds), the request is retried. A second retry is attempted if necessary. Each client program waits 15 seconds before attempting a retry. After a correct reply is received or the third attempt fails, the client program outputs information about the success or failure of the requests and the number of retries attempted.
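The retry policy of the client programs can be sketched as follows. This Python sketch is not the HttpClient or SqlClient code; `send_request`, `is_correct`, and the injectable `sleep` are hypothetical stand-ins so that the logic can be exercised without a server. It assumes `send_request` returns `None` when no reply arrives within the timeout.

```python
import time
from typing import Callable, Optional, Tuple

def request_with_retries(send_request: Callable[[float], Optional[str]],
                         is_correct: Callable[[str], bool],
                         timeout_s: float = 15.0,
                         retry_wait_s: float = 15.0,
                         max_attempts: int = 3,
                         sleep: Callable[[float], None] = time.sleep
                         ) -> Tuple[bool, int]:
    """Return (success, retries_attempted) for a single request."""
    for attempt in range(1, max_attempts + 1):
        reply = send_request(timeout_s)   # None models a timeout
        if reply is not None and is_correct(reply):
            return True, attempt - 1
        if attempt < max_attempts:
            sleep(retry_wait_s)           # wait before the next attempt
    return False, max_attempts - 1
```

The returned retry count is the kind of client-side information that, combined with the restart information from the middleware log, determines the outcome category of a run.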

For the NT programs, faults were injected by intercepting all calls to the functions in KERNEL32.dll. On our machine, KERNEL32.dll contains 681 functions. Of those 681 functions, 130 functions had no parameters and thus were not candidates for function parameter corruption. The remaining 551 functions were injected. To decrease the total time for the experiments, only the first invocation of each function was injected (i.e., the CreateEventA() function is injected the first time it is called, but not the second or subsequent times). Further invocations can also be injected, but preliminary experiments showed that such injections produced similar results. For each function, each function parameter was injected with three types of faults: (1) reset all bits to zero, (2) set all bits to one, and (3) flip all bits (i.e., one's complement for the parameter value). Each parameter of every function is injected with these three types of faults. Thus, for functions with two parameters, 6 different faults will be injected (2 parameters with 3 fault types for each parameter). Only one fault is injected for each execution of the server program. Although these types of corruption may seem simplistic, they were already effective in differentiating among different workloads (e.g., MSCS vs. watchd) and in helping to discover bugs that lead to failure scenarios. It may be interesting to introduce additional types of corruption based on data types (e.g., treating pointers and Boolean variables differently). However, this requires symbolic information and is compiler dependent, thus affecting the portability of the fault injection method.

Three server programs were studied in the experiments: the Apache web server, the Microsoft IIS web server, and the Microsoft SQL Server. Each was executed as an NT service (1) with no fault-tolerance middleware, (2) with MSCS, and (3) with watchd. It should be noted that the outcome of these experiments is dependent on the workload (especially the requests issued by the client and the configuration of the application) and the specific faults that are injected.
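The three parameter corruptions described above are simple bit operations, and the fault count per function follows directly from the number of parameters. The Python sketch below assumes a 32-bit parameter word (an assumed width, plausible for Win32 on the Pentium machines used); the function names and fault-type labels are hypothetical.

```python
WORD_BITS = 32                      # assumed parameter width
WORD_MASK = (1 << WORD_BITS) - 1    # 0xFFFFFFFF

def corrupt(value: int, fault_type: str) -> int:
    """Apply one of the three fault types to a parameter value."""
    if fault_type == "zero":        # (1) reset all bits to zero
        return 0
    if fault_type == "ones":        # (2) set all bits to one
        return WORD_MASK
    if fault_type == "flip":        # (3) one's complement of the value
        return ~value & WORD_MASK
    raise ValueError(f"unknown fault type: {fault_type}")

def faults_per_function(n_params: int, n_fault_types: int = 3) -> int:
    """Number of distinct faults for one function (first invocation only)."""
    return n_params * n_fault_types

print(hex(corrupt(0x0000FFFF, "flip")))  # 0xffff0000
print(faults_per_function(2))            # 6, as in the two-parameter example
```

Masking after the complement keeps the result within the assumed word width, since Python integers are unbounded.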

A particular server program will not necessarily call all functions in a DLL. In fact, the majority of functions in KERNEL32.dll are not called. Table 1 shows the number of activated functions for each workload. See Section 4.1 for an explanation of Apache1 and Apache2. To shorten the total time for the experiments, if an injected function is not called, all other injections for that function will be skipped because it is assumed that the function will also not be called if the server program is rerun for the next fault.

Table 1. Number of called KERNEL32.dll functions per workload

                  Fault-Tolerance Middleware
Server Program    None    MSCS    watchd
Apache1           13      17      13
Apache2           22      24      22
IIS               76      76      70
SQL               71      74      70
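The skipping rule described before Table 1 can be sketched as a filter over the fault list. In this Python sketch the tuple layout and the `run_fault` callback are hypothetical; `run_fault` is assumed to return `True` if the targeted function was called (i.e., the fault was activated) during the run.

```python
from typing import Callable, Iterable, List, Set, Tuple

Fault = Tuple[str, int, str]   # (function name, parameter index, fault type)

def run_with_skipping(faults: Iterable[Fault],
                      run_fault: Callable[[Fault], bool]) -> List[Fault]:
    """Run faults in order; skip remaining faults of functions never called."""
    not_called: Set[str] = set()
    activated: List[Fault] = []
    for fault in faults:
        function = fault[0]
        if function in not_called:
            continue                 # assume it will not be called next time either
        if run_fault(fault):
            activated.append(fault)
        else:
            not_called.add(function)
    return activated
```

With most KERNEL32.dll functions never called by a given server, a rule of this kind reduces the number of runs for an uncalled function from three per parameter to one.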

4.1. Comparison of fault tolerance middleware packages

Figure 2 shows NT results for comparisons of the Apache web server, the IIS web server, and the SQL Server as stand-alone NT services, with MSCS, and with watchd. For the Apache web server, the NT service consists of multiple processes. The Apache web server was specifically configured to start only two processes for the purposes of these experiments. The first process is a management process that spawns child processes that actually service requests. By default, Apache spawns multiple child processes. Since the tool only targets one process for injection, if one of the other child processes picks up the request, then injected faults may not be activated in a reproducible manner. Configuring Apache for only one child process guarantees that the same child process will pick up the request each time, thus ensuring reproducible results. Two sets of results are given for injections into the Apache web server, one set for injections into the first process (labeled as “Apache1” in this paper) and a second set for injection into the child process (labeled as “Apache2”). IIS and SQL server both consist of a single process and are labeled as “IIS” and “SQL” respectively.

Figure 2. Standalone/MSCS/watchd comparisons for Windows NT

Figure 2 shows the NT results for Apache1, Apache2, IIS, and SQL, respectively. Each chart shows the results for one workload as a stand-alone service, with MSCS, and with watchd. The normalized outcomes of the workload sets are displayed graphically in the charts and numerically below the charts. The possible outcomes are the five outcomes described in Section 3. Each outcome is given as a percentage of the total number of activated faults for that particular workload set. It should be noted that different workload sets, even for the same server program, can produce a different number of activated faults, due to the effect of the fault tolerance middleware and the influence of non-determinism inherent in the server programs. However, these effects do not change the conclusions that can be drawn from the data. The faults injected into the extra functions that are called by each server program due to the fault tolerance middleware all result in normal success outcomes, and only one function exhibited non-deterministic behavior: zeroing out all bits in the nNumberOfBytesToRead parameter for ReadFileEx() for SQL Server with the original version of watchd sometimes caused a detected error and sometimes caused a successful restart.
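The normalization used for the charts (each outcome as a percentage of the activated faults in a workload set) can be sketched in a few lines. This Python sketch assumes the per-run outcomes are available as a list of labels; the function name and the labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

def normalize_outcomes(outcomes: Iterable[str]) -> Dict[str, float]:
    """Return each outcome's share of activated faults, in percent."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Because the denominator is the number of activated faults in that workload set, percentages from workload sets with different activated-fault counts remain comparable.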

Several interesting observations can be made from Figure 2. First, perhaps the most important and obvious observation is that both MSCS and watchd are effective in increasing the reliability of all three server programs. The solid black portions of the figures represent the fault injection runs that resulted in failures, i.e., cases where the server program was not able to produce the correct response even after repeated client request retries. The failure percentages for all server programs decreased markedly when MSCS or watchd was used. In fact, for Apache1, all failure outcomes were eliminated using watchd.

The effectiveness of MSCS and watchd in reducing the number of failures is attributable to their ability to detect situations in which the monitored server program is malfunctioning and then to initiate a recovery action, which entails a server program restart for these experiments. Discounting the effects of non-determinism and additional activated faults caused by using MSCS and watchd, the numbers of normal success and request retry with success outcomes remain essentially the same for each server program. The difference is reflected in the portion of failure outcomes that become success with restart outcomes due to the MSCS and watchd restart mechanisms.

Figure 2 also reveals the effectiveness of the Apache architecture in handling faults. The Apache web server consists of multiple processes. The first process (Apache1) functions as a management process. Its duties include spawning the additional processes (Apache2) that actually service incoming web requests. The first process does not service any web requests itself. If one of the Apache2 processes dies, the Apache1 process will spawn another Apache2 process. This failure detection and restart mechanism within Apache is similar to that for MSCS and watchd. For this reason, MSCS and watchd are effective with the Apache1 process but have no effect on the Apache2 process. The reason for this lack of efficacy is that both MSCS and watchd only monitor the first process that is started for any application. Thus, the child processes that are spawned by the first process are not monitored. Because the Apache1 process does not service any web requests, request retries produce no additional success outcomes, as seen in Figure 2. In addition, because the Apache2 process is not monitored by MSCS or watchd, no restarts initiated by MSCS or watchd occur. However, restarts of the Apache2 process by the Apache1 process do occur and are manifested as normal success and request retry with success outcomes.

Figure 2 shows that while both MSCS and watchd decrease the number of failure outcomes, watchd does a much better job for the fault set used. In fairness to MSCS, only the generic service resource monitor is used. A custom service resource monitor that is specially tailored to interact with and monitor all aspects of the IIS and SQL Server programs would probably improve the MSCS results. However, Microsoft only provides an API for creating the custom resource monitors and not the actual custom resource monitors. Thus, the comparison between MSCS and watchd is based on the default MSCS and watchd packages.

4.2. Comparison of applications with similar functionality

From the experimental data, some interesting observations about the relative reliability and performance characteristics of Apache and IIS can be made. Figure 3 shows the outcomes of the fault injection runs for Apache and IIS as stand-alone services, with MSCS, and with watchd. The Apache results are a combination of the Apache1 and Apache2 results because both Apache processes must be considered in a comparison to IIS, which includes its total functionality in a single process. The Apache1 and Apache2 results are weighted based on the relative number of activated faults for each process. Figure 3 shows that the Apache web server exhibits a lower percentage of failure outcomes than IIS as a stand-alone service, with MSCS, and with watchd. As a stand-alone service and with MSCS, the occurrence of failure outcomes for IIS is twice that for Apache. However, if watchd is used, then the difference is not as great (7.60% vs. 5.80%) because far fewer faults result in failure with watchd.

Table 1 shows that many more functions are activated for IIS than for Apache. To view Apache and IIS on a more common basis, Table 2 compares Apache to IIS counting only faults that were activated for both programs. Fewer faults were activated for the Apache1 process because the Apache2 process provides most of the web serving functionality. The third row of data shows the Apache1 and Apache2 outcomes added together. As with Figure 3, Apache exhibits fewer failures than IIS as a stand-alone service, with MSCS, and with watchd. However, the difference is even more pronounced (e.g., 5.7% vs. 26.0% failures for Apache vs. IIS as stand-alone services compared to 20.58% vs. 41.90% in Figure 3).
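Adding the Apache1 and Apache2 outcomes together is equivalent to weighting each process's percentage by its number of activated faults. The short Python sketch below (the function name is hypothetical) reproduces the stand-alone failure percentage in the Apache1+Apache2 row of Table 2 from the Apache1 and Apache2 rows.

```python
def combine(rate_a: float, n_a: int, rate_b: float, n_b: int) -> float:
    """Weight two percentage rates by their activated-fault counts."""
    return (rate_a * n_a + rate_b * n_b) / (n_a + n_b)

# Stand-alone failure rates from Table 2: Apache1 (30 faults), Apache2 (111 faults).
combined = combine(20.0, 30, 1.8, 111)
print(round(combined, 1))  # 5.7, matching the Apache1+Apache2 row
```

The same calculation with the MSCS columns (8.3% of 36 and 2.5% of 120) gives about 3.8%, and with the watchd columns (0% of 30 and 1.8% of 111) about 1.4%, in agreement with the failure entries of the combined row.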

It is often useful to consider performance in the presence of faults. Figure 4 shows the average response times for Apache and IIS as stand-alone services, with MSCS, and with watchd. The response times are grouped based on the outcomes of the fault injection runs. The outcome types are the same as those in Figures 2 and 3 with one exception. Failure outcomes are further subdivided into two outcomes: (1) failures where a response is received from the server program, but the response is incorrect and (2) failures where no response is received. Obviously, if no response is received, the response time will be infinite, and therefore these faults are omitted from Figure 4. No occurrences of a particular outcome exist in some cases. The response times are in seconds and are given with corresponding 95% confidence intervals (shown as error bars in the figure).
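The paper does not state how the 95% confidence intervals were computed. As one possible approach, the Python sketch below uses the normal approximation (mean plus or minus 1.96 standard errors); this choice, the function name, and any sample values passed to it are assumptions. For small groups of runs, a Student t quantile would give a wider and more appropriate interval.

```python
import math
import statistics
from typing import Sequence, Tuple

def mean_with_ci95(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width) of a normal-approximation 95% interval."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * std_err
```

The half-width corresponds to the length of the error bar drawn on each side of the average response time for one outcome group.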

Figure 3. Comparison of Apache to IIS

Figure 4. Average response times for Apache and IIS (with 95% confidence intervals)

Table 2. Comparison of Apache to IIS counting only common faults

                   Stand-alone service                 With MSCS                           With watchd
Server program     Activated  Failure  Restart  Retry  Activated  Failure  Restart  Retry  Activated  Failure  Restart  Retry
Apache1                   30    20.0%       0%     0%         36     8.3%     8.3%     0%         30       0%    20.0%     0%
Apache2                  111     1.8%       0%  33.3%        120     2.5%       0%  30.8%        111     1.8%       0%  33.3%
Apache1+Apache2          141     5.7%       0%  26.2%        156     3.8%     1.9%  23.7%        141     1.4%       0%  26.2%
IIS                      123    26.0%       0%  33.3%        135     9.6%    11.1%  40.0%        123    12.2%    22.0%  43.1%

Some observations about the relative performance of Apache and IIS can be drawn from Figure 4. First, there is no appreciable difference in the performance overhead due to the use of MSCS or watchd in either application. Second, for faults that result in normal success outcomes, Apache is faster, especially when MSCS is used. The normal success outcome average response times for Apache and IIS as stand-alone services (14.21 vs. 18.94 seconds) are essentially the same as the corresponding average response times for Apache and IIS when no faults are injected.

Third, the average response times associated with application restarts are lower for IIS than for Apache. Much of this discrepancy is due to the way that Apache seems to handle some problems during service startup. For some faults, the Apache1 process dies immediately after being started by the Windows NT Service Control Manager (SCM). However, the SCM assumes that the service is in the “Start Pending” state. When any service is in a pending state, the SCM locks its database, which causes any state change requests to the SCM to be denied. Thus, both MSCS and watchd must wait until the “Start Pending” state times out before initiating a restart of the service. Although both Apache and IIS experience this scenario, for Apache the number of occurrences is greater and the wait time for each occurrence before the pending state ends is longer.
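The effect of the “Start Pending” lock on recovery time can be pictured with a toy model. The sketch below is not the SCM API: the class, its methods, the polling loop, and the timeout values are all hypothetical and serve only to show why a monitor cannot restart the service until the pending state expires.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    START_PENDING = "start pending"
    RUNNING = "running"


class FakeScm:
    """Toy stand-in for the SCM: denies start requests while a start is pending."""

    def __init__(self, pending_timeout):
        self.pending_timeout = pending_timeout  # hypothetical, in seconds
        self.state = State.STOPPED
        self.pending_since = None

    def start(self, now, process_survives):
        if self.state is State.START_PENDING:
            return False  # database locked: state change request denied
        if process_survives:
            self.state = State.RUNNING
        else:
            # The process died at once, but the SCM still believes it is starting.
            self.state = State.START_PENDING
            self.pending_since = now
        return True

    def tick(self, now):
        """Expire the pending state once its timeout has elapsed."""
        if (self.state is State.START_PENDING
                and now - self.pending_since >= self.pending_timeout):
            self.state = State.STOPPED


def time_to_restart(scm, poll_interval=1):
    """Seconds a polling monitor waits before a restart request is accepted."""
    now = 0
    scm.start(now, process_survives=False)  # faulty start: process dies immediately
    while True:
        now += poll_interval
        scm.tick(now)
        if scm.start(now, process_survives=True):
            return now


print(time_to_restart(FakeScm(pending_timeout=30)))
print(time_to_restart(FakeScm(pending_timeout=5)))
```

In this model the recovery delay is governed entirely by the pending timeout, which is consistent with the observation that a longer wait per occurrence inflates the average restart response time.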

The main lessons drawn from Figure 4 are that (1) MSCS and watchd have a comparable impact on performance and (2) the application being monitored can affect how quickly the fault tolerance middleware is able to recover from detected problems.

4.3. Fault tolerance middleware improvements

In addition to comparing fault tolerance middleware, the DTS tool also plays an important role in identifying weaknesses in fault tolerance middleware and suggesting ways in which its failure coverage can be improved. All outcomes for individual fault injection runs are recorded. Thus, the specific faults that result in failure can be studied to determine the reason for the hole in the failure coverage. This testing and debugging procedure is much more effective with the use of the DTS fault injection tool. Fault injection is necessary to produce the more esoteric problems caused by the combination of faults with such factors as unexpected input or interactions between threads or processes or with the environment. These problems can be especially potent during non-steady-state periods of operation, such as process initialization or termination, or during periods of stress.

The results from the initial experiment involving watchd were studied to improve the original version of watchd (Watchd1) and to create an improved version (Watchd2). Watchd1 starts monitored processes by calling a startService() function that communicates with the SCM to start the service process. In order to monitor the newly created process, watchd obtains the handle of the new process by calling the getServiceInfo() function. In the absence of faults, calling startService() followed by getServiceInfo() worked well. However, some faults caused the service process to fail after startService() was called and before getServiceInfo() was called. This small window of opportunity was sufficient to prevent watchd from correctly obtaining the necessary process handle, and therefore the failed process could not be monitored and restarted. The Watchd2 version merged the functionality of getServiceInfo() into startService().
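The window between starting the service and obtaining its handle can be illustrated with a small model. This is a sketch under stated assumptions, not watchd's code: the delay constants, function names, and death times below are hypothetical, chosen only to show that shortening the window reduces, but does not eliminate, the runs in which the handle is lost.

```python
# Hypothetical delays between process creation and the handle lookup.
WATCHD1_LOOKUP_DELAY_MS = 50  # separate getServiceInfo() call after startService()
WATCHD2_LOOKUP_DELAY_MS = 5   # lookup merged into startService()


def handle_obtained(death_time_ms, lookup_delay_ms):
    """True if the process is still alive when the monitor asks for its handle.

    death_time_ms is None when the injected fault never kills the process.
    """
    return death_time_ms is None or death_time_ms > lookup_delay_ms


def unmonitored_runs(death_times_ms, lookup_delay_ms):
    """Count runs in which the process died inside the lookup window."""
    return sum(
        1 for death in death_times_ms
        if not handle_obtained(death, lookup_delay_ms)
    )


# Hypothetical times (ms after start) at which the service process dies.
death_times_ms = [None, 2, 20, 40, 500, None]

print("Watchd1-style window:", unmonitored_runs(death_times_ms, WATCHD1_LOOKUP_DELAY_MS))
print("Watchd2-style window:", unmonitored_runs(death_times_ms, WATCHD2_LOOKUP_DELAY_MS))
```

A process that dies after the lookup is not a problem in this model, because the monitor already holds a handle through which the failure can be detected.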

Figure 5 shows the results of using Watchd1 and Watchd2 with Apache1, IIS, and SQL. The results for Apache2 are not shown because watchd has no effect on the outcomes for Apache2, as discussed earlier in Section 4.1. As seen in Figure 5, the Watchd2 improvements had mixed success. The failure outcomes for Apache1 actually increased, while no change was seen for SQL. Only IIS with Watchd2 showed an improvement in the results, with a dramatic decrease in the percentage of failure outcomes.

Figure 5. Comparison of original to improved watchd

A second iteration of studying the Watchd2 data resulted in the creation of another improved version (Watchd3). The Watchd2 version combined the tasks of starting the service process and obtaining a process handle to the new process in a single startService() function. This decreased the time window of opportunity for the new process to fail in between the two tasks. However, the opportunity for the new process to fail still existed. To address this problem, the Watchd3 version explicitly checks for a valid process handle before returning from the startService() function. If the process handle is not valid, then a new attempt to start the service process occurs. The check for the valid process handle is further augmented by communication with the SCM to ensure that the service is properly started. These improvements dramatically improved the results for Apache1 and SQL, as shown in Figure 5. The results for IIS were unchanged compared to the results with Watchd2. However, a dramatic improvement had already been obtained with the Watchd2 improvements.
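The Watchd3 strategy of validating the handle and retrying the start can be sketched as a simple loop. The code below is an illustration only: the function names, the callables passed in, and the attempt limit are assumptions of this example, since the source does not describe how many attempts Watchd3 makes or how it queries the SCM.

```python
def start_with_validation(start_once, is_handle_valid, scm_reports_running,
                          max_attempts=3):
    """Start a service and return (handle, attempts) once the start is verified.

    start_once:          callable that starts the service and returns a handle or None
    is_handle_valid:     callable that checks the returned handle
    scm_reports_running: callable that asks the service manager whether the
                         service is properly started
    max_attempts:        hypothetical retry limit, not taken from the paper
    """
    for attempt in range(1, max_attempts + 1):
        handle = start_once()
        # Only return when both the handle and the service manager agree.
        if is_handle_valid(handle) and scm_reports_running():
            return handle, attempt
    raise RuntimeError(f"service did not start after {max_attempts} attempts")


def make_flaky_starter(results):
    """Build a fake start function that returns the given results in order."""
    remaining = iter(results)
    return lambda: next(remaining)


# Hypothetical scenario: the first two starts die before a handle is obtained.
handle, attempts = start_with_validation(
    start_once=make_flaky_starter([None, None, 4242]),
    is_handle_valid=lambda h: h is not None,
    scm_reports_running=lambda: True,
)
print(handle, attempts)
```

The key design point is that the start function does not return until a start has been verified, so the caller is never left with an unmonitored service.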

It should be noted that the chart in Figure 2 includes the results using Watchd3. For Apache1, IIS, and SQL, the results with Watchd1 were all slightly worse than with MSCS. However, the results with Watchd3 were all much better than with MSCS. The conclusion that can be drawn is that the iterative improvements using the DTS tool helped watchd in a significant way.

5. Conclusion

This paper described the architecture and use of the DTS fault injection tool. Experiments with the tool demonstrated its usefulness in several ways.

First, the most practical use of the tool is in the system validation phase of testing. Individual fault injection runs can be used to provide reproducible feedback for improving the target system. The improvement may target the server program, the fault tolerance middleware, or the operating system. The DTS architecture facilitates the testing of different applications, middleware, and systems. This paper showed the dramatic fault coverage improvement gained for the watchd middleware. Similar improvements are also possible for other fault tolerance middleware, such as MSCS, or for server programs or the operating system. The essential contribution of fault injection is the triggering of scenarios that would not normally be encountered in the course of conventional functional testing.

Second, the results of DTS experiments can be used as a starting point for comparing the reliability of applications on Windows NT. Certainly, attention has to be given to the selection of the experimental fault and workload sets. Nonetheless, the DTS tool is useful as a testbed for performing fault injection-based evaluation of specific systems, although care has to be taken in generalizing conclusions about the intrinsic reliability of a particular application, operating system, or fault tolerance middleware.

Several experiments using the DTS tool with several server programs and fault tolerance middleware packages on a Windows NT platform were performed. The results indicate that both MSCS and watchd are useful for increasing the failure coverage of the system (as represented by unity minus the percentage of failure outcomes). In particular, the improved watchd exhibited high failure coverage (greater than 90%) for all tested server programs. The watchd failure coverage was higher than for MSCS. The Apache and IIS server programs were both targeted for testing to demonstrate the use of the DTS tool in comparing server programs with similar functionality. Both reliability and performance results were obtained. The Apache web server exhibited greater reliability and better performance for situations where no application restart or client request retry was required. However, when restart was necessary, IIS was much faster.
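The failure coverage measure used above (unity minus the percentage of failure outcomes) is simple to compute. The sketch below applies it to the Apache1 failure percentages listed in Table 2, which counts only the faults common to Apache and IIS and therefore need not match the figures based on the full fault sets. The function name is an assumption of this example.

```python
def failure_coverage(failure_percent):
    """Failure coverage in percent: 100% minus the percentage of failure outcomes."""
    if not 0.0 <= failure_percent <= 100.0:
        raise ValueError("failure percentage must lie between 0 and 100")
    return 100.0 - failure_percent


# Failure percentages for Apache1 as listed in Table 2 (common faults only).
apache1_failure_percent = {
    "stand-alone": 20.0,
    "with MSCS": 8.3,
    "with watchd": 0.0,
}

for configuration, percent in apache1_failure_percent.items():
    print(f"{configuration}: {failure_coverage(percent):.1f}% coverage")
```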

The current work has been performed on a Windows NT platform. The DTS tool has already been ported to the Linux platform with minimal effort. Only system-dependent Java Native Interface components needed to be rewritten. Testing Apache on Linux with and without watchd has yielded preliminary results. Work is ongoing to determine appropriate fault and workload sets that will allow the Linux results to be compared to the Windows NT results. The fault and workload sets must be described in a system-independent way that can be applied to both types of systems. The DTS architecture has been designed to support Java plugin classes for different fault injection mechanisms, workloads, and data collection strategies. See the user's manual [16] for implementation details.

Another possible interesting application of DTS is availability modeling. Most commercial systems that are concerned with reliability are described using availability numbers. Usually availability is expressed in orders of magnitude (i.e., number of nines of availability). This lack of precision is a result of the lack of tools to measure directly the availability of a system. The state of the art is to combine human experience with analytical models to yield estimates of availability. The DTS tool may play a role in providing testing-based parameters as input to analytical models that would then be able to yield estimates that are more precise. This might provide the basis for work in developing availability benchmarks.
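To make the idea of feeding testing-based parameters into an analytical model concrete, the sketch below uses the standard steady-state relation, availability = MTTF / (MTTF + MTTR), and converts the result to a number of nines. This model is not taken from the paper. The way coverage is folded into the repair time, the function names, and all numeric values are assumptions chosen for illustration.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


def nines(avail):
    """Number of nines of availability, e.g. 0.999 gives about 3."""
    if avail >= 1.0:
        return math.inf
    return -math.log10(1.0 - avail)


def expected_repair_hours(coverage, auto_restart_hours, manual_repair_hours):
    """Assumed model: covered failures are repaired by an automatic restart,
    uncovered failures need a slower manual repair."""
    return coverage * auto_restart_hours + (1.0 - coverage) * manual_repair_hours


# Hypothetical inputs: coverage could come from fault injection runs, while the
# failure rate and repair times would have to come from field data.
mttr = expected_repair_hours(coverage=0.9, auto_restart_hours=0.01,
                             manual_repair_hours=2.0)
avail = availability(mttf_hours=1000.0, mttr_hours=mttr)
print(f"availability = {avail:.6f}, about {nines(avail):.2f} nines")
```

With measured coverage and restart times in place of the hypothetical values, such a model could report availability with more resolution than an order-of-magnitude estimate.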

The DTS tool is available for download at http://www.bell-labs.com/projects/swift/ntdts.

6. Acknowledgments

The authors gratefully acknowledge the design and development effort of Chris Dingman and Michael Vogel, as well as suggestions and feedback from Chandra Kintala. The authors also recognize the role of the reviewers of this paper in providing invaluable comments and suggestions.

References

[1] J. H. Barton et al. Fault injection experiments using FIAT. IEEE Transactions on Computers, 39(4):575–582, Apr. 1990.

[2] J. Carreira, H. Madeira, and J. G. Silva. Xception: Software fault injection and monitoring in processor functional units. In Proceedings 5th International Working Conference on Dependable Computing for Critical Applications, pages 135–149, Urbana, IL, Sept. 1995.

[3] R. A. DeMillo, D. S. Guindi, K. N. King, W. M. McCracken, and A. J. Offutt. An extended overview of the Mothra software testing environment. In Proceedings of the 2nd Workshop on Software Testing, Verification, and Analysis, pages 142–151, Banff, Alberta, July 1988.

[4] A. K. Ghosh and M. Schmid. Wrapping Windows NT binary executables for failure simulation. In Proceedings Fast Abstracts and Industrial Practices 9th International Symposium on Software Reliability Engineering (ISSRE'98), pages 7–8, Paderborn, Germany, Nov. 1998.

[5] S. Han, K. G. Shin, and H. A. Rosenberg. DOCTOR: An integrated software fault injection environment for distributed real-time systems. In International Computer Performance and Dependability Symposium, pages 204–213, Apr. 1995.

[6] R. K. Iyer and D. Tang. Experimental analysis of computer system dependability. In D. K. Pradhan, editor, Fault-Tolerant Computer System Design, chapter 5, pages 282–392. Prentice Hall PTR, Upper Saddle River, NJ, 1996.

[7] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. FERRARI: A tool for the validation of system dependability properties. In Proceedings 22nd International Symposium on Fault-Tolerant Computing, pages 336–344, Boston, Massachusetts, July 1992.

[8] W.-L. Kao and R. K. Iyer. Define: A distributed fault injection and monitoring environment. In Proceedings of IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, June 1994.

[9] N. P. Kropp, P. J. Koopman, and D. P. Siewiorek. Automated robustness testing of off-the-shelf software components. In Proceedings 28th International Symposium on Fault-Tolerant Computing (FTCS-28), pages 231–239, Munich, Germany, June 1998.

[10] Microsoft Windows NT clusters. White Paper, 1997. Microsoft Corporation.

[11] B. P. Miller, D. Koski, C. P. Lee, V. Maganty, R. Murthy, A. Natarajan, and J. Steidl. Fuzz revisited: A re-examination of the reliability of UNIX utilities and services. Technical Report CS-TR-1995-1268, University of Wisconsin, Madison, Apr. 1995.

[12] M. Rodríguez, F. Salles, J. C. Fabre, and J. Arlat. MAFALDA: Microkernel assessment by fault injection and design aid. In Proceedings 3rd European Dependable Computing Conference (EDCC-3), pages 143–160, Prague, Czech Republic, June 1999. Springer. LNCS 1667.

[13] SwiFT: Software implemented fault tolerance for Windows NT. http://www.bell-labs.com/projects/swift.

[14] T. Tsai and N. Singh. Reliability testing of applications on Windows NT. Technical memorandum, Lucent Technologies, Bell Labs, Murray Hill, NJ, USA, May 1999.

[15] T. K. Tsai and R. K. Iyer. An approach to benchmarking of fault-tolerant commercial systems. In Proceedings 26th International Symposium on Fault-Tolerant Computing, pages 314–323, Sendai, Japan, June 1996.

[16] T. K. Tsai and N. Singh. ntDTS User's Manual. Lucent Technologies, Bell Labs, Murray Hill, NJ, USA, 2000. http://www.bell-labs.com/project/swift/ntdts.

[17] J. M. Voas and G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. John Wiley & Sons, Inc., New York, 1998.