
Uncertainty Analysis of Reliability of the JPL

Remote Exploration and Experimentation System

Kesari Mishra1 and Kishor S. Trivedi2

Department of ECE, Duke University, Durham, NC, 27708

Raphael R. Some3

Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, 91109

In this paper we discuss a method for computing the uncertainty in model output metrics due to epistemic uncertainties in the model input parameters. The method uses Monte Carlo sampling to propagate the epistemic uncertainty in the model parameters through the system dependability model. It acts as a wrapper around already existing models and their solution tools/techniques and has a wide range of applicability. Though it is a sampling-based method, no simulation is carried out; instead, an analytic solution of the underlying stochastic model is performed for each set of input parameter values sampled from their distributions. Statistical analysis of the output vector is then performed to obtain the distribution and confidence intervals of the model output metrics. We illustrate this method by applying it to compute the distribution and confidence interval of the reliability of the NASA Remote Exploration and Experimentation system. We employ the Latin Hypercube Sampling (LHS) procedure and, for this example, compare the robustness of the LHS and random sampling procedures.

1 km@ee.duke.edu
2 kst@ee.duke.edu
3 Raphael.R.Some@jpl.nasa.gov


I. Introduction

Complex systems in critical applications employ various hardware and software fault tolerance techniques to ensure high dependability. Hardware and software redundancy, automatic fault detection, multiple levels of recovery, non-disruptive failover, etc. are employed to handle both transient and permanent failures [18, 22, 36, 40]. The dependability of these systems, and the effectiveness of the fault tolerance techniques being used, are often assessed with the help of analytic stochastic models. These models capture the natural randomness and hence take into account the aleatory uncertainty in the system. Randomness in events of interest, such as times to failure/recovery of components, ability to detect failures, and ability to perform recovery actions, is taken into account by means of their distributions. The stochastic models are solved at fixed values of the parameters of these distributions, and the outputs thus obtained depend upon the parameter values used. However, when these parameter values are derived from a finite number of observations from real measurements, or are guessed at by experts (both very likely in real life), they themselves have uncertainty associated with them (known as epistemic uncertainty). This parametric uncertainty is normally outside the scope of stochastic dependability models, as they assume fixed values for each model input parameter. Due to this uncertainty, the model output computed at fixed parameter values can be considered conditional upon the parameter values used. Unconditioning of the model output thus obtained can be performed by means of a multi-dimensional integration. Various analytic and numerical techniques can be employed to solve this integration; however, such an integration would be very complex and prohibitive for large models and/or a large number of model input parameters. One method of computing such integrals is the Monte Carlo method.

This paper discusses a method of quantifying the uncertainty in the model output metrics due to epistemic uncertainties in the model input parameters. The epistemic uncertainty of a parameter can be specified in terms of a distribution of parameter values, or in the form of confidence intervals or bounds on the parameter values. In this paper, we assume that we are given a confidence interval for each of the model input parameters. We discuss a sampling-based method for the propagation of parametric epistemic uncertainty to compute the confidence interval for system reliability. This method acts as a wrapper around already existing stochastic models and does not need to manipulate or perform complex operations on the model and its outputs, giving it a wide range of applicability and ease of use. It is also independent of the solution method of the underlying model; pre-existing model solution methods or tools are relied upon. It does not require the model output to be a closed-form expression of the input parameters and can be applied to model types ranging from simple combinatorial models (reliability block diagrams, fault trees, etc. [42]), which assume independence of component failures and repairs, to more complex state-space models such as Markov chains, semi-Markov and Markov regenerative processes, which capture dependence between events, or stochastic Petri nets, which handle large state spaces [4, 10, 15, 25, 28, 41]. This method also works with large hierarchical models and fixed-point iterative models [39, 42].

With the help of Bayes' formula [9], the epistemic distribution is derived from the aleatory distribution for each model input parameter, using the input uncertainty presented in the form of a confidence interval on the parameter value. In case the epistemic distribution of a parameter is provided, it can be used as such. We briefly explain how to obtain the epistemic distribution from the aleatory distribution for the rate parameter of the exponential distribution and the probability parameter of the Bernoulli distribution. Although Monte Carlo sampling of the input parameters is performed, we do not perform any discrete-event simulation but carry out the solution of the analytic model. We employ the Latin Hypercube Sampling (LHS) procedure for uncertainty propagation, as it has been shown to yield a more efficient estimator of statistics of model output metrics (less variance at the same sample size) and hence is a more robust sampling procedure than random sampling [14, 23, 27]. We illustrate this uncertainty propagation method by computing the distribution and confidence intervals for the reliability of the NASA Remote Exploration and Experimentation system [2, 5, 20]. We also evaluate the robustness of both the LHS and random sampling procedures in our example. While our example discusses the confidence interval for system reliability, the method can be directly applied to compute the uncertainty in other dependability, performance and performability measures computed by solving stochastic analytic models.

This paper is organized as follows: Section II discusses related work in computing uncertainty in model output metrics due to uncertainty in the input parameter values. Section III provides an overview of the complete uncertainty propagation method. The example system, its reliability model and its availability model are introduced in Section IV. The numerical illustration of the method using these examples is also carried out in Section IV. Finally, Section V summarizes the paper.

II. Related Work

In this section we discuss some of the previous work on computing the uncertainty in the output metrics of analytic stochastic models due to the uncertainty in the input parameters.

Most analytic methods for parametric epistemic uncertainty propagation perform manipulations on the closed-form expression of the model output metrics and derive either the variance of the output metric or an exact or approximate confidence interval.

Under the assumption of independent failure of components and a binomial (epistemic) distribution of the failure probabilities of individual components, Madansky et al. [24] compute the exact confidence interval for the reliability of series and parallel systems. The exact confidence interval for the reliability of a series-only system has also been calculated by Sarkar [34] and Lieberman et al. [21], when the time to failure of each component follows an exponential (aleatory) distribution. Both of these methods use the chi-square statistic to compute the confidence interval of the overall system reliability. Approximate confidence intervals for the reliability of series and parallel systems have been computed by Easterling et al. [7] by assuming the overall reliability itself to follow a binomial distribution. Their method matches the variance of the reliability computed by the Maximum Likelihood Estimation (MLE) method to that computed under the binomial assumption, and then uses the incomplete beta function to estimate the confidence interval for the binomially distributed reliability. A large-sample approximation to the confidence interval for the reliability of a series-only system has been provided by Mawaziny and Buehler [26], which assumes the times to failure of the components to be exponentially distributed (aleatory).

These analytic methods can compute the confidence interval of system reliability only for simple combinatorial models (series-only, parallel-only or simple series-parallel models) and for a very limited range of aleatory or epistemic distributions. Since they perform manipulations on the output of the model, they require the output to be a closed-form expression of the model input parameters. Clearly, this becomes intractable for a larger number of parameters or more complex combinatorial models.

For more complex combinatorial models, Coit et al. [6] provide a way of estimating the variance of system reliability using the linearity property of expectation and linear transformations of variance. This method takes the variance of the reliability of individual components as its input to estimate the variance of overall system reliability. However, it is not applicable to state-space models. Another variant of this method, which can also be applied to state-space models, first obtains a Taylor series expansion of the model output and then makes use of the properties of variance and expectation to compute the variance of overall reliability. This method was also discussed in [43].

Parametric uncertainty analysis has also been used as a way to quantify the uncertainty in model output metrics due to the epistemic uncertainty in model input parameters, for a wide range of dependability and performance measures [3, 29, 35, 43].

Some other methods for the propagation of epistemic uncertainty (although discussed in the context of deterministic rather than stochastic models), such as response surface methodology (RSM), the Fourier amplitude sensitivity test (FAST) and fast probability integration (FPI), have been reviewed in [14]. The FPI and FAST methods are computationally complex (more so with a larger number of input parameters) and require the model output to be a closed-form expression of the inputs. The RSM method requires knowledge of experimental design to select the inputs. The uncertainty analysis in the RSM method is not carried out directly on the model but on a response surface approximation to the model, which can be difficult to construct with high fidelity.

Singpurwalla [37] considers the input parameters of analytic models to be unknowable, and the availability or reliability computed using parameter values obtained from measurements or Bayesian priors to be conditional upon the values of the parameters. Singpurwalla coins the term survivability to denote the unconditional value of reliability after the uncertainty in parameter values is taken into account. This unconditioning can be done by means of a multi-dimensional integration. While several analytical and numerical methods can be applied to carry out this integration, the problem becomes intractable for large models and/or a large number of model parameters. An alternate method of computing such integrals is the Monte Carlo method.

Monte Carlo sampling has been used by Haverkort et al. [13] to propagate the uncertainty through the model and compute the quantiles of the distribution of the model output (along with its mean and variance). Their method assumes the epistemic distributions of the parameters to be either uniform or loguniform, depending on whether the approximate range of values of the parameter or the approximate order of magnitude of the values of the parameter is known. In our method, we do not simply assume the epistemic distributions of the input parameters; we derive them from the aleatory distribution type and the confidence interval of each input parameter. In addition, we employ Latin Hypercube Sampling (LHS), which has been shown to be a more robust procedure than the random sampling procedure used in [13].

The method discussed in this paper does not require the output of the model to be a closed-form expression of the input parameters and can be applied to a wide range of model types (combinatorial, state-space and hierarchical models). Our method is easier to use for more complex models as it does not have to manipulate the model output. It is a sampling-based, non-intrusive method which acts as a wrapper around already existing analytic models and their solution methods. The details of this method, and its illustration with the help of the uncertainty analysis of the reliability of the REE system, are discussed in the rest of the paper.

III. Overview of Uncertainty Propagation Method

In this section, we discuss a method for propagating the epistemic uncertainty in the model input parameters through a stochastic analytic model, to compute the uncertainty in the model output metrics. Due to the epistemic uncertainties, the model input parameters can be considered a random vector. Therefore, the model output metric can be considered a random variable that is a function of these input random variables. If the random variables {Θi, i = 1, 2, . . . , k} are the set of k input parameters, the overall reliability R(t) at time t can be viewed as a random variable, a function g of the k input parameters: R(t) = g(Θ1, Θ2, . . . , Θk). Due to the uncertainty associated with the model parameters, computing the reliability at specific parameter values can be seen as computing the conditional reliability R(t | Θ1 = θ1, Θ2 = θ2, . . . , Θk = θk) (denoted by R(t|·) in Equation 1). This can be unconditioned to compute the distribution of reliability via the joint density fΘ1,Θ2,...,Θk(θ1, θ2, . . . , θk) of the input parameters (denoted by f(·) in Equation 1):

F_R(t)(p) = ∫ · · · ∫ I(R(t|·) ≤ p) f(·) dθ1 · · · dθk    (1)

where I(Event) is the indicator variable of the event Event. The unconditional expected reliability at time t can be computed as shown in Equation 2:

E[R(t)] = ∫ · · · ∫ R(t|·) f(·) dθ1 · · · dθk    (2)

Similarly, the second moment of reliability, E[R(t)²], can be computed. With the second moment and the expected value, the variance of reliability at time t, Var[R(t)], can then be computed using the relation Var[X] = E[X²] − (E[X])², where X is a random variable.

The task of numerically evaluating these integrals quickly becomes intractable for complex expressions for system reliability or for a larger number of model input parameters. Apart from the computational problem, the joint epistemic density of all the model parameters also needs to be specified. To simplify the evaluation of these integrals, we assume the epistemic random variables to be independent, so that the joint probability density function factors into the product of the marginals. We then use a Monte Carlo approach to compute these integrals.

Computing R(t | θ1, θ2, . . . , θk) at samples drawn independently from the probability distribution of each of the input parameters (the marginal distributions FΘi(θi), i = 1, . . . , k) yields a sample of values for the model output metric. The sample of reliability values thus obtained can be analyzed to obtain its distribution and hence the confidence interval. The steps in this uncertainty propagation method are summarized in the flowchart shown in Figure 1. We first determine the epistemic distribution and its parameters for each of the model input parameters. The number of samples to be drawn from the epistemic distributions is then computed, and samples are drawn from the epistemic distribution of each of the model parameters using the Latin Hypercube Sampling (LHS) procedure. The set of output values obtained by solving the model at each set of sampled parameter values is finally analyzed to obtain the confidence interval of the model output metric (the unconditional reliability). The remainder of this section discusses each of these steps in detail.

[Figure omitted: flowchart of the uncertainty propagation method. For each input parameter Θi (i = 1, . . . , k), determine its epistemic distribution; determine the sample size n for the model output; for j = 1, . . . , n, generate a random deviate from each epistemic distribution to form the input parameter vector vj = [θ1j, θ2j, . . . , θkj] and solve the stochastic analytic model to get R(t)j = g(vj); construct the empirical CDF of R(t) from the values {R(t)1, R(t)2, . . . , R(t)n} and infer the confidence interval of reliability from the CDF.]

Fig. 1 Flow Chart of Uncertainty Propagation Method
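The loop of Figure 1 can be sketched as a thin Monte Carlo wrapper. In the sketch below, `solve_model` stands in for whatever existing solver evaluates the reliability model (a SHARPE invocation or a closed-form expression); the toy two-parameter model, its epistemic parameters and the sample size are all illustrative assumptions, not the REE model of Section IV.

```python
import math
import random

def propagate_uncertainty(solve_model, epistemic_samplers, n, seed=1):
    """Monte Carlo wrapper of Figure 1: draw n deviates from each input
    parameter's epistemic distribution, solve the unchanged analytic model at
    each sampled parameter vector, and return the sorted outputs (the
    empirical CDF of the output metric)."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n):
        v = [draw(rng) for draw in epistemic_samplers]  # parameter vector v_j
        outputs.append(solve_model(v))                  # R(t)_j = g(v_j)
    return sorted(outputs)

# Toy two-parameter model (hypothetical, not the REE model): R(t) = c * exp(-lam*t)
# at t = 1000 hours, rate lam with a gamma (Erlang) epistemic density, coverage c
# with a beta epistemic density.
cdf = propagate_uncertainty(
    solve_model=lambda v: v[1] * math.exp(-v[0] * 1000.0),
    epistemic_samplers=[
        lambda rng: rng.gammavariate(51, 1.0 / 5.0e6),  # gamma(r+1, rate s_r): scale 1/s_r
        lambda rng: rng.betavariate(96, 5),             # beta(y+1, r-y+1)
    ],
    n=1000,
)
lower, upper = cdf[int(0.025 * 1000)], cdf[int(0.975 * 1000)]  # 95% two-sided CI limits
```

Because the wrapper only calls `solve_model` as a black box, replacing the toy lambda with a call to an external solver leaves the rest of the loop unchanged.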

A. Determination of epistemic distributions

Since our method is based on sampling from the epistemic distributions of the model parameters, we first need to determine the epistemic distribution of each model input parameter, from the uncertainty in the input parameter value and the aleatory distribution type. In this paper we assume the input parameter uncertainty is presented to us in the form of confidence intervals. It has been shown that the posterior distribution obtained by applying Bayes' theorem to the likelihood of the input parameters (based on the aleatory distributions) and an appropriate non-informative prior [9] provides the epistemic distribution of the parameters of the aleatory distribution [38]. In our case (input uncertainty provided as a confidence interval), this method of determining the epistemic distribution requires knowledge of the aleatory distribution type and the number of observations that would have been used to infer the confidence interval of the parameter.

In this subsection, we first discuss computing the number of observations, ri, that would have been used to compute the point estimate and confidence interval of the ith model parameter (i = 1, 2, . . . , k). Aleatory distributions for the times to failure (exponential) and for successfully detecting or recovering from failures (Bernoulli) are considered. We then discuss determining the epistemic posterior distribution, making use of the aleatory distribution and the number of observations computed.

In the rest of the paper we consider ri (i = 1, 2, . . . , k) to be the number of observations that would have been used to compute the point estimate and confidence interval of the model input parameter θi (a parameter of the aleatory distribution). However, in this subsection, when determining the number of observations for different aleatory distribution types, we elide the subscript and denote it simply as r.

1. Computation of the number of observations

We compute the number of observations (e.g., the number of failures observed or the number of failures recovered from) that would have been used to compute the confidence interval of the input parameter, by inverting the relation between the width of the confidence interval, the number of observations and the point estimate of the input parameter. We consider determining the number of observations when the aleatory distribution of the time to failure is exponential, and when the aleatory distribution of successfully detecting or recovering from failures is Bernoulli.

Exponential distribution: Number of observations from a two-sided confidence interval

When the time to failure of a component is exponentially distributed, the point estimate of the rate parameter λ of the exponential distribution is given by λ = r/sr, where r is the number of observed failures during the period of observation and sr is the value of the accumulated life on test random variable Sr. Making use of the upper and lower limits of the 100(1 − α)% two-sided confidence interval of λ [30, 41], the half width d of the confidence interval of λ can be determined as in Equation 3.

d = (1/(4sr)) {χ²_{2r,α/2} − χ²_{2r,1−α/2}}    (3)

where χ²_{2r,1−α/2} is the critical value of the chi-square distribution with 2r degrees of freedom. From Equation 3, the number of observations (number of observed failures, repairs, etc.) that would have resulted in a given half width d, at confidence coefficient (1 − α), can be computed as shown in Equation 4.

r = ⌈(λ/(4d)) {χ²_{2r,α/2} − χ²_{2r,1−α/2}}⌉    (4)

Starting with an initial assumption for the value of r (using the normal approximation to the chi-square distribution), the fixed-point Equation 4 is solved iteratively to converge to a value of r. Similar reasoning can be used to compute r when an upper or lower one-sided confidence interval is provided.
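The iteration on Equation 4 can be sketched as follows. The chi-square critical values are computed here with the Wilson-Hilferty approximation (an assumption of this sketch; any chi-square quantile routine would do), and the example rate and half width are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def chi2_upper(k, p):
    """Chi-square critical value with k degrees of freedom and right-tail
    probability p, via the Wilson-Hilferty approximation (adequate here)."""
    z = NormalDist().inv_cdf(1.0 - p)
    return k * (1.0 - 2.0 / (9.0 * k) + z * sqrt(2.0 / (9.0 * k))) ** 3

def observations_from_halfwidth(lam, d, alpha=0.05, max_iter=100):
    """Solve the fixed-point Equation 4: given the point estimate lam of the
    exponential rate and the half width d of its 100(1-alpha)% two-sided
    confidence interval, recover the number of observations r."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    r = max(1, ceil((z * lam / d) ** 2))   # normal-approximation initial guess
    for _ in range(max_iter):
        r_new = max(1, ceil(lam / (4.0 * d)
                            * (chi2_upper(2 * r, alpha / 2.0)
                               - chi2_upper(2 * r, 1.0 - alpha / 2.0))))
        if r_new == r:
            return r
        r = r_new
    return r
```

For example, a rate estimate of 1.0 per unit time with a 95% half width of 0.2 converges to roughly a hundred observations in one or two iterations.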

Bernoulli distribution: Number of observations from a one-sided confidence interval

The point estimate of the coverage probability is given by c = sr/r, where sr is the value of the random variable Sr, denoting the number of faults/errors detected and recovered from, and r is the total number of faults/errors injected. The lower limit of the upper one-sided 100(1 − α)% confidence interval for the coverage probability is given by Equation 5 [41]:

cL = 1 − χ²_{2(r−sr+1),α}/(2r)    (5)

Inverting Equation 5, the number of injections that would have resulted in this lower confidence limit cL, at confidence coefficient (1 − α), can be obtained as in Equation 6:

r = ⌈χ²_{2(r(1−c)+1),α}/(2(1 − cL))⌉    (6)

Equation 6 is solved iteratively to obtain the value of r, using an initial approximation for r. The initial value is calculated using the normal approximation to the chi-square distribution.
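The Bernoulli counterpart, Equation 6, can be iterated in the same way. As in the exponential case, the Wilson-Hilferty chi-square approximation and the example coverage values are assumptions of this sketch.

```python
from math import ceil, sqrt
from statistics import NormalDist

def chi2_upper(k, p):
    """Chi-square critical value with (possibly non-integer) k degrees of freedom
    and right-tail probability p, via the Wilson-Hilferty approximation."""
    z = NormalDist().inv_cdf(1.0 - p)
    return k * (1.0 - 2.0 / (9.0 * k) + z * sqrt(2.0 / (9.0 * k))) ** 3

def injections_from_ci(c, c_lower, alpha=0.05, max_iter=100):
    """Solve the fixed-point Equation 6: given the coverage point estimate c and
    the lower limit c_lower of its upper one-sided 100(1-alpha)% confidence
    interval, recover the number of fault injections r."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    # Initial guess from the normal approximation c_lower ~ c - z*sqrt(c(1-c)/r)
    r = max(1, ceil(c * (1.0 - c) * (z / (c - c_lower)) ** 2))
    for _ in range(max_iter):
        r_new = max(1, ceil(chi2_upper(2.0 * (r * (1.0 - c) + 1.0), alpha)
                            / (2.0 * (1.0 - c_lower))))
        if r_new == r:
            return r
        r = r_new
    return r
```

For instance, a coverage estimate of 0.99 with a one-sided 95% lower limit of 0.95 converges to roughly ninety injections.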

2. Determining the epistemic distributions

As discussed earlier, the posterior distribution obtained by applying Bayes' theorem to the likelihood of the input parameters (based on the aleatory distributions) and a non-informative prior [9] provides the epistemic distribution of the parameters of the aleatory model [38]. The likelihood function of a model parameter needs the number of observations that would have been used to compute the given confidence interval for that parameter, as computed in Section III.A.1.

In the example in this paper we model the aleatory distributions of times to failure by the exponential distribution. Choosing a non-informative prior, the epistemic distribution of the parameter of the exponential distribution is a gamma distribution with a positive integer shape parameter [9, 38], and hence an Erlang distribution.

If ri is the number of observations that would have been used to compute the ith parameter of the availability model (a rate parameter) θi, and sri is the corresponding value of accumulated time on test for that component, then, choosing a non-informative prior, the posterior density of the rate parameter Θi (hence the epistemic density) is the gamma density [9]:

fΘi|Sri(θi | sri) = gamma(ri + 1, sri)    (7)

where sri = ri/θi.

The aleatory uncertainty in the various coverage probabilities has been modeled using the Bernoulli distribution. If rj is the number of injections that would have been used to compute the value of the jth model parameter (a coverage probability) θj, then, choosing a non-informative prior, the posterior density function (the epistemic density function) is known to be the beta density [9]:

f(θj | yj) = beta(yj + 1, rj − yj + 1)    (8)

where yj is the number of successfully handled faults out of a total of rj injected faults.
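Deviates from the epistemic densities of Equations 7 and 8 can be drawn directly with Python's standard library; the failure and injection counts below are hypothetical, and note that `random.gammavariate` takes a shape and a *scale*, so the gamma(ri + 1, sri) density with rate sri corresponds to scale 1/sri.

```python
import random

rng = random.Random(7)

# Equation 7: epistemic density of an exponential rate parameter is
# gamma(r_i + 1, s_ri) with rate s_ri, i.e. scale 1/s_ri in Python's
# shape/scale parameterization. Hypothetical data: 50 failures in 5e6 hours.
r_i, s_ri = 50, 5.0e6
theta_i = rng.gammavariate(r_i + 1, 1.0 / s_ri)    # a sampled failure rate (per hour)

# Equation 8: epistemic density of a coverage probability is
# beta(y_j + 1, r_j - y_j + 1). Hypothetical data: 95 of 100 faults handled.
y_j, r_j = 95, 100
theta_j = rng.betavariate(y_j + 1, r_j - y_j + 1)  # a sampled coverage probability
```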

B. Determination of the number of samples from epistemic distributions

Once the epistemic distributions of the model input parameters are determined, we draw samples from each epistemic distribution (since we assume the parameters to be independent random variables). However, we need to determine the number of samples to be drawn from these distributions to obtain the minimum sample size required for the confidence interval of the output measure, at a given confidence level. In determining this sample size, we consider the number of observations ri that would have been used to compute the point estimate and confidence interval of each model input parameter Θi (i = 1, 2, . . . , k), as well as the sample size mo (o = 1, 2, . . . , l) of each output measure ∆o. The total number of samples to be drawn from the epistemic distribution of each model parameter is computed as n = max{r1, r2, . . . , rk, m1, m2, . . . , ml}. The sample size based on the output measure(s) is considered only if a desired confidence interval for one of the output measures is provided. If the desired width (or half width) of the confidence interval of any output measure of the model is provided, we assume the output measure to follow a normal distribution, and invert the relation between the half width of the confidence interval and the number of samples to obtain mo.

C. Sampling Procedure

Subsequent to the determination of the epistemic distribution of each input parameter, samples (random deviates) need to be drawn from these distributions. We employ the Latin Hypercube Sampling (LHS) procedure in our uncertainty propagation method. LHS divides the entire probability space into equal intervals (the number of intervals being equal to the number of samples needed) and draws a random sample from each interval. It is thus almost guaranteed to cover the entire probability space evenly and to reach low-probability, high-impact regions of the epistemic distributions of the input parameters. Once the samples are generated, random pairings without replacement are carried out between the samples of different parameters, to ensure randomness in the sequence of the samples. It has been shown that, for a given sample size, a statistic of the model output obtained by the LHS procedure has a variance less than or equal to that obtained by random sampling. In other words, at a particular sample size, LHS yields a more efficient estimator of a statistic of the model output than random sampling does, providing a more robust sampling procedure [14, 23, 27]. The actual random deviates can be generated by methods such as the inverse transform method, rejection sampling, the Box-Muller transform or Johnson's translation [19]. A sample from the distribution of each of the k model input parameters results in a vector of k parameter values. We use the LHS procedure to draw a total of n samples (the sample size determined in Section III.B) from each distribution. Therefore, n such vectors (v1 through vn) are generated, as shown in Equation 9.

v1 = [θ11, θ21, . . . , θk1]
v2 = [θ12, θ22, . . . , θk2]
. . .
vn = [θ1n, θ2n, . . . , θkn]    (9)

In the examples in this paper, for the rate parameters in the models we sample from the gamma distribution, and for the coverage factors we sample from the beta distribution, as determined in Section III.A.2.
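The stratification and random pairing of LHS can be sketched on the unit hypercube; each resulting probability is then mapped through the corresponding parameter's inverse CDF (gamma or beta here) by the inverse transform method. This is an illustration, not the exact implementation used in the paper.

```python
import random

def lhs_unit(n, k, seed=0):
    """Latin Hypercube Sampling on the unit hypercube: split [0, 1) into n
    equal-probability strata per parameter, draw one point from each stratum,
    then randomly pair strata across the k parameters (without replacement)."""
    rng = random.Random(seed)
    cols = []
    for _ in range(k):
        col = [(i + rng.random()) / n for i in range(n)]  # one draw per stratum
        rng.shuffle(col)      # random pairing across parameters
        cols.append(col)
    # Row j is the probability vector for sample j; mapping each entry through
    # the corresponding parameter's inverse CDF (inverse transform method)
    # turns it into an actual gamma or beta deviate.
    return [[cols[d][j] for d in range(k)] for j in range(n)]
```

By construction, every parameter's n values land one per stratum, which is exactly the even coverage of the probability space described above.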

D. Solving the Analytic Model

The reliability model is solved at each of the n sets of sampled input parameter values obtained from the epistemic distributions, to obtain a set of values for the model output, as shown in Equation 10. The model may be solved using software packages such as SHARPE [42] or, if closed-form solutions exist, programmatically by simple user programs. This method is independent of the model solution method, and any pre-existing model solution technique can be applied. It also does not require the model output to be a closed-form expression of the model input parameters, as it does not need to manipulate the model output(s).

{R(t)1, R(t)2, . . . , R(t)n} = {g(v1), g(v2), . . . , g(vn)} (10)

E. Statistical Analysis of the Output

Once the set of model output values {R(t)1, R(t)2, . . . , R(t)n} is obtained, to quantify the uncertainty in the output due to the epistemic uncertainties in the model input parameters, we compute the confidence interval of the output metric(s) at a desired confidence level. A non-parametric method is used to calculate the confidence interval of the output measure, as it obviates the need to make any assumptions about the distribution of the model output or to fit a distribution to the model output values. An empirical Cumulative Distribution Function (CDF) [41] of the set of output values {R(t)1, R(t)2, . . . , R(t)n} is constructed. The values of the output measure at the appropriate percentile points of the CDF provide the confidence interval at the desired confidence level. In our examples, we use the values of the model output corresponding to the 2.5th and 97.5th percentiles as the limits of the 95% two-sided confidence interval. Similarly, the value corresponding to the 5th percentile is chosen as the lower limit of the 95% upper one-sided confidence interval, and the value corresponding to the 95th percentile is chosen as the upper limit of the 95% lower one-sided confidence interval.
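The interval extraction reduces to reading order statistics off the sorted output sample; a sketch with nearest-rank percentiles and illustrative reliability values (a real run would use the n model outputs):

```python
from math import ceil

def percentile(sorted_vals, q):
    """Nearest-rank order statistic of the empirical CDF at percentile q."""
    idx = max(0, ceil(q / 100.0 * len(sorted_vals)) - 1)
    return sorted_vals[idx]

# Illustrative sorted reliability sample (kept tiny, n = 10, for readability):
outputs = sorted([0.91, 0.93, 0.94, 0.95, 0.955, 0.96, 0.965, 0.97, 0.98, 0.99])
two_sided_95 = (percentile(outputs, 2.5), percentile(outputs, 97.5))
one_sided_upper_lower_limit = percentile(outputs, 5.0)   # lower limit, upper one-sided CI
one_sided_lower_upper_limit = percentile(outputs, 95.0)  # upper limit, lower one-sided CI
```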

IV. Illustrative Example

We apply the uncertainty propagation method discussed so far to the reliability analysis of the NASA Remote Exploration and Experimentation (REE) system [2, 20]. The NASA Jet Propulsion Laboratory REE Project was a large multi-year technology demonstration project to develop a low-power, scalable, fault-tolerant, high-performance computing platform for use in space, and to demonstrate that significant on-board computing enables a new class of scientific missions. An REE testbed was developed to test, refine, and validate the scalability of architectures and system approaches to achieve the dependability goals.

The REE system was expected to provide continuous operation through graceful degradation despite experiencing transient or permanent failures. It was expected to experience a small number of radiation-induced transient component failures per day, and permanent component failures only a small number of times over several years. The system provided fault detection and recovery mechanisms so that applications could operate in the presence of faults. The availability and reliability of the REE system were analyzed in [5], with the help of several models, to assess its fault-tolerance features. These models were specified and solved using the SHARPE [42] software package.

The REE system is a collection of processing elements (referred to as nodes) connected together by a Myrinet [1]. Each node is a commercially available computer running a commercial off-the-shelf (COTS) operating system based on UNIX. The architecture of the REE system is shown in Figure 2. Redundancy in hardware, software, time and data is used to achieve fault tolerance. Fault detection and recovery are provided by software-implemented fault tolerance (SIFT) and the system executive (SE) [8, 31-33]. The SE is responsible for local error detection and failure recovery, while the system

[Figure omitted: REE system architecture, showing the SCP, I/O servers, the Myrinet interconnect, and nodes comprising hardware (processors, memory, controller, Myrinet interfaces) and software (application, middleware, OS) layers, together with the system executive (SE).]

Fig. 2 REE System Architecture

control processor (SCP) provides fault detection and recovery at the system level. The SCP relies on periodic node health status messages, or heartbeats, from the SE on each node to determine node status. Nodes can be individually reset/restarted by either the SE or the SCP.

A service request to the system is assigned by a fault-tolerant scheduler to multiple nodes, to be processed in parallel. The same software components have different implementations on different nodes in the system, providing redundancy and design diversity. Results from these nodes are then

compared by the SE to detect and tolerate faults. Intermittent and transient faults [17] can be detected and recovered from by SIFT, the SE, or the SCP. However, as the REE system operates without human intervention, components cannot be replaced after permanent failures. Permanent failures therefore require the system to be reconfigured to a degraded mode of operation, and they affect the long-term behavior of the system. Figure 3 captures the fault detection and recovery scheme employed in REE.

Hardware and software component failures can be transient or permanent in nature. The majority of transient faults are masked by the use of error-correcting codes and redundant hardware. Unmasked transient failures may freeze the processors, making the system unresponsive


[Fig. 3 Fault detection and recovery flowcharts. Inside a node, the middleware (SIFT) performs error detection (acceptance tests), checkpointing, heartbeats to the SE, and component recovery (retry, rollback), reporting node failure to the SE if recovery fails. Inside a cluster, the SE performs error detection (heartbeats from nodes, acceptance tests, voting when 2 of 3 SEs are working), node recovery (restart, reboot), fault isolation (powering off the faulty node), and failover of tasks to spare nodes, reporting to the SCP when no spare remains. For the whole REE system, the SCP collects information from the SEs, performs periodic inspection, and carries out node rejuvenation]

or generate erroneous outputs. The SE relies on heartbeat messages to detect unresponsive or frozen nodes. Erroneous outputs are detected by SIFT with the help of acceptance tests.

Upon detection of either an unresponsive node or an erroneous output, the node is rebooted after its tasks have been transferred to spare nodes (the tasks are restarted from the last checkpoint). In addition, the SCP selectively reboots several nodes as a preventive action, according to their uptime and current health status. This preventive action, called rejuvenation, helps get rid of latent faults and effectively tolerates "soft" or aging-related faults such as memory leaks [12].

Transient and intermittent faults are assumed not to affect the reliability of the system, since after recovery the failed nodes eventually return to use. If a permanent fault is involved, the faulty node cannot be rebooted successfully. In this case, a notification is sent to the SE, the faulty node is isolated, and the number of working nodes decreases. Eventually, when the number of working nodes falls below the minimum requirement (2 in the testbed), the whole system fails.

We use the reliability model of the REE system as an example to illustrate the uncertainty

propagation method. Hence we consider only the permanent faults in the system.


A. Reliability Model

A two-stage hierarchical model [42] was used in [5] to capture the failure behavior of the system and its components. Here we modify it into a three-stage hierarchical model to capture the failure behavior of the software component in greater detail. At the lowest level, a Markov model captures the failure behavior of the software subsystem. At the middle level, the failure behavior of individual components is captured using reliability block diagrams (RBD). The reliability computed from the Markov model at the lowest level provides one of the inputs to the RBD. The top-level model is a fault tree that takes the reliability of each individual component (obtained by solving the lower-level models) as input and accounts for the interactions between the individual components to provide the overall system reliability.

The REE architecture uses redundancy at several levels. Each node uses redundancy in the form of spare processor chips and spare Myrinet interfaces to increase its reliability. The node subsystem is configured as a k-out-of-n system of such nodes (with k = 2 in the prototype implementation). The I/O subsystem also has redundancy in its configuration and is implemented as a parallel system of several redundant nodes. All the nodes in the node subsystem as well as the I/O subsystem are assumed to exhibit the same stochastic failure behavior, and the faults are considered mutually independent. Since the REE system operates without human intervention, no replacement of components after a permanent failure is assumed. It is these permanent failures of components that are

after a permanent failure has been assumed. It is these permanent failures of components that are

considered in the reliability model. Most software failures in the nodes are expected to be recovered

from by retry, process restart or node reboot. The Markov model shown in �gure 4 captures these

escalated levels of recovery actions. The software component is considered to have failed only if the

software failures cannot be recovered from, even after node reboot. Software failures that cannot be

recovered from even after retry, restart or reboot can be identi�ed with Bohrbugs while those that

can be recovered from by use of these actions, can be identi�ed with Mandelbugs [11, 12].
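The three-level composition can be sketched in Python as follows. The exact series/parallel structure of the node RBD and the fault-tree gates is our simplified reading of Figures 5 and 6 (duplicated processors, memories, and Myrinet interfaces; an (n-1)-of-n gate over the I/O servers), so the helper functions below are an illustrative sketch rather than the paper's SHARPE specification:

```python
import math

def rel_exp(lam, t):
    """Reliability of a component with exponential time to failure."""
    return math.exp(-lam * t)

def rel_parallel(*rs):
    """Parallel (redundant) block: fails only if all replicas fail."""
    prob_all_fail = 1.0
    for r in rs:
        prob_all_fail *= (1.0 - r)
    return 1.0 - prob_all_fail

def rel_k_of_n(k, n, r):
    """k-out-of-n block of identical, independent components."""
    return sum(math.comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def rel_node(t, p, r_sw):
    """Middle-level RBD (Fig. 5, simplified): software in series with
    hardware blocks; processors, memories, and Myrinet I/Fs duplicated."""
    return (r_sw
            * rel_parallel(rel_exp(p['pro'], t), rel_exp(p['pro'], t))
            * rel_parallel(rel_exp(p['mem'], t), rel_exp(p['mem'], t))
            * rel_exp(p['bmc'], t) * rel_exp(p['ndc'], t)
            * rel_exp(p['nvm'], t) * rel_exp(p['bus'], t)
            * rel_parallel(rel_exp(p['nif'], t), rel_exp(p['nif'], t)))

def rel_system(t, p, r_sw, n_nodes=4, k_nodes=2, n_ios=4):
    """Top-level fault tree (Fig. 6, simplified): system up iff the k-of-n
    node system, the (n-1)-of-n I/O-server group, and the Myrinet are up."""
    r_node = rel_node(t, p, r_sw)
    r_nodes = rel_k_of_n(k_nodes, n_nodes, r_node)
    r_ios = rel_k_of_n(n_ios - 1, n_ios, r_node)
    return r_nodes * r_ios * rel_exp(p['net'], t)

# Table 1 failure rates (per year); r_sw would come from the Markov model,
# fixed here purely for illustration.
p = {'pro': 0.01, 'mem': 0.03, 'bmc': 0.01, 'ndc': 0.01,
     'nvm': 0.01, 'bus': 0.02, 'nif': 0.03, 'net': 0.0006}
r_sys_5 = rel_system(5.0, p, r_sw=0.96)
```

The hierarchical structure is visible in the call chain: the lowest-level result (r_sw) feeds the RBD, whose result feeds the fault-tree gates, mirroring how SHARPE passes results between model levels.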

The RBD model for a node is shown in Figure 5. Figure 6 shows the higher-level fault tree used to compute the reliability of the entire system. In Figures 5 and 6, the failure rates of the various components are as follows: Myrinet (λnet), Processor (λpro), Memory (λmem), Memory Controller (λbmc), Node Controller (λndc), Non-Volatile Memory (λnvm), PCI Bus (λbus), Myrinet Network


[Fig. 4 Markov model for reliability of software: states UP, UA, UR, UB, FAIL; transitions UP→UA at rate λsw, UA→UP at c·α, UA→UR at (1-c)·α, UR→UP at s·ρ, UR→UB at (1-s)·ρ, UB→UP at b·β, UB→FAIL at (1-b)·β]

[Fig. 5 RBD model for node-level reliability: Software in series with redundant Processors, redundant Memories, Memory Controller, Node Controller, Non-Volatile Memory, PCI Bus, and redundant Myrinet I/Fs]

I/F (λnif), and Software (λsw). The rates of failure, retry, restart, and reboot, as well as the coverage

[Fig. 6 Fault tree model for REE system reliability: the top event, REE System Failure, has as inputs an (n-1)-of-n gate over the I/O servers (IOS), the Myrinet, and the Node System]

probabilities in the Markov model of Figure 4 are as follows: Software failure rate (λsw), Retry rate (α), Restart rate (ρ), Reboot rate (β), and the coverage probabilities for retry, restart, and node reboot (c, s, and b, respectively).
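The absorbing Markov chain of Figure 4 can be solved numerically for the software reliability Rsw(t) = 1 - P(FAIL at t). The following sketch (not part of the original study, which solved the model in SHARPE) encodes the transition structure read off the figure, with an uncovered reboot leading to the absorbing FAIL state:

```python
import numpy as np
from scipy.linalg import expm

def software_reliability(t, lam_sw, alpha, rho, beta, c, s, b):
    """R_sw(t) for the absorbing CTMC of Fig. 4 (states UP, UA, UR, UB, FAIL).
    Escalated recovery: retry (rate alpha, coverage c), restart (rho, s),
    reboot (beta, b); an uncovered reboot leads to the absorbing FAIL state."""
    UP, UA, UR, UB, FAIL = range(5)
    Q = np.zeros((5, 5))
    Q[UP, UA] = lam_sw
    Q[UA, UP] = c * alpha
    Q[UA, UR] = (1 - c) * alpha
    Q[UR, UP] = s * rho
    Q[UR, UB] = (1 - s) * rho
    Q[UB, UP] = b * beta
    Q[UB, FAIL] = (1 - b) * beta
    np.fill_diagonal(Q, -Q.sum(axis=1))   # generator rows sum to zero
    p0 = np.array([1.0, 0, 0, 0, 0])      # start in UP
    pt = p0 @ expm(Q * t)                 # transient state probabilities
    return 1.0 - pt[FAIL]

# Table 1 values (all rates per year)
r_sw = software_reliability(5.0, 8.76, 31536000.0, 3153600.0, 52560.0,
                            0.9, 0.9, 0.9)
```

Because the recovery rates dwarf λsw, Rsw(t) is approximately exp(-λsw(1-c)(1-s)(1-b)t), which is a useful sanity check on the numerical solution.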


B. Uncertainty Propagation through the Reliability Model

The times to failure of the individual components in the RBD model of Figure 5 are assumed to follow exponential distributions. Table 1 summarizes the point estimates of the parameters of the time-to-failure distributions of the various components of the REE system. These values were obtained from [5]. To represent the uncertainty in the model input parameters, we choose the half-width of the 95% two-sided confidence interval of the different parameters to be 15% to 33% of the point estimate.

Parameter   Value

λpro        0.01 per year
λmem        0.03 per year
λbmc        0.01 per year
λndc        0.01 per year
λnvm        0.01 per year
λbus        0.02 per year
λnif        0.03 per year
λnet        0.0006 per year
λsw         8.76 per year
α           31536000 per year
ρ           3153600 per year
β           52560 per year
c           0.9
s           0.9
b           0.9

Table 1 Point estimates of reliability model parameters

The steps in uncertainty propagation through the reliability model of the REE system are as follows:

(a) Determining the epistemic distribution of model parameters: We first determine the number of observations that would have been used to compute the point estimate and 95% confidence interval for each model parameter, as explained in Section III A 1. The number of observations thus


calculated for each model input parameter is summarized in Table 2. Since the times to failure of the components in the system have been assumed to follow exponential distributions, as explained in Section III A 2, the epistemic distributions of the parameter values are gamma distributions with the parameters summarized in Table 2.

Parameter   Num. Observations                 Epistemic Distribution

λpro        117                               gamma(118, 8.57e-5)
λmem        167                               gamma(168, 1.79e-4)
λbmc        92                                gamma(93, 1.08e-4)
λndc        117                               gamma(118, 8.54e-5)
λnvm        103                               gamma(104, 9.71e-5)
λbus        146                               gamma(147, 1.37e-4)
λnif        208                               gamma(209, 1.44e-4)
λnet        19                                gamma(20, 4.21e-5)
λsw         65                                gamma(66, 7.42)
β           81                                gamma(82, 1.54e-3)
b           57 faults injected, 51 detected   beta(52, 5)

Table 2 Epistemic distributions of parameters
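For an exponential time to failure, observing r failure events leads to a conjugate gamma epistemic distribution for the rate λ. The shape values in Table 2 are one more than the number of observations; the scale used below, the point estimate divided by the shape, is our reading of the table (it reproduces the tabulated scales to rounding) and is stated as an assumption:

```python
from scipy import stats

def epistemic_gamma(point_estimate, num_obs):
    """Gamma epistemic distribution for an exponential failure rate:
    shape = num_obs + 1; scale chosen so the mean equals the point
    estimate (our reading of Table 2, stated as an assumption)."""
    shape = num_obs + 1
    scale = point_estimate / shape
    return stats.gamma(a=shape, scale=scale)

# lambda_pro from Tables 1 and 2: point estimate 0.01/yr, 117 observations
dist = epistemic_gamma(0.01, 117)
half_width = (dist.ppf(0.975) - dist.ppf(0.025)) / 2
rel_half_width = half_width / dist.mean()   # should fall in the 15%-33% band
```

With 117 observations the relative half-width is roughly 1.96/sqrt(118), about 18%, consistent with the 15% to 33% range chosen above.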

(b) Determination of the number of samples from the epistemic distributions: The number of samples needed from the distribution of each input parameter is determined to be n = 209, the largest of the numbers of observations among the input parameters.

(c) Sampling Procedure: To implement the LHS procedure, we use Mathematica's built-in function RandomReal to generate random deviates from the uniform distribution within each of the intervals into which the probability space has been divided for LHS. Using those random deviates, we obtain quantiles of the gamma or beta distributions using Mathematica's Quantile function. We also use the function RandomSample to obtain random pairing, without replacement, between the samples from the different epistemic distributions.
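The same LHS procedure can be sketched with scipy in place of Mathematica's RandomReal, Quantile, and RandomSample: divide [0, 1] into n equiprobable strata, draw one uniform deviate per stratum, map the deviates through the inverse CDF, and randomly permute each parameter's samples to pair them across parameters.

```python
import numpy as np
from scipy import stats

def lhs_sample(dists, n, seed=None):
    """Latin Hypercube Sample: one draw per equiprobable stratum per
    parameter, randomly paired across parameters. `dists` maps names to
    frozen scipy distributions; returns a dict of length-n arrays."""
    rng = np.random.default_rng(seed)
    samples = {}
    for name, dist in dists.items():
        # one uniform deviate inside each of the n strata [i/n, (i+1)/n)
        u = (np.arange(n) + rng.uniform(size=n)) / n
        x = dist.ppf(u)        # inverse-CDF (quantile) transform
        rng.shuffle(x)         # random pairing without replacement
        samples[name] = x
    return samples

# a few of the Table 2 epistemic distributions, for illustration
dists = {
    'lam_pro': stats.gamma(a=118, scale=8.57e-5),
    'lam_net': stats.gamma(a=20, scale=4.21e-5),
    'b':       stats.beta(52, 5),
}
s = lhs_sample(dists, n=209, seed=0)
```

Each of the 209 rows of `s` would then be fed to the hierarchical model as one set of parameter values in step (d).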

(d) Solving the Analytic Model: For each of the n = 209 sets of parameter values, the hierarchical model is solved using SHARPE [16].

(e) Summarizing the Output Values: The empirical cumulative distribution function (CDF) of the reliability of the system is constructed from the n = 209 output values obtained by solving the system reliability model using SHARPE. Figure 7 shows the empirical CDF of the reliability of the REE system at time t = 5 years.

Fig. 7 CDF of Reliability of REE System at t = 5 years

The 5th percentile of the CDF provides the lower limit of the 95% upper one-sided confidence interval of reliability, which is computed to be (0.956339, 1). The upper and lower limits of the two-sided 95% confidence interval of reliability are provided by the 97.5th and 2.5th percentiles, respectively. The two-sided 95% confidence interval of reliability is computed to be (0.951727, 0.981611).
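Step (e) amounts to reading percentiles off the vector of n = 209 model outputs. A sketch with NumPy; the reliability values here are synthetic stand-ins for the SHARPE outputs:

```python
import numpy as np

def summarize(outputs, conf=0.95):
    """Confidence intervals from the empirical CDF of the model output:
    two-sided from the (2.5th, 97.5th) percentiles, one-sided upper from
    the 5th percentile (for conf = 0.95)."""
    a = 100 * (1 - conf)
    lo2, hi2 = np.percentile(outputs, [a / 2, 100 - a / 2])
    lo1 = np.percentile(outputs, a)
    return (lo2, hi2), (lo1, 1.0)

rng = np.random.default_rng(1)
rel = rng.beta(200, 6, size=209)   # synthetic stand-in for the 209 outputs
two_sided, one_sided = summarize(rel)
```

No distributional form is assumed or fitted: the intervals come directly from order statistics of the output sample, exactly as described in the text.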

To evaluate the robustness of the LHS procedure compared with random sampling, 20 runs of the entire uncertainty propagation method, each with 209 samples from each epistemic distribution, were performed with both the LHS and the random sampling procedure. The variance of the sample mean of the reliability at time t = 5 years, computed across the 20 runs, was used as an indicator of the robustness of the sampling procedure. This variance was 4.17e-8 with the LHS procedure, less than the 1.56e-7 obtained with the random sampling procedure.

The empirical CDFs of the means of reliability obtained in the 20 executions by the two sampling procedures are summarized in Figure 8. It can be shown that other statistics and quantiles of the distribution of reliability also have lower variance under the LHS procedure. Thus, in this example, LHS provides the more robust sampling procedure.

Fig. 8 CDF of Mean of Reliability, computed by LHS and Random Sampling Procedures
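This robustness experiment can be reproduced in miniature: repeat the sample-then-solve pipeline under both schemes and compare the variances of the resulting sample means. In the sketch below, a monotone function of a gamma-distributed rate stands in for the SHARPE model solution (an illustrative toy, not the REE model):

```python
import numpy as np
from scipy import stats

def run_mean(dist, n, use_lhs, rng):
    """One run: n samples (LHS or plain random) of the rate, the toy
    'model' applied, and the sample mean of the outputs returned."""
    if use_lhs:
        u = (np.arange(n) + rng.uniform(size=n)) / n   # one draw per stratum
    else:
        u = rng.uniform(size=n)                        # plain random sampling
    lam = dist.ppf(u)
    rel = np.exp(-lam * 5.0)        # toy model: exponential reliability at t = 5
    return rel.mean()

rng = np.random.default_rng(7)
dist = stats.gamma(a=118, scale=8.57e-5)    # epistemic dist. of lambda_pro
lhs_means = [run_mean(dist, 209, True, rng) for _ in range(20)]
mc_means = [run_mean(dist, 209, False, rng) for _ in range(20)]
var_lhs, var_mc = np.var(lhs_means), np.var(mc_means)
```

For a monotone model output, stratification removes most of the between-run variability of the mean, which is why the LHS variance comes out markedly smaller, mirroring the 4.17e-8 versus 1.56e-7 comparison reported above.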

V. Summary

In this paper, a method for computing the uncertainty in model output metrics due to epistemic uncertainties in the model input parameters has been presented. This method employs Monte Carlo sampling to compute the confidence interval and distribution of the model output metric. It is a non-intrusive method and acts as a wrapper around existing analytic stochastic models and their solution techniques. It works for a wide range of model types and can easily be applied to complex models, as it does not require the model output to be a closed-form expression of the input parameters. While we apply this method to compute the confidence interval of the reliability of the REE system as an illustration, it can be directly applied to compute confidence intervals of other dependability, performance, and performability measures obtained by solving stochastic analytic models. Although the method entails Monte Carlo sampling of the input parameter space, no simulation is carried out; for each vector of sampled parameter values, the analytic model is solved numerically. We also demonstrate the greater robustness of the LHS procedure over random sampling in our example. The method neither assumes a distribution for the model output nor fits one to it; it uses appropriate percentiles of the empirical CDF of the model output to compute its confidence interval. Methods of obtaining the distribution of the model parameters (epistemic uncertainty) from the distribution of observed system events (aleatory uncertainty) have also been discussed.

Acknowledgement

The authors would like to thank Dr. S. Dharmaraja and Ms. Vandana Khaitan of the Indian Institute of Technology, Delhi, for valuable discussions and help with the models.

This paper is submitted as part of a special section on the New Millennium Program ST-8 project.

References

[1] Myrinet overview. http://www.myri.com/myrinet/overview/.

[2] REE project overview. http://www-ree.jpl.nasa.gov/overview.html.

[3] J.T. Blake, A.L. Reibman, and K.S. Trivedi. Sensitivity analysis of reliability and performability measures for multiprocessor systems. In 1988 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, 1988.

[4] G. Bolch, K.S. Trivedi, S. Greiner, and H. de Meer. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. J. Wiley & Sons, 2006.

[5] D. Chen, S. Dharmaraja, D-Y. Chen, L. Li, K.S. Trivedi, R.R. Some, and A.P. Nikora. Reliability and availability analysis for the JPL Remote Exploration and Experimentation system. In Proc. Int. Conf. on Dependable Systems and Networks, pages 337-344, 2002.

[6] D.W. Coit. System reliability confidence intervals for complex systems with estimated component reliability. IEEE Trans. on Reliability, 46(4):487-493, Dec 1997.

[7] R.G. Easterling. Approximate confidence limits for system reliability. Journal of the American Statistical Association, 67(337):220-222, Mar 1972.

[8] S. Garg, Y. Huang, C. Kintala, and K.S. Trivedi. Minimizing completion time of a program by checkpointing and rejuvenation. In SIGMETRICS, 1996.

[9] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition, 2003.

[10] R. German. Performance Analysis of Communication Systems: Modeling with Non-Markovian Stochastic Petri Nets. Kluwer Academic Publishers, 2000.

[11] M. Grottke, A.P. Nikora, and K.S. Trivedi. An empirical investigation of fault types in space mission system software. In 40th Annual IEEE/IFIP Intl. Conf. on Dependable Systems and Networks (DSN 2010), 2010.

[12] M. Grottke and K.S. Trivedi. Fighting bugs: Remove, retry, replicate, and rejuvenate. IEEE Computer, 40(2):107-109, 2007.

[13] B.R. Haverkort and A.M.H. Meeuwissen. Sensitivity & uncertainty analysis of Markov-reward models. IEEE Trans. on Reliability, 44(1):147-154, March 1995.

[14] J.C. Helton and F.J. Davis. Latin hypercube sampling and propagation of uncertainty in analysis of complex systems. Reliability Engineering and System Safety, 81(1):23-69, 2003.

[15] J. Hillston. A Compositional Approach to Performance Modelling. Cambridge University Press, 1996.

[16] C. Hirel, R.A. Sahner, X. Zang, and K.S. Trivedi. Reliability and performability modeling using SHARPE 2000. In 11th Intl. Conf. on Computer Performance (TOOLS 2000), 2000.

[17] O. Ibe, R. Howe, and K.S. Trivedi. Approximate availability analysis of VAXcluster systems. IEEE Trans. on Reliability, 38:146-152, 1989.

[18] B.W. Johnson. Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley, 1989.

[19] M.E. Johnson. Multivariate Statistical Simulation. John Wiley & Sons, New York.

[20] J.H. Lala and J.T. Sims. A dependability architecture framework for remote exploration & experimentation computers. In The 29th Int. Symp. Fault-Tolerant Computing, 1999.

[21] G.J. Lieberman and S.M. Ross. Confidence intervals for independent exponential series systems. Journal of the American Statistical Association, 66(336):837-840, Dec 1971.

[22] M.R. Lyu and V.B. Mendiratta. Software fault tolerance in a clustered architecture: Techniques and reliability modeling. In 1999 IEEE Aerospace Conference, 1999.

[23] I.A. Macdonald. Comparison of sampling techniques on the performance of Monte-Carlo based sensitivity analysis. In 11th Intl. IBPSA Conference, 2009.

[24] A. Madansky. Approximate confidence limits for the reliability of series and parallel systems. Technometrics, 7(4):495-503, Nov 1965.

[25] M.A. Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis. Modeling with Generalized Stochastic Petri Nets. J. Wiley & Sons, 1995.

[26] A.H. El Mawaziny and R.J. Buehler. Confidence limits for the reliability of series systems. Journal of the American Statistical Association, 62(320):1452-1459, Dec 1967.

[27] M.D. McKay, R.J. Beckman, and W.J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 42(1):55-61, 2000.

[28] J. Muppala, R. Fricks, and K.S. Trivedi. Techniques for system dependability evaluation. In Computational Probability, W. Grassman (ed.), pages 445-480. Kluwer Academic Publishers, 2000.


[29] A.V. Ramesh and K.S. Trivedi. On the sensitivity of transient solutions of Markov models. In 1993 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, 1993.

[30] D. Rasch. Sample size determination for estimating the parameter of an exponential distribution. Akademie der Landwirtschaftswissenschaften der DDR, Forschungszentrum für Tierproduktion, Dummerstorf-Rostock, 19:521-528, 1977.

[31] D.A. Rennels, D.W. Caldwell, R. Hwang, and K. Mesarina. A fault-tolerant embedded microcontroller testbed. In Pacific Rim Int. Symp. Fault-Tolerant Systems (PRFTS97), 1997.

[32] J.A. Rohr. STAREX self-repair routines: Software recovery in the JPL-STAR computer. In The 25th Int. Symp. Fault-Tolerant Computing, 1995.

[33] J.A. Rohr. Software-implemented fault tolerance for supercomputing in space. In The 28th Int. Symp. Fault-Tolerant Computing, 1998.

[34] T.K. Sarkar. An exact lower confidence bound for the reliability of a series system where each component has an exponential time to failure. Technometrics, 13(3):535-546, Aug 1971.

[35] N. Sato and K.S. Trivedi. Stochastic modeling of composite web services for closed-form analysis of their performance and reliability bottlenecks. In 5th Intl. Conf. on Service Oriented Computing (ICSOC), 2007.

[36] D.P. Siewiorek and R.S. Swarz. Reliable Computer Systems: Design and Evaluation. A K Peters, Ltd., 1998.

[37] N.D. Singpurwalla. Reliability and Risk: A Bayesian Perspective. John Wiley & Sons, first edition, 2006.

[38] M. Stamatelatos, G. Apostolakis, H. Dezfuli, C. Everline, S. Guarro, P. Moeini, A. Mosleh, T. Paulos, and R. Youngblood. Probabilistic risk assessment procedures guide for NASA managers and practitioners. http://www.hq.nasa.gov/office/codeq/doctree/praguide.pdf, Ver. 1.1, 2002.

[39] L.A. Tomek and K.S. Trivedi. Fixed point iteration in availability modeling. In 5th Intl. GI/ITG/GMA Conf. on Fault-Tolerant Computing Systems, pages 229-240, 1991.

[40] K. Trivedi, D. Wang, D.J. Hunt, A. Rindos, W.E. Smith, and B. Vashaw. Availability modeling of SIP protocol on IBM WebSphere. In Proc. of Pacific Rim Dependability Conference, 2008.

[41] K.S. Trivedi. Probability and Statistics with Reliability, Queuing and Computer Science Applications. J. Wiley & Sons, New York, 2001.

[42] K.S. Trivedi and R. Sahner. SHARPE at the age of twenty two. SIGMETRICS Performance Evaluation Review, 36(4):52-57, 2009.

[43] L. Yin, M.A.J. Smith, and K.S. Trivedi. Uncertainty analysis in reliability modeling. In Reliability and Maintainability Symposium, pages 229-234, 2001.
