Uncertainty Analysis of the Remote Exploration and Experimentation System
Uncertainty Analysis of Reliability of the JPL
Remote Exploration and Experimentation System
Kesari Mishra1 and Kishor S. Trivedi2
Department of ECE, Duke University, Durham, NC, 27708
Raphael R. Some3
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, 91109
In this paper we discuss a method for computing the uncertainty in model output
metrics due to epistemic uncertainties in the model input parameters. The method
uses Monte Carlo sampling to propagate the epistemic uncertainty in the model
parameters through the system dependability model. It acts as a wrapper around
already existing models and their solution tools/techniques and has a wide range of
applicability. Although it is a sampling-based method, no simulation is carried out;
instead, an analytic solution of the underlying stochastic model is computed for each
set of input parameter values sampled from their distributions. Statistical analysis
of the output vector is then performed to obtain the distribution and confidence
intervals of the model output metrics. We illustrate this method by applying it to
compute the distribution and confidence interval of the reliability of the NASA Remote
Exploration and Experimentation system. We employ the Latin Hypercube Sampling (LHS)
procedure and, for this example, evaluate the robustness of the output under both LHS
and random sampling.
I. Introduction
Complex systems in critical applications employ various hardware and software fault tolerance
techniques to ensure high dependability. Hardware and software redundancy, automatic fault detec-
tion, multiple levels of recovery, non-disruptive failover etc. are employed to handle both transient
and permanent failures [18, 22, 36, 40]. The dependability of these systems and the effectiveness
of the fault tolerance techniques being used are often assessed with the help of analytic stochastic
models. These models capture the natural randomness and hence take into account the aleatory
uncertainty in the system. Randomness in events of interest, such as times to failure/recovery of
components, the ability to detect failures, the ability to perform recovery actions, etc., is taken
into account by means of their distributions. The stochastic models are solved at fixed parameter
values of these distributions, and the outputs thus obtained depend upon the parameter values used.
However, when these parameter values are derived from a finite number of observations from real
measurements or guessed at by experts (both very likely in real life), they themselves will have
uncertainty associated with them (known as epistemic uncertainty). This parametric uncertainty is
normally outside the scope of stochastic dependability models, as they assume fixed values for each
model input parameter. Due to this uncertainty, the model output computed at fixed parameter
values can be considered conditional upon the parameter values used. Unconditioning of the
model output thus obtained can be performed by means of a multi-dimensional integration. Various
analytic and numerical techniques can be employed to solve this integration; however, it would
be very complex and prohibitive for large models and/or a large number of model input parameters.
One method of computing such integrals is the Monte Carlo method.
This paper discusses a method of quantifying the uncertainty in the model output metrics due to
epistemic uncertainties in the model input parameters. The epistemic uncertainty of a parameter can
be specified in terms of the distribution of parameter values or in the form of confidence intervals or
bounds on the parameter values. In this paper, we will assume that we are given a confidence interval
for each of the model input parameters. We discuss a sampling-based method for the propagation
of parametric epistemic uncertainty to compute the confidence interval for system reliability. This
method acts as a wrapper around already existing stochastic models and does not need to manipulate
or perform complex operations on the model and its outputs, giving it a wide range of applicability
and ease of use. It is also independent of the solution method of the underlying model; pre-existing
model solution methods or tools are relied upon. It does not require the model output to
be a closed-form expression of the input parameters and can be applied to model types ranging
from simple combinatorial models (reliability block diagrams, fault trees, etc. [42]), which assume
independence of component failures and repairs, to more complex state space models like Markov
chains, semi-Markov and Markov regenerative processes, which capture dependence between events,
or stochastic Petri nets, which compactly specify large state spaces [4, 10, 15, 25, 28, 41]. This
method will also work with large hierarchical models and fixed-point iterative models [39, 42].
With the help of Bayes' formula [9], the epistemic distribution is derived from the aleatory
distribution for each model input parameter, using the input uncertainty presented in the form
of a confidence interval of the parameter value. In case the epistemic distribution of a parameter
is provided, it can be used as such. We briefly explain how to obtain the epistemic distribution from
the aleatory distribution for the rate parameter of the exponential distribution and the probability
parameter of the Bernoulli distribution. Although Monte Carlo sampling of the input parameters is
performed, we do not perform any discrete-event simulation but carry out the solution of the analytic
model. We employ the Latin Hypercube Sampling (LHS) procedure for uncertainty propagation, as it
has been shown to yield a more efficient estimator of statistics of model output metrics (less variance
at the same sample size) and is hence a more robust sampling procedure than random sampling
[14, 23, 27]. We illustrate this uncertainty propagation method by computing the distribution and
confidence intervals for the reliability of the NASA Remote Exploration and Experimentation system
[2, 5, 20]. We also evaluate the robustness of both LHS and random sampling procedures in our
example. While our example discusses the confidence interval for system reliability, the method can
be directly applied to compute uncertainty in other dependability, performance and performability
measures computed by solving stochastic analytic models.
This paper is organized as follows: Section II discusses the related work in computing un-
certainty in model output metrics due to uncertainty in the input parameter values. Section III
provides an overview of the complete uncertainty propagation method. The example system, its
reliability model and availability model are introduced in Section IV. The numerical illustration of
the method using these examples is also carried out in Section IV. Finally, Section V summarizes
the paper.
II. Related Work
In this section we discuss some of the previous work on computing the uncertainty in the
output metrics of analytic stochastic models due to the uncertainty in the input parameters.
Most analytic methods for parametric epistemic uncertainty propagation perform manipulations
on the closed-form expression of model output metrics and derive either the variance of the output
metric or an exact or approximate confidence interval.
Under the assumption of independent failure of components and a binomial distribution (epistemic)
for the probability of failure of individual components, Madansky et al. [24] compute the exact confi-
dence interval for the reliability of series and parallel systems. The exact confidence interval for the
reliability of a series-only system has also been calculated by Sarkar [34] and Lieberman et al. [21], when
the time to failure of each component follows an exponential distribution (aleatory). Both of these
methods use the chi-square statistic to compute the confidence interval of the overall system reliability.
Approximate confidence intervals for the reliability of series and parallel systems have been computed
by Easterling et al. [7] by assuming the overall reliability itself to follow a binomial distribution.
This method matches the variance of the reliability computed by the Maximum Likelihood Estimation
(MLE) method to that computed under the binomial assumption and then uses the incomplete beta
function to estimate the confidence interval for the binomially distributed reliability. A large-sample
approximation to the confidence interval for the reliability of a series-only system has been pro-
vided by Mawaziny and Buehler [26], which assumes the times to failure of the components to be
exponentially distributed (aleatory).
These analytic methods can compute confidence intervals of system reliability for simple combina-
torial models only (series-only, parallel-only or simple series-parallel models) and for a very limited
range of aleatory or epistemic distributions. Since they perform manipulations on the output of
the model, they require the output to be a closed-form expression of the model input parameters.
Clearly, this becomes intractable for a larger number of parameters or more complex combinatorial
models.
For more complex combinatorial models, Coit et al. [6] provide a way of estimating the variance
of system reliability using the linearity property of expectation and linear transformations of variance.
This method takes the variance of the reliability of individual components as its input to estimate
the variance of overall system reliability. However, it is not applicable to state space models. Another
variant of this method, which can also be applied to state space models, first obtains a Taylor series
expansion of the model output and then makes use of the properties of variance and expectation to
compute the variance of overall reliability. This method was also discussed in [43].
Parametric uncertainty analysis has also been used as one of the ways to quantify the uncertainty
in model output metrics due to the epistemic uncertainty in model input parameters, for a wide
range of dependability and performance measures [3, 29, 35, 43].
Some other methods of propagating epistemic uncertainty (although discussed in the context
of deterministic rather than stochastic models), like response surface methodology (RSM), the
Fourier amplitude sensitivity test (FAST) and fast probability integration (FPI), have been reviewed
in [14]. The FPI and FAST methods are computationally complex (more so with a larger number of
input parameters) and require the model output to be a closed-form expression of the inputs. The
RSM method requires knowledge of experimental design to select the inputs. The uncertainty analy-
sis in the RSM method is not carried out directly on the model but on a response surface approximation
to the model, which can be difficult to construct with high fidelity.
Singpurwalla [37] considers the input parameters of analytic models as unknowable, and the
availability or reliability computed using parameter values obtained from measurements or Bayesian
priors to be conditional upon the values of the parameters. Singpurwalla coins the term survivability
to denote the unconditional value of reliability after the uncertainty in parameter values is taken
into account. This unconditioning can be done by means of a multi-dimensional integration. While
several analytical and numerical methods can be applied to carry out this integration, the problem
becomes intractable for large models and/or a large number of model parameters. An alternate
method of computing such integrals is the Monte Carlo method.
Monte Carlo sampling has been used by Haverkort et al. [13] to propagate the uncertainty
through the model and compute the quantiles of the distribution of the model output (along with
its mean and variance). Their method assumes the epistemic distributions of parameters to be either
uniform or loguniform, depending on whether the approximate range of values of the parameter
or the approximate order of magnitude of the values of the parameter is known. In our method, we
do not simply assume the epistemic distributions of the input parameters; rather, we derive them
from the aleatory distribution type and the confidence interval of each input parameter. In addition,
we also employ Latin Hypercube Sampling (LHS), which has been shown to be a more robust
procedure than the random sampling procedure used in [13].
The method discussed in this paper does not require the output of the model to be a closed-form
expression of the input parameters and can be applied to a wide range of model types (combinatorial,
state-space and hierarchical models). Our method is easier to use for more complex models as it
does not have to manipulate the model output. It is a sampling-based, non-intrusive method which
acts as a wrapper around already existing analytic models and their solution methods. The details
of this method and its illustration with the help of an uncertainty analysis of the reliability of the
REE system are discussed in the rest of the paper.
III. Overview of Uncertainty Propagation Method
In this section, we discuss a method for propagating the epistemic uncertainty in the model
input parameters through a stochastic analytic model, to compute the uncertainty in the model
output metrics. Due to the epistemic uncertainties, the model input parameters can be considered
a random vector. Therefore, the model output metric can be considered a random variable that
is a function of these input random variables. Let the random variables {Θi, i = 1, 2, ..., k} be the
set of k input parameters; the overall reliability R(t), at time t, can then be viewed as a random
variable given by a function g of the k input parameters, R(t) = g(Θ1, Θ2, ..., Θk). Due to the
uncertainty associated with the model parameters, computing the reliability at specific parameter
values can be seen as computing the conditional reliability R(t | Θ1 = θ1, Θ2 = θ2, ..., Θk = θk)
(denoted by R(t|·) in Equation 1). This can be unconditioned to compute the distribution of
reliability via the joint density f(θ1, θ2, ..., θk) of the input parameters (denoted by f(·) in
Equation 1):
F_{R(t)}(p) = \int \cdots \int I(R(t|\cdot) \le p)\, f(\cdot)\, d\theta_1 \cdots d\theta_k \qquad (1)
where I(Event) is the indicator variable of the event Event. The unconditional expected reliability
at time t can be computed as shown in Equation 2:
E[R(t)] = \int \cdots \int R(t|\cdot)\, f(\cdot)\, d\theta_1 \cdots d\theta_k \qquad (2)
Similarly, the second moment of reliability, E[R(t)^2], can be computed. With the second moment
and the expected value, the variance of reliability at time t, Var[R(t)], can then be computed using
the relation Var[X] = E[X^2] - (E[X])^2, where X is a random variable.
The task of numerically evaluating these integrals quickly becomes intractable for complex
expressions of system reliability or for a larger number of model input parameters. Apart from the
computational problem, the joint epistemic density of all the model parameters also needs to be
specified. To simplify the evaluation of these integrals, we assume the epistemic random variables to
be independent, so that the joint probability density function can be factored into the product
of marginals. We then use a Monte Carlo approach to compute this integral.
Computing R(t | θ1, θ2, ..., θk) at samples drawn independently from the probability distribution
of each of the input parameters (the marginal distributions F_{Θi}(θi), i = 1, ..., k) yields a
sample of values for the model output metric. The sample of reliability values thus obtained can
be analyzed to obtain its distribution and hence the confidence interval. The steps in this uncertainty
propagation method are summarized in the flowchart shown in Figure 1. We first determine the
epistemic distribution and its parameters for each of the model input parameters. The number of
samples to be drawn from the epistemic distributions is then computed, and samples are drawn from
the epistemic distribution of each of the model parameters using the Latin Hypercube Sampling
(LHS) procedure. The set of output values obtained by solving the model at each set of sampled
parameter values is finally analyzed to obtain the confidence interval of the model output metric
(unconditional reliability). The remainder of this section discusses each of these steps in detail.
[Flowchart: Start → for i = 1, ..., k, determine the epistemic distribution of input parameter Θi →
determine the sample size n for the model output → for j = 1, ..., n, generate a random deviate from
each epistemic distribution to form the input parameter vector vj = [θ1j, θ2j, ..., θkj] and solve the
stochastic analytic model to get R(t)j = g(vj) → analyze the model output: construct the empirical
CDF of R(t) from the values {R(t)1, R(t)2, ..., R(t)n} and infer the confidence interval of reliability
from the CDF → End]
Fig. 1 Flow Chart of Uncertainty Propagation Method
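The loop in Figure 1 can be sketched in a few lines. The conditional reliability function below is a hypothetical stand-in for g(·) (a two-component series system with exponential times to failure), and the gamma parameters are illustrative placeholders, not values taken from the REE models:

```python
import numpy as np

rng = np.random.default_rng(0)

def model_reliability(lam1, lam2, t=1000.0):
    # Hypothetical stand-in for g(.): conditional reliability of a
    # two-component series system with exponential times to failure.
    return np.exp(-(lam1 + lam2) * t)

n = 1000  # sample size, determined as in Section III B

# Draw n samples from the epistemic (here, gamma) distribution of each rate.
lam1 = rng.gamma(shape=20.0, scale=1e-5 / 20.0, size=n)
lam2 = rng.gamma(shape=15.0, scale=2e-5 / 15.0, size=n)

# Solve the analytic model at each sampled parameter vector v_j.
outputs = model_reliability(lam1, lam2)

# Non-parametric 95% two-sided confidence interval from the empirical CDF.
ci_low, ci_high = np.percentile(outputs, [2.5, 97.5])
```

In a real application, `model_reliability` would be replaced by a call to the model solver (e.g., SHARPE) at each sampled parameter vector.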
A. Determination of epistemic distributions
Since our method is based on sampling from the epistemic distributions of the model parameters,
we first need to determine the epistemic distribution of each model input parameter from the
uncertainty in the input parameter value and the aleatory distribution type. In this paper we
assume the input parameter uncertainty to be presented to us in the form of confidence intervals. It
has been shown that the posterior distribution obtained by applying Bayes' theorem to the likelihood
of the input parameters (based on the aleatory distributions) and an appropriate non-informative prior
[9] provides the epistemic distribution of the parameters of the aleatory distribution [38]. In our case
(assuming input uncertainty to be provided as a confidence interval), this method of determining the
epistemic distribution requires knowledge of the aleatory distribution type and the number of
observations that would have been used to infer the confidence interval of the parameter.
In this subsection, we first discuss computing the number of observations, ri, that would have
been used to compute the point estimate and confidence interval of the ith model parameter (i =
1, 2, ..., k). Aleatory distributions for the times to failure (exponential) and for successfully detecting
or recovering from failures (Bernoulli) have been considered. We then discuss determining the
epistemic posterior distribution, making use of the aleatory distribution and the number of
observations computed.
In the rest of the paper we will consider ri (i = 1, 2, ..., k) to be the number of observations that
would have been used to compute the point estimate and confidence interval of the model input
parameter θi (a parameter of the aleatory distribution). However, in this subsection, when determining
the number of observations for different aleatory distribution types, we elide the subscript and
denote it simply as r.
1. Computation of the number of observations
We compute the number of observations (e.g., the number of failures observed or the number of
failures recovered from) that would have been used to compute the confidence interval of the input
parameter by inverting the relation between the width of the confidence interval, the number of
observations and the point estimate of the input parameter. We consider determining the number
of observations when the aleatory distribution for the time to failure is an exponential distribution
and when the aleatory distribution for successfully detecting or recovering from failures is a Bernoulli
distribution.
Exponential distribution: number of observations from a two-sided confidence interval
When the time to failure of a component is exponentially distributed, the point estimate of the
rate parameter λ of the exponential distribution is given by λ = r/sr, where r is the number of
observed failures during the period of observation and sr is the value of the accumulated life on test
random variable Sr. Making use of the upper and lower limits of the 100(1 − α)% two-sided confidence
interval of λ [30, 41], the half-width of the confidence interval of λ, denoted d, can be determined
as in Equation 3.
d = \frac{1}{4 s_r}\left\{\chi^2_{2r,\,\alpha/2} - \chi^2_{2r,\,1-\alpha/2}\right\} \qquad (3)
where χ²_{2r,α/2} and χ²_{2r,1−α/2} are the critical values of the chi-square distribution with 2r
degrees of freedom. From Equation 3, the number of observations (number of observed failures,
repairs, etc.) that would have resulted in a given half-width d, at a confidence coefficient (1 − α),
can be computed as shown in Equation 4.
r = \left\lceil \frac{\lambda}{4d}\left\{\chi^2_{2r,\,\alpha/2} - \chi^2_{2r,\,1-\alpha/2}\right\} \right\rceil \qquad (4)
Starting with an initial assumption for the value of r (using the normal approximation to the
chi-square distribution), the fixed-point Equation 4 is solved iteratively to converge to a value of r.
Similar reasoning can be used to compute r when an upper or lower one-sided confidence interval is
provided.
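A minimal sketch of this fixed-point iteration, assuming SciPy's chi-square quantile function is available (`scipy.stats.chi2.ppf` returns lower-tail quantiles, so the upper-tail critical value χ²_{2r,α/2} corresponds to `chi2.ppf(1 − α/2, 2r)`); the starting value comes from the normal approximation d ≈ λ z_{α/2}/√r:

```python
import math
from scipy.stats import chi2, norm

def observations_from_two_sided_ci(lam_hat, d, alpha=0.05, max_iter=100):
    """Iterate r = ceil( lam_hat/(4d) * (chi2_{2r,a/2} - chi2_{2r,1-a/2}) ), Eq. 4."""
    z = norm.ppf(1 - alpha / 2)
    # Normal-approximation starting point: d ~ lam_hat * z / sqrt(r).
    r = max(1, math.ceil((z * lam_hat / d) ** 2))
    for _ in range(max_iter):
        upper = chi2.ppf(1 - alpha / 2, 2 * r)   # chi2_{2r, alpha/2} (upper tail)
        lower = chi2.ppf(alpha / 2, 2 * r)       # chi2_{2r, 1 - alpha/2}
        r_new = math.ceil(lam_hat / (4 * d) * (upper - lower))
        if r_new == r:
            break
        r = r_new
    return r

# Hypothetical inputs: rate estimate 1e-5 per hour, half-width 2e-6 at 95%.
r = observations_from_two_sided_ci(1e-5, 2e-6)
```

Because the chi-square term grows roughly linearly in r, the iteration typically settles within a few steps.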
Bernoulli distribution: number of observations from a one-sided confidence interval
The point estimate of the coverage probability is given by c = sr/r, where sr is the value of the
random variable Sr, denoting the number of faults/errors detected and recovered, and r is the total
number of faults/errors injected. The lower limit of the upper one-sided 100(1 − α)% confidence
interval for the coverage probability is given by Equation 5 [41]:
c_L = 1 - \frac{\chi^2_{2(r - s_r + 1),\,\alpha}}{2r} \qquad (5)
Inverting Equation 5, the number of injections that would have resulted in this lower limit of the
confidence interval, cL, at a confidence coefficient (1 − α), can be obtained as in Equation 6:
r = \left\lceil \frac{\chi^2_{2(r(1-c)+1),\,\alpha}}{2(1 - c_L)} \right\rceil \qquad (6)
Equation 6 is solved iteratively to obtain the value of r, starting from an initial approximation
calculated using the normal approximation to the chi-square distribution.
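The same pattern applies here; the sketch below assumes SciPy, and seeds the iteration from the normal approximation to the binomial confidence interval (the coverage values passed in at the end are hypothetical):

```python
import math
from scipy.stats import chi2, norm

def injections_from_one_sided_ci(c_hat, c_lower, alpha=0.05, max_iter=500):
    """Iterate r = ceil( chi2_{2(r(1-c)+1), alpha} / (2(1 - c_L)) ), Eq. 6."""
    z = norm.ppf(1 - alpha)
    # Starting point from the normal approximation to the binomial CI:
    # c_hat - c_L ~ z * sqrt(c_hat * (1 - c_hat) / r).
    r = max(1, math.ceil(z ** 2 * c_hat * (1 - c_hat) / (c_hat - c_lower) ** 2))
    for _ in range(max_iter):
        df = 2 * (r * (1 - c_hat) + 1)
        r_new = math.ceil(chi2.ppf(1 - alpha, df) / (2 * (1 - c_lower)))
        if r_new == r:
            break
        r = r_new
    return r

# Hypothetical inputs: point estimate 0.95, one-sided lower limit 0.90 at 95%.
r = injections_from_one_sided_ci(0.95, 0.90)
```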
2. Determining the epistemic distributions
As discussed earlier, the posterior distribution obtained by applying Bayes' theorem to the
likelihood of the input parameters (based on the aleatory distributions) and a non-informative prior [9]
provides the epistemic distribution of the parameters of the aleatory model [38]. The likelihood function
of a model parameter needs the number of observations that would have been used to compute
the given confidence interval for that parameter, as computed earlier in Section III A 1.
In the example in this paper we have modeled the aleatory distributions of the times to failure
by the exponential distribution. Choosing a non-informative prior, the epistemic distribution of
the rate parameter of the exponential distribution is a gamma distribution with a positive integer
shape parameter [9, 38], and hence an Erlang distribution.
If ri is the number of observations that would have been used to compute the ith parameter of the
availability model (a rate parameter) θi, and sri would have been the value of the accumulated
time on test for the corresponding component, then, choosing a non-informative prior, the posterior
density for the rate parameter Θi (hence the epistemic density) is the gamma density [9]:

f_{\Theta_i | S_{r_i}}(\theta_i | s_{r_i}) = \mathrm{gamma}(r_i + 1,\, s_{r_i}) \qquad (7)

where sri = ri/θi.
The aleatory uncertainty in the various coverage probabilities has been modeled using the Bernoulli
distribution. If rj is the number of injections that would have been used to compute the value of the
jth model parameter (a coverage probability) θj, then, choosing a non-informative prior, the posterior
density function (epistemic density function) is known to be the beta density [9]:

f(\theta_j | y_j) = \mathrm{beta}(y_j + 1,\, r_j - y_j + 1) \qquad (8)

where yj is the number of successfully handled faults out of a total of rj injected faults.
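With SciPy, Equations 7 and 8 translate directly into epistemic distribution objects that can later be sampled; the observation counts and point estimates below are hypothetical placeholders:

```python
import numpy as np
from scipy.stats import gamma, beta

# Exponential rate parameter: r_i observed failures with accumulated
# time on test s_ri = r_i / rate_hat, giving gamma(r_i + 1, s_ri) (Eq. 7).
r_i, rate_hat = 100, 1e-5            # hypothetical values
s_ri = r_i / rate_hat
rate_epistemic = gamma(a=r_i + 1, scale=1.0 / s_ri)  # SciPy scale = 1/rate

# Coverage probability: y_j successes out of r_j injections,
# giving beta(y_j + 1, r_j - y_j + 1) (Eq. 8).
r_j, y_j = 120, 114                  # hypothetical values
coverage_epistemic = beta(a=y_j + 1, b=r_j - y_j + 1)

rng = np.random.default_rng(1)
rate_samples = rate_epistemic.rvs(size=1000, random_state=rng)
coverage_samples = coverage_epistemic.rvs(size=1000, random_state=rng)
```

Note that SciPy parameterizes the gamma by a scale, which is the reciprocal of the rate parameter s_ri used in Equation 7.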
B. Determination of the number of samples from epistemic distributions
Once the epistemic distributions of the model input parameters are determined, we draw samples
from each epistemic distribution (since we assume the parameters to be independent
random variables). However, we need to determine the number of samples to be drawn from these
distributions to obtain the minimum sample size required for the confidence interval of the output
measure, at a given confidence level. We consider the number of observations ri that would have
been used to compute the point estimate and confidence interval of each model input parameter
Θi (i = 1, 2, ..., k), as well as the sample size mo (o = 1, 2, ..., l) of the output measures ∆o, in
determining this sample size. The total number of samples to be drawn from the epistemic dis-
tribution of each model parameter is computed as n = max{r1, r2, ..., rk, m1, m2, ..., ml}. The
sample size based on the output measure(s) is considered only if a desired confidence interval for any
of the output measures is provided. If the desired width (or half-width) of the confidence interval
of any of the output measures of the model is provided, we assume the output measure to follow a
normal distribution and invert the relation between the half-width of the confidence interval and
the number of samples to obtain mo.
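The inversion under the normal assumption is m_o = ⌈(z_{α/2} σ/d)²⌉. A small sketch, where the output standard deviation, half-width and observation counts are hypothetical placeholders:

```python
import math
from scipy.stats import norm

def output_sample_size(sigma_hat, half_width, alpha=0.05):
    # Normal assumption: half_width = z_{alpha/2} * sigma / sqrt(m),
    # inverted to m = ceil( (z * sigma / half_width)^2 ).
    z = norm.ppf(1 - alpha / 2)
    return math.ceil((z * sigma_hat / half_width) ** 2)

# n = max over the input observation counts r_i and the output sizes m_o.
r_counts = [97, 116, 83]                     # hypothetical r_i values
m = output_sample_size(sigma_hat=0.02, half_width=0.005)
n = max(r_counts + [m])
```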
C. Sampling Procedure
Subsequent to the determination of the epistemic distribution of each input parameter, samples or
random deviates need to be drawn from these distributions. We employ the Latin Hypercube Sampling
(LHS) procedure in our uncertainty propagation method. LHS divides the entire probability space
into equal intervals (the number of intervals is equal to the number of samples needed) and draws
a random sample from each of the intervals. Thus it is almost guaranteed to cover the entire
probability space evenly and easily reach low-probability, high-impact regions of the epistemic
distribution of the input parameters. Once the samples are generated, random pairings without
replacement are carried out between the samples of different parameters, to ensure randomness in
the sequence of the samples. It has been shown that for a given sample size, a statistic of the model
output obtained by the LHS procedure will have a variance less than or equal to that obtained by
random sampling. In other words, at a particular sample size, LHS yields a more efficient estimator
for a statistic of the model output than random sampling, providing a more robust sampling
procedure [14, 23, 27]. The actual random deviates can be generated by any of the standard methods,
such as the inverse transform method, rejection sampling, the Box–Muller transform or Johnson's
translation [19]. A sample from the distribution of each of the k model input parameters results in
a vector of k parameter values. We use the LHS procedure to draw a total of n samples (the sample
size determined in Section III B) from each distribution. Therefore, n such vectors (v1 through vn)
will be generated, as shown in Equation 9.
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}
=
\begin{bmatrix}
\theta_{11} & \theta_{21} & \cdots & \theta_{k1} \\
\theta_{12} & \theta_{22} & \cdots & \theta_{k2} \\
\vdots & \vdots & & \vdots \\
\theta_{1n} & \theta_{2n} & \cdots & \theta_{kn}
\end{bmatrix} \qquad (9)
In the examples in this paper, for the rate parameters in the models we sample from the gamma
distribution, and for the coverage factors we sample from the beta distribution, as determined in
Section III A 2.
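The stratification, random pairing and inverse-transform steps can be sketched as follows; the gamma and beta epistemic distributions at the end are hypothetical placeholders:

```python
import numpy as np
from scipy.stats import gamma, beta

rng = np.random.default_rng(42)

def lhs_sample(dists, n):
    """Latin Hypercube Sampling: one uniform draw from each of n
    equal-probability strata per parameter, randomly permuted per column
    (random pairing without replacement), then inverse-transformed."""
    k = len(dists)
    # One uniform draw inside each of the n equal-width probability strata.
    u = (np.arange(n)[:, None] + rng.random((n, k))) / n
    # Random pairing: independently permute each parameter's column.
    for j in range(k):
        u[:, j] = rng.permutation(u[:, j])
    # Inverse-transform through each parameter's epistemic distribution.
    return np.column_stack([dists[j].ppf(u[:, j]) for j in range(k)])

# Hypothetical epistemic distributions: a gamma for a failure rate and a
# beta for a coverage probability.
dists = [gamma(a=101, scale=1e-7), beta(a=115, b=7)]
V = lhs_sample(dists, n=1000)   # n x k matrix of parameter vectors (Eq. 9)
```

Each row of `V` is one parameter vector v_j at which the analytic model is then solved.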
D. Solving the Analytic Model
The reliability model is solved at each of the n sets of sampled input parameter values obtained
from the epistemic distributions to obtain a set of values for the model output, as shown in Equation
10. The model may be solved using software packages like SHARPE [42] or, if closed-form solutions
exist, they may be evaluated programmatically by simple user programs. This method is independent
of the model solution method, and any pre-existing model solution technique can be applied. It also
does not require the model output to be a closed-form expression of the model input parameters, as
it does not need to manipulate the model output(s).
{R(t)1, R(t)2, . . . , R(t)n} = {g(v1), g(v2), . . . , g(vn)} (10)
E. Statistical Analysis of the Output
Once the set of model output values {R(t)1, R(t)2, ..., R(t)n} is obtained, to quantify the uncer-
tainty in the output due to the epistemic uncertainties in the model input parameters, we compute
the confidence interval of the output metric(s) at a desired confidence level. A non-parametric
method is used to calculate the confidence interval of the output measure, as it obviates the
need to make any assumptions about the distribution of the model output or to fit a distribution
to the model output values. An empirical Cumulative Distribution Function (CDF) [41] of the set
of output values {R(t)1, R(t)2, ..., R(t)n} is constructed. The values of the output measure at ap-
propriate percentile points of the CDF provide the confidence interval at the desired confidence
level. In our examples, we use the values of the model output corresponding to the 2.5th and 97.5th
percentiles as the limits of the 95% two-sided confidence interval. Similarly, the value corresponding
to the 5th percentile is chosen as the lower limit of the 95% upper one-sided confidence interval, and
the value corresponding to the 95th percentile is chosen as the upper limit of the 95% lower one-sided
confidence interval.
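The percentile extraction is a one-liner once the output sample is in hand; the beta-distributed sample below is only a stand-in for the reliability values:

```python
import numpy as np

rng = np.random.default_rng(7)
outputs = rng.beta(200, 2, size=1000)   # stand-in for {R(t)_1, ..., R(t)_n}

# 95% two-sided CI: values at the 2.5th and 97.5th percentiles.
two_sided = np.percentile(outputs, [2.5, 97.5])

# 95% one-sided limits: 5th percentile (lower limit of the upper one-sided
# interval) and 95th percentile (upper limit of the lower one-sided interval).
one_sided_lower = np.percentile(outputs, 5)
one_sided_upper = np.percentile(outputs, 95)
```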
IV. Illustrative Example
We apply the uncertainty propagation method discussed so far to the reliability analysis of the
NASA Remote Exploration and Experimentation (REE) system [2, 20]. The NASA Jet Propulsion
Laboratory REE Project was a large multi-year technology demonstration project to develop a
low-power, scalable, fault-tolerant, high-performance computing platform for use in space and to
demonstrate that significant on-board computing enables a new class of scientific missions. An REE
testbed was developed to test, refine, and validate the scalability of architectures and system approaches
to achieve the dependability goals.
The REE system was expected to provide continuous operation through graceful degradation
despite experiencing transient or permanent failures. It was expected to experience a small number
of radiation-induced transient component failures per day, while permanent component failures were
expected only a small number of times over several years. The system provided fault detection and
recovery mechanisms so that applications could operate in the presence of faults. The availability
and reliability of the REE system were analyzed in [5], with the help of several models, to assess its
fault-tolerance features. These models were specified and solved using the SHARPE [42] software
package.
The REE system is a collection of processing elements (referred to as nodes) connected together
by a Myrinet [1]. Each node is a commercially available computer running a commercial off-the-shelf
(COTS) UNIX-based operating system. The architecture of the REE system is shown in Figure 2.
[Fig. 2 REE System Architecture: the REE system consists of multiple nodes (each with processors,
memory, controller and Myrinet interfaces, running application, middleware and OS software layers),
I/O servers and the SCP, interconnected by the Myrinet, with the System Executive (SE) spanning
the system.]
Redundancy in hardware, software, time and data is used to achieve fault tolerance. Fault detection
and recovery are provided by software-implemented fault tolerance (SIFT) and the system executive
(SE) [8, 31-33]. The SE is responsible for local error detection and failure recovery, while the system
control processor (SCP) provides fault detection and recovery at the system level. The SCP relies
on periodic node health status messages or heartbeats from SE on each node to determine their
status. Nodes can be individually reset/restarted by either the SE or the SCP.
A service request to the system is assigned to multiple nodes to be processed in parallel by a
fault-tolerant scheduler. The same software components have different implementations across different
nodes in the system, to provide redundancy and design diversity. Results from these nodes are then
compared by the SE to detect and tolerate faults. Intermittent and transient faults [17] can be
detected and recovered from by SIFT, the SE or the SCP. However, as the REE system operates without
human intervention, components cannot be replaced after permanent failures. Hence permanent
failures require the system to be reconfigured to a degraded mode of operation and affect the long-term
behavior of the system. Figure 3 captures the fault detection and recovery scheme employed
in REE.
The hardware and software component failures can be transient or permanent in nature. The majority of transient faults are masked by the use of error-correcting codes and redundant hardware. Unmasked transient failures may freeze the processors, making the system unresponsive
Fig. 3 Flowcharts (SIFT: component recovery by retry/rollback inside a node; SE: node restart/reboot, fault isolation, and failover to spare nodes within a cluster; SCP: periodic inspection and node rejuvenation for the whole REE system)
or generate erroneous outputs. The SE relies on heartbeat messages to detect unresponsive or frozen nodes, while erroneous outputs are detected by the SIFT with the help of acceptance tests. Upon detection of either an unresponsive node or an erroneous output, the node is rebooted after its tasks have been transferred to spare nodes (the tasks are restarted from the last checkpoint). In addition, the SCP selectively reboots several nodes as a preventive action, according to their uptime and current health status. This preventive action, called rejuvenation, helps get rid of latent faults and effectively tolerates "soft" or aging-related faults such as memory leaks [12].
Transient and intermittent faults are assumed not to affect the reliability of the system, since after recovery actions the failed nodes eventually return to use. If a permanent fault is involved, the faulty node cannot be rebooted successfully. In this case, a notification is sent to the SE, the faulty node is isolated, and the number of working nodes decreases. Eventually, when the number of working nodes falls below the minimum requirement (2 in the testbed), the whole system fails.
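The minimum-working-nodes condition above is the standard k-out-of-n reliability structure (k = 2 in the testbed). As a minimal illustrative sketch (not the paper's SHARPE model), the reliability of a k-out-of-n system of i.i.d. nodes is the binomial tail probability:

```python
from math import comb

def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n i.i.d. components work,
    given per-component reliability r (binomial tail sum)."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

# Hypothetical example: a 2-out-of-4 node system with node reliability 0.95
print(k_out_of_n_reliability(2, 4, 0.95))
```

With high per-node reliability the redundant configuration is substantially more reliable than a single node, which is why the system tolerates several permanent node failures before failing as a whole.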
We use the reliability model of the REE system as an example to illustrate the uncertainty propagation method; hence we consider only the permanent faults in the system.
A. Reliability Model
A two-stage hierarchical model [42] was used in [5] to capture the failure behavior of the system and its components. Here we modify it into a three-stage hierarchical model to capture the failure behavior of the software component in greater detail. At the lowest level, a Markov model captures the failure behavior of the software subsystem. In the middle level, the failure behavior of individual components is captured using reliability block diagrams (RBD). The reliability computed from the Markov model at the lowest level provides one of the inputs to the RBD. The top-level model is a fault tree that takes the reliability of each individual component (obtained by solving the lower-level models) as an input and accounts for the interactions between the components to provide the overall system reliability.
The REE architecture uses redundancy at several levels. Each node uses redundancy in the form of spare processor chips and spare Myrinet interfaces to increase its reliability. The node subsystem is configured as a k-out-of-n system of such nodes (with k = 2 in the prototype implementation). The I/O subsystem also has redundancy in its configuration and is implemented as a parallel system with several redundant nodes. All the nodes in the node system as well as the I/O system are assumed to exhibit the same stochastic behavior regarding faults, and the faults are considered mutually independent. Since the REE system operates without human intervention, no replacement of components after a permanent failure is assumed. It is these permanent failures of components that are considered in the reliability model. Most software failures in the nodes are expected to be recovered from by retry, process restart, or node reboot. The Markov model shown in Figure 4 captures these escalated levels of recovery actions. The software component is considered to have failed only if a software failure cannot be recovered from even after node reboot. Software failures that cannot be recovered from by retry, restart, or reboot can be identified with Bohrbugs, while those that can be recovered from by these actions can be identified with Mandelbugs [11, 12].
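The escalated-recovery Markov chain of Figure 4 can be solved numerically for the software reliability. The sketch below is an illustration, not the SHARPE solution used in the paper: it assumes the state ordering (UP, UA, UR, UB, FAIL) with the rates and coverages of Table 1, and integrates the stiff Kolmogorov forward equations (recovery rates are many orders of magnitude larger than the failure rate) to obtain R_sw at t = 5 years:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Rates (per year) and coverages from Table 1; structure from Fig. 4:
# UP --lam_sw--> UA (retry); a covered retry/restart/reboot returns to UP,
# an uncovered one escalates UA -> UR -> UB -> FAIL.
lam_sw, alpha, rho, beta = 8.76, 31536000.0, 3153600.0, 52560.0
c = s = b = 0.9

# Infinitesimal generator over states (UP, UA, UR, UB, FAIL)
Q = np.array([
    [-lam_sw,    lam_sw,  0.0,             0.0,           0.0],
    [c * alpha, -alpha,  (1 - c) * alpha,  0.0,           0.0],
    [s * rho,    0.0,    -rho,            (1 - s) * rho,  0.0],
    [b * beta,   0.0,     0.0,            -beta,         (1 - b) * beta],
    [0.0,        0.0,     0.0,             0.0,           0.0],  # FAIL absorbs
])

def kolmogorov(t, p):
    """Forward equations dp/dt = p Q for the state-probability vector p."""
    return p @ Q

p0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])      # start in UP
sol = solve_ivp(kolmogorov, (0.0, 5.0), p0, method="Radau",
                rtol=1e-8, atol=1e-12)        # implicit solver for stiffness
R_sw = 1.0 - sol.y[-1, -1]                    # reliability = 1 - P(FAIL) at t = 5
print(R_sw)
```

Because each failure event is absorbed into FAIL only with probability (1 − c)(1 − s)(1 − b), the effective software failure rate is roughly lam_sw × 0.001, so R_sw(5) stays above 0.95 despite lam_sw = 8.76 per year.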
The RBD model for a node is shown in Figure 5. Figure 6 shows the higher-level fault tree used to compute the reliability of the entire system. In Figures 5 and 6, the failure rates of the various components are as follows: Myrinet (λnet), Processor (λpro), Memory (λmem), Memory Controller (λbmc), Node Controller (λndc), Non-Volatile Memory (λnvm), PCI Bus (λbus), Myrinet Network
Fig. 4 Markov model for reliability of software (states UP, UA, UR, UB, FAIL; transitions UP→UA at rate λsw, UA→UP at c·α, UA→UR at (1−c)·α, UR→UP at s·ρ, UR→UB at (1−s)·ρ, UB→UP at b·β, UB→FAIL at (1−b)·β)

Fig. 5 RBD model for node-level reliability (Software, two parallel Processors, Memory, Memory Controller, Node Controller, Non-Volatile Memory, PCI Bus, and two parallel Myrinet I/Fs in series)
I/F (λnif), and Software (λsw). The rates of failure, retry, restart, and reboot, as well as the coverage
Fig. 6 Fault tree model for REE system reliability (top event REE System Failure, with inputs from the Myrinet, the I/O servers (IOS), and an (n−1)-of-n gate over the node system)
probabilities in the Markov model in Figure 4 are as follows: software failure rate (λsw), retry rate (α), restart rate (ρ), reboot rate (β), and the coverage probabilities for retry, restart, and node reboot (c, s, and b, respectively).
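Putting the three levels together, the hierarchical computation can be sketched end to end. The code below is an illustrative approximation, not the SHARPE model of [5]: it uses the point estimates from Table 1, a placeholder software reliability (roughly the Markov-model result), a hypothetical n = 4 compute nodes and 4 I/O servers, and assumes I/O servers have the same reliability as compute nodes:

```python
import math
from math import comb

t = 5.0  # mission time (years)

# Component failure rates per year (point estimates, Table 1)
lam = dict(pro=0.01, mem=0.03, bmc=0.01, ndc=0.01,
           nvm=0.01, bus=0.02, nif=0.03, net=0.0006)
R = {k: math.exp(-v * t) for k, v in lam.items()}   # exponential TTF
R_sw = 0.957  # software reliability from the Markov model (placeholder value)

def parallel(r, m=2):
    """Reliability of m redundant copies: at least one must work."""
    return 1.0 - (1.0 - r) ** m

# Level 2: node RBD (Fig. 5) -- spare processor and spare Myrinet I/F
R_node = (R_sw * parallel(R["pro"]) * R["mem"] * R["bmc"] * R["ndc"]
          * R["nvm"] * R["bus"] * parallel(R["nif"]))

# Level 3: fault tree (Fig. 6) -- Myrinet in series with the parallel
# I/O servers and a 2-out-of-n node system (hypothetical n = 4)
n = 4
R_nodesys = sum(comb(n, i) * R_node**i * (1 - R_node)**(n - i)
                for i in range(2, n + 1))
R_sys = R["net"] * parallel(R_node, m=4) * R_nodesys
print(R_node, R_sys)
```

The solution of each lower-level model feeds the level above it, exactly mirroring the Markov → RBD → fault tree structure of Section A; the actual model uses the testbed's node counts rather than the hypothetical n = 4 here.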
B. Uncertainty Propagation through the Reliability Model
The times to failure of the individual components in the RBD model in Figure 5 are assumed to follow exponential distributions. Table 1 summarizes the point estimates of the parameters of the time-to-failure distributions of the various components in the REE system; these values were obtained from [5]. To represent the uncertainty in the model input parameters, we choose the half-width of the 95% two-sided confidence interval of the different parameters to be 15% to 33% of the point estimate.
Parameter   Value
λpro        0.01 per year
λmem        0.03 per year
λbmc        0.01 per year
λndc        0.01 per year
λnvm        0.01 per year
λbus        0.02 per year
λnif        0.03 per year
λnet        0.0006 per year
λsw         8.76 per year
α           31536000 per year
ρ           3153600 per year
β           52560 per year
c           0.9
s           0.9
b           0.9

Table 1 Point estimates of reliability model parameters
The steps in the uncertainty propagation through the reliability model of the REE system are as follows:
(a) Determining the epistemic distribution of model parameters: We first determine the number of observations that would have been used to compute the point estimate and 95% confidence interval for each model parameter, as explained in Section III A 1. The number of observations thus calculated for each model input parameter is summarized in Table 2. Since the times to failure of the components in the system have been assumed to follow exponential distributions, as explained in Section III A 2, the epistemic distributions of the parameter values are gamma distributions with the parameters summarized in Table 2.
Parameter   Num. Observations                 Epistemic Distribution
λpro        117                               gamma(118, 8.57e-5)
λmem        167                               gamma(168, 1.79e-4)
λbmc        92                                gamma(93, 1.08e-4)
λndc        117                               gamma(118, 8.54e-5)
λnvm        103                               gamma(104, 9.71e-5)
λbus        146                               gamma(147, 1.37e-4)
λnif        208                               gamma(209, 1.44e-4)
λnet        19                                gamma(20, 4.21e-5)
λsw         65                                gamma(66, 7.42)
β           81                                gamma(82, 1.54e-3)
b           57 faults injected, 51 detected   beta(52, 5)

Table 2 Epistemic distributions of parameters
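The construction of Table 2 can be sketched as follows. This is an assumption-laden illustration consistent with the table's gamma entries: r observed failures over total exposure T = r/λ̂ give an epistemic gamma distribution with shape r + 1 and scale 1/T, and r is taken as the smallest number of observations for which the chi-square-based 95% confidence interval of an exponential rate meets the desired relative half-width:

```python
from scipy import stats

def epistemic_gamma(lam_hat: float, r: int):
    """Gamma epistemic distribution (shape, scale) for an exponential
    failure rate, assuming r failures observed over exposure T = r / lam_hat."""
    T = r / lam_hat
    return r + 1, 1.0 / T

def num_observations(h: float) -> int:
    """Smallest r whose chi-square-based 95% CI for the rate has relative
    half-width <= h; the CI limits are chi2_{p, 2r} / (2T) with lam_hat = r / T."""
    for r in range(2, 10000):
        lo = stats.chi2.ppf(0.025, 2 * r) / (2 * r)   # in units of lam_hat
        hi = stats.chi2.ppf(0.975, 2 * r) / (2 * r)
        if (hi - lo) / 2.0 <= h:
            return r

shape, scale = epistemic_gamma(0.01, 117)   # lambda_pro from Tables 1 and 2
print(shape, scale)   # shape 118; scale close to Table 2's 8.57e-5
```

For example, a 15% half-width requires on the order of 170 observations, which is the magnitude of the entries in the second column of Table 2.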
(b) Determination of the number of samples from the epistemic distributions: The number of samples needed from the distribution of each input parameter is determined to be n = 209, the largest of the numbers of observations among the input parameters.
(c) Sampling Procedure: To implement the LHS procedure, we use Mathematica's built-in function RandomReal to generate random deviates from the uniform distribution in each of the intervals into which the probability space has been divided for LHS. Using those random deviates, we obtain quantiles from the gamma or beta distributions using Mathematica's Quantile function. We also use the function RandomSample to obtain random pairing, without replacement, between the samples from the different epistemic distributions.
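The same procedure can be sketched in Python, with SciPy standing in for Mathematica's RandomReal, Quantile, and RandomSample (the gamma parameters below are those of λpro in Table 2):

```python
import numpy as np
from scipy import stats

def lhs_gamma(shape: float, scale: float, n: int, rng) -> np.ndarray:
    """Latin hypercube sample of size n from a gamma distribution: one
    uniform draw per equal-probability stratum, pushed through the quantile
    (inverse-CDF) function, then permuted so that the pairing with samples
    of the other parameters is random."""
    u = (np.arange(n) + rng.uniform(size=n)) / n   # one point per stratum
    x = stats.gamma.ppf(u, a=shape, scale=scale)   # quantile transform
    return rng.permutation(x)                      # random pairing

rng = np.random.default_rng(1)
sample = lhs_gamma(118, 8.57e-5, 209, rng)         # lambda_pro, Table 2
```

Each of the 209 strata of the probability space contributes exactly one deviate, which is what gives LHS its variance-reduction property over plain random sampling.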
(d) Solving the Analytic Model: At each of the n = 209 sets of parameter values, the hierarchical model is solved using SHARPE [16].
(e) Summarizing the Output Values: The empirical cumulative distribution function (CDF) of the system reliability is constructed from the n = 209 output values obtained by solving the reliability model using SHARPE. Figure 7 shows the empirical CDF of the reliability of the REE system at time t = 5 years.
Fig. 7 CDF of Reliability of REE System at t = 5 years
The 5th percentile of the CDF provides the lower limit of the 95% upper one-sided confidence interval of reliability, which is computed to be (0.956339, 1). The upper and lower limits of the two-sided 95% confidence interval are provided by the 97.5th and 2.5th percentiles, respectively; the two-sided 95% confidence interval of reliability is computed to be (0.951727, 0.981611).
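Given the n = 209 output values, the percentile computation is direct. A sketch, with a synthetic stand-in for the SHARPE outputs:

```python
import numpy as np

def percentile_cis(outputs):
    """95% one-sided lower confidence limit and two-sided 95% confidence
    interval, read off the empirical distribution of the model outputs."""
    lower = np.percentile(outputs, 5)              # one-sided CI is (lower, 1)
    two_sided = (np.percentile(outputs, 2.5),
                 np.percentile(outputs, 97.5))
    return lower, two_sided

# Synthetic stand-in for 209 reliability samples (beta-distributed near 0.96)
rng = np.random.default_rng(7)
outputs = rng.beta(200, 8, size=209)
lower, (lo2, hi2) = percentile_cis(outputs)
```

No distribution is assumed or fitted for the output: the limits come straight from the empirical CDF, which is what makes the method applicable to models without closed-form output.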
To evaluate the robustness of the LHS procedure compared to the random sampling procedure, 20 runs of the entire uncertainty propagation method, each with 209 samples from each epistemic distribution, were performed with both the LHS and the random sampling procedure. The variance of the sample mean of the reliability at time t = 5 years, computed across the 20 runs, was used as an indicator of the robustness of the sampling procedure. This variance was 4.17e−8 with the LHS procedure, smaller than the 1.56e−7 obtained with the random sampling procedure. The empirical CDFs of the means of reliability obtained in the 20 executions by the two sampling procedures are summarized in Figure 8. It can be shown that other statistics and quantiles of the distribution of reliability also have lower variance with the LHS procedure. Thus, in this example, LHS provides the more robust sampling procedure.
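The robustness comparison can be reproduced in miniature. The sketch below is a hypothetical one-dimensional stand-in, not the REE model: it repeats 20 runs of 209 samples with each sampler on a uniform distribution and compares the variance of the run means, where stratification gives LHS a much smaller value:

```python
import numpy as np

def run_means_variance(sampler, runs=20, n=209, seed=0):
    """Variance of the per-run sample means over repeated executions;
    a smaller value indicates a more robust sampling procedure."""
    rng = np.random.default_rng(seed)
    means = [sampler(n, rng).mean() for _ in range(runs)]
    return float(np.var(means, ddof=1))

def random_sampling(n, rng):
    """Plain Monte Carlo on U(0, 1)."""
    return rng.uniform(size=n)

def lhs_sampling(n, rng):
    """LHS on U(0, 1): one draw per equal-probability stratum, permuted."""
    return rng.permutation((np.arange(n) + rng.uniform(size=n)) / n)

var_rand = run_means_variance(random_sampling)
var_lhs = run_means_variance(lhs_sampling)
print(var_rand, var_lhs)
```

For the sample mean of a uniform variate the gap is dramatic (order 1/(12n) versus 1/(12n³) per run); the roughly fourfold reduction observed for the REE reliability is smaller because the model output is a nonlinear function of many parameters.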
Fig. 8 CDF of Mean of Reliability, computed by LHS and Random Sampling Procedures
V. Summary
In this paper, a method for computing the uncertainty in model output metrics due to epistemic uncertainties in the model input parameters has been presented. The method employs Monte Carlo sampling to compute the confidence interval and distribution of the model output metric. It is a non-obtrusive method that acts as a wrapper around already existing analytic stochastic models and their solution techniques. It works for a wide range of model types and can easily be applied to complex models, as it does not require the model output to be a closed-form expression of the input parameters. While we apply the method to compute the confidence interval of the reliability of the REE system as an illustration, it can be directly applied to compute confidence intervals of other dependability, performance, and performability measures obtained by solving stochastic analytic models. Although the method entails Monte Carlo sampling of the input parameter space, no simulation is carried out; for each vector of sampled model parameter values, a numerical solution of the analytic model is performed. We also demonstrate the greater robustness of the LHS procedure over the random sampling procedure in our example. The method does not assume any distribution for the model output, nor does it fit one to it; it uses appropriate percentiles of the empirical CDF of the model output to compute its confidence interval. Methods of obtaining the distribution of the model parameters (epistemic uncertainty) from the distribution of observed system events (aleatory uncertainty) have also been discussed.
Acknowledgement
The authors would like to thank Dr. S. Dharmaraja and Ms. Vandana Khaitan of the Indian Institute of Technology, Delhi, for valuable discussions and help with the models.
This paper is submitted as part of a special section on the New Millennium Program ST-8 project.
References
[1] Myrinet overview. http://www.myri.com/myrinet/overview/.
[2] REE project overview. http://www-ree.jpl.nasa.gov/overview.html.
[3] J.T. Blake, A.L. Reibman, and K.S. Trivedi. Sensitivity analysis of reliability and performability measures for multiprocessor systems. In 1988 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems.
[4] G. Bolch, K.S. Trivedi, S. Greiner, and H. de Meer. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. J. Wiley & Sons, 2006.
[5] D. Chen, S. Dharmaraja, D-Y. Chen, L. Li, K.S. Trivedi, R.R. Some, and A.P. Nikora. Reliability and availability analysis for the JPL Remote Exploration and Experimentation system. In Proc. Int. Conf. on Dependable Systems and Networks, pages 337–344, 2002.
[6] D.W. Coit. System reliability confidence intervals for complex systems with estimated component reliability. IEEE Trans. on Reliability, 46(4):487–493, Dec 1997.
[7] R.G. Easterling. Approximate confidence limits for system reliability. Journal of the American Statistical Association, 67(337):220–222, Mar 1972.
[8] S. Garg, Y. Huang, C. Kintala, and K.S. Trivedi. Minimizing completion time of a program by checkpointing and rejuvenation. In SIGMETRICS, 1996.
[9] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition, 2003.
[10] R. German. Performance Analysis of Communication Systems: Modeling with Non-Markovian Stochastic Petri Nets. Kluwer Academic Publishers, 2000.
[11] M. Grottke, A.P. Nikora, and K.S. Trivedi. An empirical investigation of fault types in space mission system software. In 40th Annual IEEE/IFIP Intl. Conf. on Dependable Systems and Networks (DSN 2010), 2010.
[12] M. Grottke and K.S. Trivedi. Fighting bugs: Remove, retry, replicate, and rejuvenate. IEEE Computer, 40(2):107–109, 2007.
[13] B.R. Haverkort and A.M.H. Meeuwissen. Sensitivity & uncertainty analysis of Markov reward models. IEEE Trans. on Reliability, 44(1):147–154, March 1995.
[14] J.C. Helton and F.J. Davis. Latin hypercube sampling and propagation of uncertainty in analysis of complex systems. Reliability Engineering and System Safety, 81(1):23–69, 2003.
[15] J. Hillston. A Compositional Approach to Performance Modelling. Cambridge University Press, 1996.
[16] C. Hirel, R.A. Sahner, X. Zang, and K.S. Trivedi. Reliability and performability modeling using SHARPE 2000. In 11th Intl. Conf. on Computer Performance (TOOLS 2000), 2000.
[17] O. Ibe, R. Howe, and K.S. Trivedi. Approximate availability analysis of VAXcluster systems. IEEE Trans. Reliability, 38:146–152, 1989.
[18] B.W. Johnson. Design and analysis of fault-tolerant digital systems. Addison-Wesley, 1989.
[19] M.E. Johnson. Multivariate Statistical Simulation. John Wiley & Sons, New York, 1987.
[20] J.H. Lala and J.T. Sims. A dependability architecture framework for remote exploration & experimentation computers. In The 29th Int. Symp. on Fault-Tolerant Computing, 1999.
[21] G.J. Lieberman and Sheldon M. Ross. Confidence intervals for independent exponential series systems. Journal of the American Statistical Association, 66(336):837–840, Dec 1971.
[22] M.R. Lyu and V.B. Mendiratta. Software fault tolerance in a clustered architecture: techniques and reliability modeling. In 1999 IEEE Aerospace Conference.
[23] I.A. Macdonald. Comparison of sampling techniques on the performance of Monte Carlo based sensitivity analysis. In 11th Intl. IBPSA Conference, 2009.
[24] A. Madansky. Approximate confidence limits for the reliability of series and parallel systems. Technometrics, 7(4):495–503, Nov 1965.
[25] M.A. Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis. Modeling with Generalized Stochastic Petri Nets. J. Wiley & Sons, 1995.
[26] A.H. El Mawaziny and R.J. Buehler. Confidence limits for the reliability of series systems. Journal of the American Statistical Association, 62(320):1452–1459, Dec 1967.
[27] M.D. McKay, R.J. Beckman, and W.J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 42(1):55–61, 2000.
[28] J. Muppala, R. Fricks, and K.S. Trivedi. Techniques for system dependability evaluation. In Computational Probability, W. Grassman (ed.), pages 445–480. Kluwer Academic Publishers, 2000.
[29] A.V. Ramesh and K.S. Trivedi. On the sensitivity of transient solutions of Markov models. In 1993 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems.
[30] D. Rasch. Sample size determination for estimating the parameter of an exponential distribution. Akademie der Landwirtschaftswissenschaften der DDR, Forschungszentrum für Tierproduktion Dummerstorf-Rostock, 19:521–528, 1977.
[31] D.A. Rennels, D.W. Caldwell, R. Hwang, and K. Mesarina. A fault-tolerant embedded microcontroller testbed. In Pacific Rim Int. Symp. on Fault-Tolerant Systems (PRFTS '97), 1997.
[32] J.A. Rohr. STAREX self-repair routines: software recovery in the JPL-STAR computer. In The 25th Int. Symp. Fault-Tolerant Computing, 1995.
[33] J.A. Rohr. Software-implemented fault tolerance for supercomputing in space. In The 28th Int. Symp. Fault-Tolerant Computing, 1998.
[34] T.K. Sarkar. An exact lower confidence bound for the reliability of a series system where each component has an exponential time to failure. Technometrics, 13(3):535–546, Aug 1971.
[35] N. Sato and K.S. Trivedi. Stochastic modeling of composite web services for closed-form analysis of their performance and reliability bottlenecks. In 5th Intl. Conf. on Service Oriented Computing (ICSOC), 2007.
[36] D.P. Siewiorek and R.S. Swarz. Reliable Computer Systems: Design and Evaluation. AK Peters, Ltd, 1998.
[37] N.D. Singpurwalla. Reliability and Risk: A Bayesian Perspective. John Wiley & Sons, first edition, 2006.
[38] M. Stamatelatos, G. Apostolakis, H. Dezfuli, C. Everline, S. Guarro, P. Moeini, A. Mosleh, T. Paulos, and R. Youngblood. Probabilistic risk assessment procedures guide for NASA managers and practitioners. http://www.hq.nasa.gov/office/codeq/doctree/praguide.pdf, Ver. 1.1, 2002.
[39] L.A. Tomek and K.S. Trivedi. Fixed point iteration in availability modeling. In 5th Intl. GI/ITG/GMA Conf. on Fault-Tolerant Computing Systems, pages 229–240, 1991.
[40] K. Trivedi, D. Wang, D.J. Hunt, A. Rindos, W.E. Smith, and B. Vashaw. Availability modeling of SIP protocol on IBM WebSphere. In Proc. of Pacific Rim Dependability Conference, 2008.
[41] K.S. Trivedi. Probability and Statistics with Reliability, Queuing and Computer Science Applications. J. Wiley & Sons, New York, 2001.
[42] K.S. Trivedi and R. Sahner. SHARPE at the age of twenty-two. SIGMETRICS Performance Evaluation Review, 36(4):52–57, 2009.
[43] L. Yin, M.A.J. Smith, and K.S. Trivedi. Uncertainty analysis in reliability modeling. In Proc. Reliability and Maintainability Symposium, pages 229–234, 2001.