Transparent reliability model for fault-tolerant safety systems


ELSEVIER PII: S0951-8320(96)00113-5

Reliability Engineering and System Safety 55 (1997) 25-38 © 1997 Elsevier Science Limited

Printed in Northern Ireland. All rights reserved 0951-8320/97/$17.00


Lars Bodsberg & Per Hokstad The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology,

SINTEF Industrial Management, Safety and Reliability, N-7034 Trondheim, Norway

(Received 24 January 1996; accepted 4 September 1996)

A reliability model is presented which may serve as a tool for identification of cost-effective configurations and operating philosophies of computer-based process safety systems. The main merit of the model is the explicit relationship in the mathematical formulas between failure cause and the means used to improve system reliability, such as self-test, redundancy, preventive maintenance and corrective maintenance. A component failure taxonomy has been developed which allows the analyst to treat hardware failures, human failures, and software failures of automatic systems in an integrated manner. Furthermore, the taxonomy distinguishes between failures due to excessive environmental stresses and failures initiated by humans during engineering and operation. Attention has been given to developing a transparent model which provides predictions that are in good agreement with observed system performance, and which is applicable for non-experts in the field of reliability. © 1997 Elsevier Science Limited. All rights reserved.

1 INTRODUCTION

Proper analytical tools available during the design of safety systems may ensure that a proper system configuration is installed from the very beginning, thereby significantly reducing the overall system cost.

According to Bodsberg,¹ reliability assessment models presented in the literature have mainly been developed as a means to gain confidence in system reliability rather than as a means to achieve reliability. Furthermore, most models, as well as most surveys of these models, are limited to either hardware reliability, software reliability or human reliability only and do not treat hardware, software and human reliability in an integrated manner. Thus, analysis results tend to reflect theoretical reliability rather than field reliability.

The paper presents a transparent reliability model for identification of cost-effective configurations and operating philosophies of safety systems (system optimization) in early design. The safety system is evaluated both with respect to trip failures (immediately revealed) and failures on demand (dormant failures). The model, which is termed VULCAN (VULnerability CAlculatioN), is primarily a tool for non-experts in reliability. Thus, the model should contribute to enhanced use of reliability analysis in the engineering disciplines, and to bridging the gap between reliability theory and application. The main application of the model is in the early system design stage, where the general system architecture (e.g., dual 1oo2, triple redundant, etc.) is identified. Once the architecture is chosen, a more detailed (computer) model could be used to evaluate the reliability of specific alternative design configurations (e.g., tools developed by EPRI²).

The model accounts for all main failure causes as demonstrated in field experience. Attention has been given in the development of the VULCAN model to quantify the effect of the primary means for achieving reliability in a computer-based process safety system, which are: 1) Fault prevention, how to prevent faults during engineering and construction (e.g., by systematic design principles, strict construction rules and extensive component screening), 2) Fault removal, how to reduce the presence of faults during the operational phase of a system's life cycle (e.g., preventive maintenance (PM) or corrective maintenance (CM)), and 3) Fault tolerance, how to provide a service complying with the specification in spite of faults (e.g., self-test or redundancy).

The development of the VULCAN model has been based on direct observation of field performance of process safety systems. This paper does not include a formal validation of the VULCAN model. By nature, safety systems are very reliable and a long observation period, or data from a lot of installations, is required to validate the model. Field data has, however, been collected according to the VULCAN failure taxonomy to show that the overall concept of the model is plausible. More attention to the VULCAN failure taxonomy during future data recording may give the field data required for model validation.

In the following, VULCAN model nomenclature, abbreviations and notation are summarized. (Model assumptions and limitations are summarized in the Appendix.) Next, model output, input and formulae are described. It should be noted that the reliability of a safety system is quantified in two steps in the VULCAN model. First, the reliability of each subsystem is predicted by aggregating the reliability of its components. Next, the reliability of the system is predicted by aggregating the reliability of each subsystem. In the last section, the main features of the VULCAN model are summarized. Practical experience from applying this method to industrial problems is not included, as it has been published earlier.³ A more thorough discussion of test scheme optimization has also been published previously.⁴

2 MODEL DEFINITIONS

2.1 Nomenclature

System fault: Failure of a system component. The component failure is liable to lead to a system failure unless this is prevented by the use of fault-removal or fault-tolerant techniques.

System error: Simultaneous common mode failures of redundant system components due to a common cause. (For example, if both components in a redundant subsystem fail simultaneously due to a common cause, there are two system faults, but only one system error.)

Error multiplicity: The number of redundant components which fail simultaneously.

2.2 Abbreviations

RBD: Reliability block diagram
VBD: VULCAN block diagram
FTO: Fail to operate (on demand)
SO: Spurious operation
NC: Non-critical
LCC: Life cycle cost
CSU: Critical safety unavailability
STR: Spurious trip rate

2.3 Notation

The term failure refers to failure of a component whereas the term error refers to error of a (sub)system.

$\lambda_N^F, \lambda_N^S, \lambda_N^{NC}$: rate of natural aging FTO-, SO-, NC-failures
$\lambda_S^F, \lambda_S^S, \lambda_S^{NC}$: rate of stress FTO-, SO-, NC-failures
$\lambda_H^F, \lambda_H^S$: rate of intervention FTO-, SO-failures
$\lambda_I^S$: rate of input SO-failures
$\lambda_N$: total rate of natural aging failures ($= \lambda_N^F + \lambda_N^S + \lambda_N^{NC}$)
$\lambda_S$: total rate of stress failures ($= \lambda_S^F + \lambda_S^S + \lambda_S^{NC}$)
$\lambda_P$: total rate of physical failures ($= \lambda_N + \lambda_S$)
$\lambda^F$: total rate of FTO-failures ($= \lambda_N^F + \lambda_S^F + \lambda_H^F$)
$\lambda^S$: total rate of SO-failures ($= \lambda_N^S + \lambda_S^S + \lambda_H^S + \lambda_I^S$)
$\lambda_u^F$: total rate of undetected FTO-failures ($= \lambda_N^F(1 - c^F) + \lambda_S^F(1 - c^F) + \lambda_H^F$)
$\lambda_u^S$: total rate of undetected SO-failures ($= \lambda_N^S(1 - c^S) + \lambda_S^S(1 - c^S) + \lambda_H^S + \lambda_I^S$)
$\lambda^F_{(k/n)}$: rate of single (k = 1), double (k = 2), triple (k = 3), complete (k > 3) FTO-errors (rate of FTO-failures involving exactly one component (k = 1), or rate of simultaneous FTO-failures involving exactly k components due to a common cause (k = 2, 3, >3)), k ≤ n
$\lambda^F_{(n)}$: total rate of FTO-errors ($= \lambda^F_{(1/n)} + \lambda^F_{(2/n)} + \lambda^F_{(3/n)} + \lambda^F_{(>3/n)}$)
TIF: test-independent failure probability (of an input FTO-failure)
$TIE_{(k/n)}$: test-independent error probability of single (k = 1), double (k = 2), triple (k = 3), complete (k > 3) errors (test-independent failure probability of input FTO-failures involving exactly one component (k = 1), or test-independent failure probability of simultaneous input FTO-failures involving exactly k components due to a common cause (k = 2, 3, >3)), k ≤ n
$p^F_{k/n}$: error multiplicity probability of FTO-failures (fraction of FTO-errors involving FTO-failures of exactly one component (k = 1), or fraction of FTO-errors involving simultaneous FTO-failures of exactly k components due to a common cause (k = 2, 3, >3)), k ≤ n
$p^S_{k/n}$: error multiplicity probability of SO-failures (fraction of SO-errors involving SO-failures of exactly one component (k = 1), or fraction of SO-errors involving simultaneous SO-failures of exactly k components due to a common cause (k = 2, 3, >3)), k ≤ n
$f_{k/n}$: redundancy function (effect of redundancy in a k-out-of-n RBD structure)
$c^F, c^S$: coverage of self-test w.r.t. FTO-, SO-failures
R: recovery capability of self-test w.r.t. SO-failures
$\tau$: time between manual tests (manual test interval)
MTTR: mean time to repair
$CSU_{i,k/n}$: critical safety unavailability of a subsystem, i, with a k-out-of-n RBD structure, k ≤ n
$CSU_T$: critical safety unavailability of a system (considering all subsystems)
$STR_{i,k/n}$: spurious trip rate of a subsystem, i, with a k-out-of-n RBD structure, k ≤ n
$STR_T$: spurious trip rate of a system (considering all subsystems)
RR: repair rate of physical failures
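The aggregated rates defined in the notation can be illustrated with a small data structure. The sketch below is only an illustration of those sums: the class name, field names and method names are hypothetical and not part of the VULCAN model, and all rates are assumed to share one time unit.

```python
from dataclasses import dataclass


@dataclass
class ComponentRates:
    """Failure rates of one component, split by failure cause and mode.

    Hypothetical container for illustration; names are not from the paper.
    """
    aging_fto: float = 0.0   # natural aging, fail-to-operate
    aging_so: float = 0.0    # natural aging, spurious operation
    aging_nc: float = 0.0    # natural aging, non-critical
    stress_fto: float = 0.0  # stress, fail-to-operate
    stress_so: float = 0.0   # stress, spurious operation
    stress_nc: float = 0.0   # stress, non-critical
    interv_fto: float = 0.0  # intervention, fail-to-operate
    interv_so: float = 0.0   # intervention, spurious operation
    input_so: float = 0.0    # input, spurious operation

    def total_aging(self) -> float:
        return self.aging_fto + self.aging_so + self.aging_nc

    def total_stress(self) -> float:
        return self.stress_fto + self.stress_so + self.stress_nc

    def total_physical(self) -> float:
        # Physical failures are the natural aging and stress failures.
        return self.total_aging() + self.total_stress()

    def total_fto(self) -> float:
        return self.aging_fto + self.stress_fto + self.interv_fto

    def total_so(self) -> float:
        return self.aging_so + self.stress_so + self.interv_so + self.input_so

    def undetected_fto(self, coverage_fto: float) -> float:
        # Self-test coverage applies to aging and stress failures only.
        return (self.aging_fto + self.stress_fto) * (1 - coverage_fto) + self.interv_fto

    def undetected_so(self, coverage_so: float) -> float:
        return ((self.aging_so + self.stress_so) * (1 - coverage_so)
                + self.interv_so + self.input_so)
```

Keeping the nine basic rates separate and deriving the totals on demand mirrors the way the analyst is asked to group input data by failure mode and failure cause.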

3 VULCAN MODEL OUTPUT

The most important service of a safety system is: 1) to respond successfully to an abnormal operating condition (safety performance). This is the reason for installing the system in the first place. Two other services related to life cycle cost (LCC performance) are: 2) avoidance of automatic safety actions without the presence of an abnormal operating condition and 3) avoidance of the need for fault removal actions. Corresponding performance measures used in the VULCAN model are shown in Table 1.

The LCC is the total cost for the user to purchase, install, use, and maintain the system. The three main cost elements are: 1) primary investments, 2) cost for operation and maintenance, which is a function of RR, and 3) unavailability cost, which is a function of STR.

The possibility of manual activation of the safety system is not included in the analyses. Thus, CSU does not reflect the probability that operators activate the safety system manually. Furthermore, only unrevealed system faults are considered critical from a safety point of view. Thus, unavailability of the safety system during restoration and modification of faults (system in by-pass mode) does not contribute to the CSU figure. Nor does the duration of the manual test. During preventive or corrective maintenance, it will generally be known that the safety system is unavailable. This situation will not necessarily represent a safety hazard, but rather requires some additional precautions.

4 VULCAN COMPONENT FAILURE TAXONOMY

4.1 Component failure mode

Based on the three main services of the safety system (see Table 1), the following three component failure modes are identified in the VULCAN model:

1. Fail-To-Operate (FTO-failure) which means that the component is liable to respond unsuccessfully to an abnormal operating condition unless this is prevented by the use of fault-removal or fault-prevention techniques.

2. Spurious Operation (SO-failure) which means that the component is liable to initiate an automatic safety action without the presence of an abnormal operating condition, unless this is prevented by the use of fault-removal or fault-prevention techniques.

3. Non-critical (NC-failure) which means that maintenance of the component is required even though the component is not liable to lead to loss of the main service of the component (i.e., the failure will not lead to loss in safety performance nor spurious operation of the safety system).

Table 1. VULCAN model output

Safety system service | Performance measure | Definition of performance measure
To respond successfully to an abnormal operating condition | Critical Safety Unavailability (CSU) | Probability that the safety system, due to an unrevealed fault, will fail to automatically carry out a successful safety action on the occurrence of an abnormal operating condition
Avoidance of automatic safety action without the presence of an abnormal operating condition | Spurious Trip Rate (STR) | The number of nuisance activations per unit time
Avoidance of the need for fault removal actions | Repair Rate (RR) | The number of replaceable hardware modules which need corrective maintenance per unit time

4.2 Component failure cause

Failures of safety system components are categorized into 1) natural aging failure (physical degradation due to natural physical or chemical processes), 2) stress failure (physical degradation due to interaction with the physical or human environment), 3) intervention failure (change in component internal state, but no physical degradation), or 4) input failure (no change in component internal state). Natural aging, stress and intervention failures are denoted operational failures. This classification scheme and the integrated treatment of hardware, software and human failures have been described in more detail by Bodsberg.¹

For instance, a sensor may fail to respond to an abnormal operating condition due to:

• natural physical or chemical degradation processes (natural aging failure)

• inadequate protection during sand blasting (stress failure)

• sensor head covered by a blanket (intervention failure)

• inadequate sensing principle (input failure).

The first two of these four failure categories are physical failures (requiring repair or replacement of the component), and the last two are denoted functional failures (requiring modification, such as changing the location of the sensor or adopting another detection principle). Common cause failures can never be caused by natural aging and are relevant for the last three categories only.

The main principles of the VULCAN failure classification scheme correspond to other failure classification schemes presented in the literature (e.g., Villemeur⁵). The main merit of the VULCAN model is the explicit treatment of these categories in the mathematical formulae. When entering input data to the VULCAN model, the analyst must group component failure rates/probabilities with respect to failure mode and failure cause.

5 VULCAN MODEL FORMULAS FOR SUBSYSTEM

5.1 Vulcan block diagram

A special logic diagram has been developed for calculation of safety system reliability in the VULCAN model. This diagram is similar to a reliability block diagram (RBD), but the interpretation of the blocks in the diagram is different (see Fig. 3). Blocks in the RBD are related to component failure (system faults) whereas blocks in the VULCAN diagram are related to (sub)system errors. The reason for the split between system faults and system errors is to explicitly show the effect of common cause failures.

The VULCAN block diagram (VBD) is established based on the RBD for the safety system. The blocks in the VBD refer to one specific error multiplicity (i.e., single, double, triple, or complete error). Furthermore, the possible combinations of component failures contributing to a specific error multiplicity are shown inside the block. A graphical distinction is made between single and multiple (double, triple, complete) errors. This distinction is important because regulations for the use of process safety systems tend to state that no single component failure should lead to a system failure.⁶,⁷ The achievement of such a requirement is easily verified by looking at the VBD.

5.2 Model for common cause failure

The rate of errors of a subsystem made up of n redundant components may be estimated from the component failure rate/probability and a parameter termed error multiplicity probability. The relationship between subsystem error rate/probability and component failure rate/probability is termed the VULCAN error model. The main principles of the VULCAN error model correspond to the α-factor model.⁸ The parameter in the VULCAN error model is termed error multiplicity probability rather than alpha to signify that the parameter has a physical representation (fraction of errors). The error multiplicity probability is the conditional probability that exactly k out of n redundant components fail simultaneously in a common mode given that at least one component fails.

The relationship between the rate of subsystem FTO-errors, $\lambda^F_{(k/n)}$, and the rate of component FTO-failures, $\lambda^F$, in a subsystem of n components, is given by:⁹

$$\lambda^F_{(k/n)} = p^F_{k/n}\,\lambda^F_{(n)} = \frac{n\,p^F_{k/n}}{\sum_{j=1}^{n} j\,p^F_{j/n}}\,\lambda^F \qquad (1)$$

There is a similar relationship between the rate of subsystem SO-errors and component SO-failures, and also between system TIE and component TIF.⁹ A rationale for the common cause failure model used in the VULCAN model is given in the Appendix.¹⁰
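The error model of eqn (1) can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the function name and argument layout are hypothetical, and the multiplicity distribution is assumed to list one probability for each exact multiplicity 1..n (so the "complete" class is not lumped together).

```python
def error_rates(component_rate, multiplicity_probs):
    """Return the subsystem error rates [rate_(1/n), ..., rate_(n/n)].

    component_rate     -- failure rate of one component (any time unit)
    multiplicity_probs -- multiplicity_probs[k-1] is the error multiplicity
                          probability p_(k/n); the values must sum to 1
    """
    n = len(multiplicity_probs)
    if n == 0 or abs(sum(multiplicity_probs) - 1.0) > 1e-9:
        raise ValueError("multiplicity probabilities must sum to 1")
    # Mean number of component failures per subsystem error.
    mean_multiplicity = sum(j * p for j, p in enumerate(multiplicity_probs, start=1))
    # Total error rate: the n components together fail at rate n * component_rate,
    # and every error accounts for mean_multiplicity component failures.
    total_error_rate = n * component_rate / mean_multiplicity
    return [p * total_error_rate for p in multiplicity_probs]


# Usage: duplicated component, 10 % of all errors are double errors.
single_rate, double_rate = error_rates(1e-6, [0.9, 0.1])
```

The normalisation guarantees that the component failure rate is preserved: summing k times the rate of k-fold errors over all k gives back n times the component failure rate.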

5.3 CSU calculation of a subsystem

Figure 1 illustrates the rationale of the algebraic formulae for CSU calculation. At the top of the figure (part a), a subsystem is shown with a single (standard) component (duration of manual tests is ignored). The terms available/unavailable in Fig. 1 apply only to operational failures of a component. Thus, input FTO-failures of a component are not considered in this figure. This failure category is, however, illustrated in Fig. 2 which is described in the next section.

The component/subsystem in part a serves as the reference in the following discussion of the effect of fault-removal, fault-tolerant and fault-prevention techniques on the CSU of the subsystem (parts b-e of Fig. 1).

Fig. 2. Effect of manual test on CSU. (Time-dependent and average critical safety unavailability plotted against time, with manual test interval τ; the part revealed in test contributes λτ/2, the part unrevealed in test contributes TIF.)

5.3.1 Fault removal

Most of the attention has been paid to manual tests when quantifying the effect of fault removal in the VULCAN model. It is assumed that components are as good as new when a detected fault is restored (i.e., having the same constant rates of failure as prior to the occurrence of the failure). System downtime during restoration and modification is ignored when calculating CSU (MTTR = 0).

Fig. 1. Effect of fault avoidance and fault tolerance on CSU. (Panels: a. subsystem with standard component; b. shorter test intervals; c. self-test; d. redundancy (1oo2 voting); e. fault prevention. Each panel shows the subsystem state with respect to safety performance, available or unavailable, against time, with manual tests and repairs marked.)

The two aspects of manual tests the analyst has to consider when determining input data to the VULCAN model are: 1) Coverage of the manual test and 2) Manual test interval.

5.3.1.1 Coverage of manual test. The term coverage refers to the fraction of component failures which is revealed during the test. A high coverage means that the majority of all failures are revealed during the test.

If a manual test is used on a safety system, the unavailability will vary with time. If all faults are revealed during a manual test, the unavailability will be zero just after the test. It will then increase continuously until the next test. This is illustrated in Fig. 2, which shows the time dependent and average unavailability of one component, ignoring the time to repair, the duration of the manual test, and the duration of self-test.
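The sawtooth behaviour described above can be checked numerically. The sketch below assumes an exponentially distributed time to failure with a constant rate, a manual test that reveals every fault, and zero repair and test duration; the function names are hypothetical. It compares the exact time-averaged unavailability over one test interval with the common approximation of failure rate times half the test interval.

```python
import math


def unavailability(rate, t):
    """Probability that the component has failed by time t after a test."""
    return -math.expm1(-rate * t)  # equals 1 - exp(-rate * t)


def average_unavailability(rate, tau):
    """Exact time average of unavailability(rate, t) over 0 <= t <= tau."""
    x = rate * tau
    return 1.0 + math.expm1(-x) / x  # equals 1 - (1 - exp(-x)) / x


rate = 1e-6   # failures per hour (illustrative value)
tau = 8760.0  # manual test interval in hours (one year)
exact = average_unavailability(rate, tau)
approx = rate * tau / 2
print(f"exact average: {exact:.6e}, approximation: {approx:.6e}")
```

For small values of rate times test interval the two numbers agree closely, and the approximation is slightly on the conservative (high) side.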

A manual test reveals only component failures activated since the last test. Dormant component failures which are present in the component on delivery and which are not revealed during the very first manual test are not likely to be revealed during subsequent (identical) tests either. Thus, the manual test is only effective with respect to FTO-failures in which there is a change in component state (operational failures). In contrast, component input failures are only activated upon a real demand. Input failures are not revealed during a manual test because the conditions which activate the input FTO-failure are different from the test conditions.

Figure 2 indicates that input FTO-failures, which are not revealed during a manual test, result in a constant contribution to CSU, denoted Test-independent failure probability (TIF). The value of CSU immediately after completion of a manual test of the component equals TIF. The contributions to TIF could be, e.g., wrong location of sensor, lack of selectivity (wrong detection principle) of sensor, or the possibility that the test is carried out erroneously. In short, the TIF probability accounts for the fact that the functional test is not perfect, both because the test procedure will not in every detail simulate a true demand (e.g., detecting improper location) and because the test occasionally may be carried out incorrectly.

To quantify the contribution to CSU from operational failures, an approximate formula is used, stating that the average asymptotic unavailability equals the failure rate multiplied by 1/2 of the manual test interval.¹¹ Considering all four failure categories, the overall effect of manual test on CSU of a single component i (i.e., voting 1oo1) is given by (disregarding the effect of self-test and redundancy):

$$CSU_{i,1/1} = (\lambda_N^F + \lambda_S^F + \lambda_H^F)\,\frac{\tau}{2} + TIF = \lambda^F\,\frac{\tau}{2} + TIF \qquad (2)$$

Theoretically, the contribution to CSU from intervention failures which are introduced during a manual test of a component, equals the failure rate multiplied by the manual test interval. However, the factor 1/2 is also used for intervention failures in eqn (2) because it is assumed that most intervention failures occur upon test or maintenance of other components. The definition of the VULCAN failure categories gives a very direct relationship between coverage of manual test and CSU. In the VULCAN model, the effect on safety by improving the coverage of the manual test (e.g., the use of more representative test gas for gas detectors) is quantified by reducing the TIF in eqn (2).
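Eqn (2) is simple enough to evaluate directly. The following sketch uses a hypothetical function name and illustrative input values (none of them taken from the paper) to show how the manual test interval and TIF enter the result.

```python
def csu_manual_test(aging_fto, stress_fto, interv_fto, tau, tif):
    """CSU of a single (1oo1) component with manual test only, as in eqn (2).

    aging_fto, stress_fto, interv_fto -- FTO-failure rates per hour
    tau                               -- manual test interval in hours
    tif                               -- test-independent failure probability
    """
    total_fto_rate = aging_fto + stress_fto + interv_fto
    return total_fto_rate * tau / 2 + tif


# Illustrative values only: test every six months, then every three months.
six_months = csu_manual_test(1e-6, 2e-6, 0.5e-6, tau=4380.0, tif=1e-4)
three_months = csu_manual_test(1e-6, 2e-6, 0.5e-6, tau=2190.0, tif=1e-4)
print(six_months, three_months)
```

Halving the test interval halves only the operational contribution; the TIF term is unaffected, which is why CSU is proportional to the test interval only when TIF is low.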

5.3.1.2 Manual test interval. Manual test interval is of particular importance for field devices like sensors and actuators. Figure 1 (part b) illustrates the effect on CSU by testing a standard component twice as often (shorter manual test intervals). The figure indicates that manual test interval has effect on subsystem downtime, but it has no effect on time to component FTO-failure, assuming constant failure rate. According to eqn (2), CSU is approximately proportional to the manual test interval when TIF is low. In the VULCAN model, the effect on safety by varying the manual test interval is quantified by modifying the numerical value of the test interval in eqn (2).

The fact that CSU is reduced as the manual test interval is reduced, should imply that the test interval should be chosen as short as possible to increase safety. However, frequent testing may increase the number of faults introduced during test. This negative effect is not automatically handled in the VULCAN model. If the analyst believes that reduced test interval significantly increases the failure rate of a component, he should not just alter the manual test interval, but also increase the rate of stress and human intervention failures when entering model input data.

Shorter manual test intervals will increase cost for operation and maintenance. This effect should be calculated in the LCC model.

5.3.2 Fault tolerance

The main elements of fault tolerance considered in the VULCAN model are self-test and redundancy.

5.3.2.1 Self-test. Self-test is applicable to field devices as well as logic control units. The objective of the self-tests is to detect automatically component failures before they have any effect on system service. The overall effect of self-test on CSU is similar to the effect of manual test; a component failure does not contribute to CSU after it is detected by the self-test.

Figure 1 (part c) illustrates the improvement in CSU by adding self-test to the standard component. In the standard component, CSU is dominated by the probability that a 'real' demand occurs before the component failure is revealed during manual test. If the self-test is continuous (e.g., loop monitoring) or the self-test interval is short (e.g., less than 4 hours), the probability that an abnormal operating condition will occur before the self-test is activated, is very low. Thus, failures detected by periodic loop testing are ignored in CSU formulae. As component failures which are detected by the self-test do not contribute significantly to the CSU, the total CSU becomes very much dependent on the rate/probability of component failures undetected by the self-test. In Fig. 1, the third failure is not detected by self-test. For instance, a gas detector filter which is covered by mud may prevent gas from reaching the sensor element. This stress FTO-failure will not be detected by the self-test.

The parameter $c^F$, termed coverage of self-test with respect to component FTO-failure, is introduced in the VULCAN model to quantify the effect of self-test on CSU. It is defined as the fraction of the component FTO-failure rate which is detected by self-test. The self-test coverage relates to the combined effect of continuous monitoring and periodic loop testing. The effect of self-test on CSU of a single component is given by:

$$CSU_{i,1/1} = \{(\lambda_N^F + \lambda_S^F)(1 - c^F) + \lambda_H^F\}\,\frac{\tau}{2} + TIF = \lambda_u^F\,\frac{\tau}{2} + TIF \qquad (3)$$

The underlying assumption in the VULCAN model is that self-test techniques are only effective with respect to physical failures such as natural aging and stress failures. However, the VULCAN formulae may easily be extended to include the effect of self-test on intervention failures as well, when such technology becomes available.
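A small numerical sketch of eqn (3) makes the role of the self-test coverage visible. The function name and the input values are hypothetical; as stated above, the coverage is applied to natural aging and stress failures only.

```python
def csu_self_test(aging_fto, stress_fto, interv_fto, coverage, tau, tif):
    """CSU of a single (1oo1) component with self-test and manual test, eqn (3)."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie between 0 and 1")
    # Only physical (aging and stress) failures can be detected by self-test.
    undetected_rate = (aging_fto + stress_fto) * (1.0 - coverage) + interv_fto
    return undetected_rate * tau / 2 + tif


# Illustrative values only: the same component without and with self-test.
no_self_test = csu_self_test(1e-6, 2e-6, 0.5e-6, coverage=0.0, tau=4380.0, tif=1e-4)
full_coverage = csu_self_test(1e-6, 2e-6, 0.5e-6, coverage=1.0, tau=4380.0, tif=1e-4)
print(no_self_test, full_coverage)
```

With zero coverage the result reduces to eqn (2). Even with full coverage the intervention failures and TIF remain, so CSU does not go to zero.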

If the analyst has strong reasons to believe that there is a significant difference in the self-test coverage for natural aging and stress failures, he must use an average number and subsequently carry out uncertainty analyses in which he varies the self-test coverage.


Self-test involves the classic dilemma of 'who shall watch over the guardians?' All self-tests must be monitored, including the capability of the CPU to transmit a signal to the control room upon the detection of a component failure. Self-test hardware should be included as separate components in the principle electric circuit diagram if the self-test hardware is part of the main signal path from sensor to actuator. If self-test hardware is not included as separate components in the principle electric circuit diagram, the self-test coverage should incorporate the probability that the test equipment is in operation; i.e., self-test coverage should be multiplied by the probability that self-test hardware is in operation.
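The correction described above amounts to a single multiplication. The sketch below uses a hypothetical function name and illustrative numbers to show how a nominal coverage is reduced when the self-test hardware itself is not always in operation.

```python
def effective_coverage(nominal_coverage, p_self_test_operational):
    """Self-test coverage corrected for the availability of the test hardware."""
    for value in (nominal_coverage, p_self_test_operational):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must lie between 0 and 1")
    return nominal_coverage * p_self_test_operational


# Illustrative values only: 90 % nominal coverage, test hardware up 95 % of the time.
coverage = effective_coverage(0.90, 0.95)
print(coverage)
```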

It should be stressed that self-tests do not detect input failures such as design weaknesses or software failures. In particular, software failures are relevant for the logic control unit. The effect of increasing the coverage of self-tests above a 'reasonable' level is then very limited. In fact, due to increased complexity, it is not certain that a higher self-test coverage actually gives a better safety performance. The basic failure rates may be modified by a complexity factor to allow the analyst to evaluate this negative effect of implementing high coverage. Thus, if self-test is added to a standard component, the analyst may modify the standard failure rate (complexity factor) before he specifies the corresponding self-test coverage.

5.3.2.2 Redundancy. The effect of employing redundancy with 1oo2 voting is illustrated in Fig. 1 (part d). A duplex configuration is shown consisting of two components of the same type, one being the standard component. The duplex configuration fails to deliver specified service if both components are unavailable simultaneously. Considering natural aging failures only, the probability that both components are unavailable at the same time is very low. The main problem from a safety point of view is multiple system errors leaving both components unavailable at the same time (shaded area in Fig. 1). If 2oo2 voting logic is used instead of the 1oo2 voting logic in Fig. 1, safety performance will be lower because any component FTO-failure is liable to lead to a system failure.
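The difference between the voting logics can be made explicit with the rule, used later in this section, that a k-out-of-n RBD structure is liable to fail for error multiplicities larger than n - k. The helper below is a hypothetical illustration of that rule only.

```python
def failing_multiplicities(k, n):
    """Error multiplicities that are liable to fail a k-out-of-n structure.

    k of the n components must function, so the structure fails when more
    than n - k components are in a failed state simultaneously.
    """
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    return list(range(n - k + 1, n + 1))


print(failing_multiplicities(1, 2))  # 1oo2: only double errors
print(failing_multiplicities(2, 2))  # 2oo2: single and double errors
print(failing_multiplicities(2, 3))  # 2oo3: double and triple errors
```

For 1oo2 voting only a double error is critical, whereas for 2oo2 voting every single error already appears in the list, in line with the lower safety performance noted above.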

The algebraic formula for the combined effect of fault-prevention, fault-removal and fault-tolerant techniques on CSU is derived in four steps. These steps are described in the following. Most attention is given to the two last steps because the two first steps have been described above. The steps in the development of the CSU formula from an RBD structure are illustrated in Fig. 4. A parallel structure of two components (A and B) is used as an example.

1. Consider effect of tests (self-test and manual test) on subsystem CSU by first decomposing each block in the RBD structure (e.g., failure of component A) into the four disjoint failure categories: natural aging ($\lambda_N$), stress ($\lambda_S$), human intervention ($\lambda_H$) and input failure ($\lambda_I$). The contribution to CSU from each block (see eqn (3)) is written below each block in the RBD structure.

2. Aggregate component failure categories of each component into 1) operational FTO-failure and 2) input FTO-failure. Operational failures which are not detected by self-test are denoted undetected faults, $\lambda_u^F$ (see eqn (3)). Concerning input FTO-failure, the contribution to CSU refers to a probability (TIF) rather than a failure rate.

3. Consider effect of redundancy on CSU by developing a detailed VULCAN block diagram (see Fig. 3) which shows the error combinations which are liable to lead to subsystem failure (see Fig. 4). The possible combinations are deduced from the RBD structure using standard reliability theory.¹¹ For instance, the 1oo2 RBD structure in Fig. 4 is liable to lead to system failure if both components fail simultaneously due to a common cause (double system error). For operational failures there is also the possibility that two independent component failures (single system errors) occur in the time interval between two manual tests.

Note that the possibility that an FTO-failure occurs in one component during repair of another component (on-line maintenance) is ignored in Fig. 4. This approximation is applicable when MTTR is much shorter than the manual test interval.¹¹ Thus, allowing for degraded operation has a rather insignificant effect on CSU provided that the fraction of time that the system is in a degraded operating status is short (repair time is limited to a few hours) and the production is maintained with a 'reasonable' voting logic for the period of degraded operation. The contribution to CSU from the double errors and the two joint single errors is listed below the VULCAN blocks. Note that CSU must first be presented in terms of subsystem error rates, λ, using standard reliability theory.¹¹ Next, CSU is presented in terms of component failure rate, TIF and the error multiplicity probability using eqn (1). In the formulae in Fig. 4 it is assumed that the same error multiplicity probability applies to physical as well as functional component failures, including input failures. If the analyst, for instance, has strong reasons to believe that there is a significant difference in the error multiplicity probability for operational failures and input failures, he may decide to use different multiplicity distributions for the two failure categories. Keeping the contributions from the two failure categories separate, more detailed formulae may be derived. However, to reduce the number of parameters in the VULCAN model, it is recommended to use only one average distribution weighted by the probability of occurrence of each failure category.

Fig. 3. Notation in VULCAN block diagrams. The block types shown are:
Single system error involving failure of only component A
Single system error involving failure of component A or B
Double system error involving simultaneous failure of component A and B due to a common cause
Double system error involving simultaneous failure of component A and B, A and C, or B and C due to a common cause
Triple system error involving simultaneous failure of component A, B and C due to a common cause
Complete system error involving simultaneous failure of more than three components due to a common cause

Fig. 4. Derivation of CSU formulae from a 1oo2 RBD structure. (Reliability block diagram: 1. Consider effect of tests; 2. Aggregate component failure categories. VULCAN block diagram: 3. Consider effect of redundancy; 4. Aggregate, disregarding single error combination.)

4. Aggregate failure categories, disregarding single error combinations in parallel structures (simplified VBD). The purpose of deleting any combinations of single errors in a parallel structure is to enhance transparency and ease the use of work sheets. For most practical applications, combinations of single errors in a parallel structure will not contribute significantly to overall CSU. This is seen by inserting typical values for operational failure rates in the exact formula listed in part 3 of Fig. 4. Assuming that $\lambda^F = 10^{-6}$, $p_{1/2} = 0.99$ and $\tau = 1$ year, the contribution to CSU from two single errors occurring in the time interval between two manual tests is about 7% of the contribution from the double error.

Disregarding joint single error combinations in a parallel structure, a general formula for the CSU of a k-out-of-n reliability structure may be obtained which also considers the use of redundancy. By definition, a k-out-of-n RBD structure is liable to lead to a system failure if there are error multiplicities larger than n - k. Thus, CSU may be calculated by

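$$\mathrm{CSU}_{i,k/n} = \sum_{j=n-k+1}^{n} \left( \Lambda^{F}_{j/n}\,\frac{\tau}{2} + \mathrm{TIF}_{j/n} \right). \tag{4}$$

Expressing subsystem error rates in terms of component failure rates we get (see eqn (1)):

$$\mathrm{CSU}_{i,k/n} = \sum_{j=n-k+1}^{n} \left( \frac{n\,p_{j/n}}{\sum_{l=1}^{n} l\,p_{l/n}}\,\lambda^{F}\,\frac{\tau}{2} + \frac{n\,p_{j/n}}{\sum_{l=1}^{n} l\,p_{l/n}}\,\mathrm{TIF} \right). \tag{5}$$

Thus we get a very transparent expression for quantifying the effect of redundancy by inserting the CSU of a single component:

$$\mathrm{CSU}_{i,k/n} = f_{k/n}\,\mathrm{CSU}_{i,1/1} \tag{6}$$

where the effect of redundancy is given by:

$$f_{k/n} = \frac{n \sum_{j=n-k+1}^{n} p_{j/n}}{\sum_{j=1}^{n} j\,p_{j/n}}. \tag{7}$$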
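As a quick check on eqn (7), the redundancy factor can be evaluated by hand for a two-component subsystem. The values $p_{1/2} = 0.99$ and $p_{2/2} = 0.01$ are illustrative only and are not recommended input data:

$$f_{1/2} = \frac{2\,p_{2/2}}{p_{1/2} + 2\,p_{2/2}} = \frac{0.02}{1.01} \approx 0.0198, \qquad f_{2/2} = \frac{2\,(p_{1/2} + p_{2/2})}{p_{1/2} + 2\,p_{2/2}} = \frac{2}{1.01} \approx 1.98.$$

Under these assumed fractions a 1oo2 RBD structure has roughly 2% of the CSU of a single component, whereas a 2oo2 (series) structure has roughly twice the single-component CSU.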

5.3.3 Fault prevention

The effect of fault prevention on safety performance is illustrated in Fig. 1 (part e). The use of high reliability components instead of standard components reduces component CSU because the number of component FTO-failures ($\lambda^{F}$ and TIF) is lower. However, the period of time each FTO-failure is unrevealed is not reduced as long as the manual test interval remains the same.

In the VULCAN model the effect of fault prevention on CSU is quantified by adjusting the numerical value of the component failure rate/TIF in eqn (8). The effect of modification may be treated in a similar manner.

Thus, we have derived a transparent expression for quantifying safety performance, CSU; considering the effect of manual test, self-test, redundancy, and fault prevention, and considering all main failure categories occurring in field operation. When expressed in terms of all basic model parameters/variables, subsystem CSU is given by:

$$\mathrm{CSU}_{i,k/n} = \frac{n \sum_{j=n-k+1}^{n} p_{j/n}}{\sum_{j=1}^{n} j\,p_{j/n}} \times \mathrm{CSU}_{i,1/1}. \tag{8}$$

Note that this formula only accounts for unavailability when it is not known that the system is unavailable (critical safety unavailability). In addition, the safety system may be unavailable due to detected failures. This is a less critical unavailability as precautions can then be taken, e.g., to shut down the production. The contribution of the non-critical unavailability is thus highly dependent on the operational philosophy; see also the comment after eqn (11) at the end of Section 5.4.2.
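The subsystem CSU calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `lambda_f` is taken to be the component FTO-failure rate that remains undetected by self-test, and the single-component CSU is taken as eqn (5) with n = k = 1.

```python
from typing import Sequence


def redundancy_factor(p: Sequence[float], k: int) -> float:
    """Redundancy factor f_{k/n} of eqn (7).

    p[j-1] holds the error multiplicity probability p_{j/n}, j = 1..n.
    """
    n = len(p)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Error multiplicities larger than n - k fail a k-out-of-n RBD structure.
    failing = sum(p[j - 1] for j in range(n - k + 1, n + 1))
    weighted = sum(j * p[j - 1] for j in range(1, n + 1))
    return n * failing / weighted


def csu_single(lambda_f: float, tau: float, tif: float) -> float:
    """Single-component CSU (assumed form: eqn (5) with n = k = 1)."""
    return lambda_f * tau / 2.0 + tif


def csu_subsystem(p: Sequence[float], k: int,
                  lambda_f: float, tau: float, tif: float) -> float:
    """Subsystem CSU as redundancy factor times single-component CSU, eqn (6)."""
    return redundancy_factor(p, k) * csu_single(lambda_f, tau, tif)


if __name__ == "__main__":
    # Illustrative inputs only: rate per hour, test interval in hours.
    print(csu_subsystem([0.99, 0.01], k=1, lambda_f=1e-6, tau=4380.0, tif=1e-4))
```

With these illustrative inputs the single-component CSU is 2.29e-3 and the 1oo2 subsystem CSU is about 4.5e-5. Keeping the redundancy factor separate from the single-component term mirrors the structure of eqn (6), so the effect of changing the voting logic can be read off directly.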

5.4 STR calculation of a subsystem

The main steps in the development of the STR formula from an RBD structure are very similar to the main steps in the development of the CSU formula. In the following we will focus on the main differences between the STR and CSU formulae. Details are given by Bodsberg. 9

5.4.1 Fault removal

Manual test of safety systems is only effective with respect to STR provided that there is a continuous physical degradation of system components (e.g., a steady drift in the output of an analog sensor). Any sudden change in system output which exceeds predefined set-points for automatic shutdown action will immediately initiate a shutdown action. In the derivation of algebraic formulas for STR, the effect of manual test on STR is ignored.

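It is well known that the system failure rate of a series RBD structure is obtained by adding the failure rates of the individual blocks. 11 Thus, considering all four failure categories, the STR of component i, 1oo1 voting, is given by:

$$\mathrm{STR}_{i,1/1} = \lambda^{S}_{\mathrm{nat}} + \lambda^{S}_{\mathrm{stress}} + \lambda^{S}_{\mathrm{int}} + \lambda^{S}_{\mathrm{inp}} = \lambda^{S} \tag{9}$$

where the four terms are the SO-failure rates of the natural aging, stress, intervention and input failure categories.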

5.4.2 Fault tolerance

5.4.2.1 Self-test. To prevent the generation of a system SO-failure from a component SO-failure in a subsystem with a single component, the self-test must not only detect the component SO-failure, but also initiate recovery actions so that the main output is unaltered (i.e., no trip signal must be provided to the next component in the signal path). This means that the self-test must continuously monitor the component output signal and immediately initiate a test of the component upon a trip signal to determine whether the output signal is generated by an abnormal operating condition or by a component SO-failure. If the self-test resolves that the trip signal is generated by a component SO-failure, the output signal should be unaltered and a fault condition signal should be transmitted to the operator station. Periodic loop test is not effective with respect to component SO-failures. The periodic loop test may also detect the component SO-failure, but detecting a component SO-failure after a spurious activation of the safety system is considered to be of limited use. (It does, however, help maintenance engineers during fault identification, but this effect is ignored in the VULCAN model.)

The fact that recovery actions may be required to prevent an active SO-fault from generating a system SO-failure requires a two-step approach in the modelling of the effect of self-test on STR. First, the parameter $c^{S}$ is introduced, which is the coverage of the self-test with respect to component SO-failures. The parameter is defined as the fraction of the component SO-failure rate which is detected by the self-test. Next, the indicator variable R is introduced, which characterizes the recovery capability of the component. The variable R is dependent on the degree of redundancy/voting logic and the operating philosophy of the component(s). The following notation is used:

R = 1, the self-test provides the component with the capability of recovering from SO-failures detected by self-test

R = 0, the self-test does not provide the component with the capability of recovering from SO-failures detected by self-test.

By adding the contribution to STR from each component failure category, we get the following formula for the STR of one specific component:

$$\mathrm{STR}_{i,1/1} = (\lambda^{S}_{\mathrm{nat}} + \lambda^{S}_{\mathrm{stress}})(1 - c^{S})^{R} + \lambda^{S}_{\mathrm{int}} + \lambda^{S}_{\mathrm{inp}} = \lambda^{S}_{U}. \tag{10}$$


To maintain consistent notation between SO- and FTO-failures, an unrecovered failure is denoted an unrevealed error because it is assumed that manual test has no effect on SO-failures.

The analyst should note that self-test may reduce CSU as well as STR. (Keep in mind that the manual test is mainly effective with respect to CSU only.) There is generally different self-test coverage for FTO- and SO-failures because one specific self-test mechanism is only effective with respect to one of the two failure modes. Thus, to get a self-test coverage close to 1 for both FTO- and SO-failures, several self-test mechanisms must normally be used.
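The two-step treatment of self-test in eqn (10) (coverage, then recovery capability) can be illustrated with a short Python sketch. The function and argument names are hypothetical. The sketch assumes that the coverage factor acts on the natural aging and stress terms only, which is a reading of eqn (10) supported by Table 2 but should be checked against the analyst's own model.

```python
def str_single(lam_nat: float, lam_stress: float,
               lam_int: float, lam_inp: float,
               c_s: float, recovers: bool) -> float:
    """Spurious trip rate of one component, following the form of eqn (10).

    lam_*    : SO-failure rates per failure category (same time unit).
    c_s      : self-test coverage for SO-failures, between 0 and 1.
    recovers : True for R = 1 (recovery from detected SO-failures),
               False for R = 0.
    """
    if not 0.0 <= c_s <= 1.0:
        raise ValueError("coverage must lie between 0 and 1")
    r = 1 if recovers else 0
    # Assumption: self-test only acts on natural aging and stress failures.
    covered = (lam_nat + lam_stress) * (1.0 - c_s) ** r
    return covered + lam_int + lam_inp


if __name__ == "__main__":
    rates = dict(lam_nat=1e-6, lam_stress=2e-6, lam_int=3e-6, lam_inp=4e-6)
    print(str_single(**rates, c_s=0.9, recovers=True))   # coverage reduces STR
    print(str_single(**rates, c_s=0.9, recovers=False))  # detection alone does not
```

With R = 0 the expression reduces to the plain sum of eqn (9): detection without recovery does not prevent the trip.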

5.4.2.2 Redundancy. Standard reliability theory may be used to identify the system error combinations which are liable to lead to system SO-failure of a k-out-of-n RBD structure. Disregarding joint single error combinations in a parallel structure, a general formula for the STR of a k-out-of-n RBD structure may be obtained in a manner similar to the derivation of the CSU formula. When expressed in terms of basic model parameters/variables, STR is given by:

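$$\mathrm{STR}_{i,k/n} = \frac{n \sum_{j=n-k+1}^{n} p_{j/n}}{\sum_{j=1}^{n} j\,p_{j/n}} \times \left[ (\lambda^{S}_{\mathrm{nat}} + \lambda^{S}_{\mathrm{stress}})(1 - c^{S})^{R} + \lambda^{S}_{\mathrm{int}} + \lambda^{S}_{\mathrm{inp}} \right]. \tag{11}$$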

Note that this formula only accounts for the spurious trips which occur automatically upon failure of equipment, according to the voting logic. In addition, there will be trips when failures (either FTO or SO) are detected and the operational philosophy requires a shutdown to be carried out to do the repair. For redundant systems, these intentional shutdowns are often required only when there is a common cause failure affecting all redundant components (otherwise on-line repair is carried out).

5.4.3 Fault prevention

The effect of fault prevention on STR is similar to the effect of fault prevention on CSU. The use of highly reliable components instead of standard components reduces the component failure rate and consequently system STR. The effect of fault prevention on STR is quantified by modifying the numerical value of the component failure rate in eqn (11). The effect of modification may be treated in a similar manner.

6 VULCAN MODEL FORMULAE FOR A SYSTEM

The CSU and STR of a safety shutdown system are calculated by adding the CSU and STR, respectively, of the VBD of each subsystem. Thus, we have for small $\mathrm{CSU}_{i}$

$$\mathrm{CSU}_{T} = \sum_{i=1}^{J} \mathrm{CSU}_{i} \tag{12}$$

$$\mathrm{STR}_{T} = \sum_{i=1}^{J} \mathrm{STR}_{i}. \tag{13}$$

The adding of CSU is a first order approximation, valid for low probabilities (CSUi < 0.05). Considering the uncertainty of the various input parameters, the approximate formulae should provide adequate accuracy for most practical applications of the VULCAN model (irrespective of system size).
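The system-level sums of eqns (12) and (13) and the stated validity range can be sketched as follows. The function names are hypothetical. The "exact" comparison assumes statistically independent subsystems in series, which is an assumption made here for illustration and is not taken from the text.

```python
import math
from typing import Sequence


def system_csu(subsystem_csu: Sequence[float], limit: float = 0.05) -> float:
    """First-order system CSU, eqn (12); rejects inputs outside CSU_i < limit."""
    for value in subsystem_csu:
        if not 0.0 <= value < limit:
            raise ValueError("subsystem CSU outside the first-order range")
    return sum(subsystem_csu)


def system_csu_independent(subsystem_csu: Sequence[float]) -> float:
    """Series-system unavailability, assuming independent subsystems."""
    return 1.0 - math.prod(1.0 - c for c in subsystem_csu)


def system_str(subsystem_str: Sequence[float]) -> float:
    """System spurious trip rate, eqn (13)."""
    return sum(subsystem_str)


if __name__ == "__main__":
    csu = [0.01, 0.02, 0.005]  # illustrative subsystem values
    print(system_csu(csu), system_csu_independent(csu))
```

For the illustrative values the first-order sum is 0.035 against roughly 0.0347 under the independence assumption, so the sum is slightly conservative and the gap grows as the individual CSU values approach the 0.05 limit.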

The quantification of RR is straightforward and the algebraic formula is presented for the overall safety system without any detailed derivation:

$$\mathrm{RR} = \sum_{\text{all components}} \lambda_{P}. \tag{14}$$

6.1 Input data estimation

When the VULCAN model is used as a tool for identification of cost-effective configurations and operating philosophies (system optimization), any of the model parameters such as failure rate, TIF, error multiplicity probability, or self-test coverage may be looked upon as control variables.

When making predictions during the design phase, model input must be determined based upon experience from old systems using data handbooks and expert judgement. Attention has been given in the VULCAN model to relate model input to the type of reliability data which is available in the information systems presently used in the process industry.

6.2 System performance quantification

Calculation of system reliability may easily be carried out using eqn (8) and eqns (11)-(14).

When the VULCAN model is used for predicting the effect of various means for achieving reliability in safety shutdown systems, the general results will be as indicated in Table 2. According to this table, we cannot conclude on preferred means for improving reliability unless we know the dominating failure causes because some means for achieving reliability are only effective with respect to specific failure categories.

For instance, redundancy is very effective against natural aging failures. The technique is not that effective against operational failures due to excessive environmental stresses and human-made mistakes during design and operation; failures which often involve more than one component (dependent failures). Thus, if it is known that inadequate design is the limiting factor from a reliability point of view, adding a redundant hardware or software component is of little use. The overall effect of redundancy may be improved by the use of diversity and segregation. The use of diverse redundancy reduces the probability of simultaneous failure of both components. However, the problem of failures due to improper specifications is not necessarily resolved by using diversity. Furthermore, diversity might give more problems during manual test, restoration and modifications.

Table 2. Effect of means used to achieve reliability (CSU & STR)

| Failure category | Fault prevention | Fault removal: Test | Fault removal: Restoration | Fault removal: Modification | Fault tolerance: Redundancy | Fault tolerance: Self-test |
| Natural | Major | Maj (CSU)/Min (STR) | Major | Major | Major | Major |
| Stress | Major | Maj (CSU)/Min (STR) | Major | Major | Moderate | Major |
| Intervention | Major | Maj (CSU)/Min (STR) | Major | Major | Moderate | Minor |
| Input | Major | Minor | Minor | Major | Moderate | Minor |

7 MAIN FEATURES OF THE VULCAN MODEL

Main features of the model are summarized in the following.

1. Explicitly addresses system vulnerability. The new model is named a VULnerability CAlculatioN (VULCAN) model rather than a reliability calculation model to highlight that system reliability is quantified based on the systems approach (open system perspective), and to highlight that attention has been given to quantifying the effect of the system environment on system reliability. In an open system perspective, system performance is related not only to inherent system characteristics, but also to environmental characteristics. In contrast, standard hardware reliability analyses are based on the closed system perspective, and system reliability is quantified considering only the effect of natural physical and chemical degradation processes (inherent reliability).

2. Explicitly addresses system suitability. System suitability refers to the ability of the system to deliver correct service, given there is no degradation in system components or structure. Inadequate system suitability is only revealed upon particular system input patterns. These failures are denoted input failures in the VULCAN model and the failure effect is quantified by the parameter test-independent failure (TIF) probability. That a component will operate as designed when called upon (high availability) does not necessarily ensure that the component will be able to perform its specified service from an operational point of view. For instance, the specified service of a fire detector is to detect the presence of a fire. However, fire detectors detect only fire phenomena such as heat, smoke and flame, and one particular type of detector is generally suitable for detecting only one type of fire phenomenon. Installing detectors with a very low failure rate is of little help if the detectors are not suitable for the type of fire conditions occurring in the area. (Flame detectors are unsuitable for detecting fire conditions with a lot of smoke.) These considerations are seldom addressed in Reliability, Availability and Maintainability (RAM) analyses. According to offshore field data, system vulnerability and suitability are as important as inherent reliability with respect to the reliability of hardware systems. Thus, in order to get a good agreement between predicted and observed system performance, characteristics of the system environment must be addressed in a realistic manner in the reliability model.

3. Transparent and simple. A reliability evaluation of a process safety system may easily be carried out by hand with the use of the VULCAN model. Attention has been given to highlighting important factors and suppressing unimportant factors during VULCAN model development, making the model transparent and simple.

4. Applicable in early design phase. Models, such as the VULCAN model, intended for the design phase of safety systems are particularly useful. The results of the VULCAN model may be used for trade-off analyses and to build reliability into the product in the first place, which is generally accepted to be more cost-effective than removing faults during the operations phase.

5. Quantifies the service actually delivered to the user. If vendors and users of process safety systems are going to use reliability models as an important tool in decision making, model output must be related to the service the system delivers, and not only to inherent system characteristics. Thus, reliability models must distinguish between the ways a system can fail (failure modes). The VULCAN model is particularly suited for decision making because the reliability measures are related to safety performance (CSU), as well as life cycle cost (STR and RR).

6. Customized for reliability data collection and expert judgement. The confidence in reliability predictions is dependent on the quality of model input data. The use of field data is generally believed to give more certain predictions than expert judgement. The VULCAN failure taxonomy ensures that model input is directly related to the type of reliability data available in the information systems presently used in the process industry (e.g., data on natural aging and stress failures are recorded in preventive and corrective maintenance systems, whereas data on intervention and input failures are recorded in accident and near-miss reports as well as alarm and unexpected shutdown reports).

7. Considers hardware, software and human elements in an integrated manner. As the performance of safety systems depends on proper interaction between hardware, software and humans, all three system elements must be considered if you want predictions to reflect field performance. By using the VULCAN failure taxonomy, hardware, software and human reliability is quantified in an integrated manner.

8. Quantifies the effect of fault removal and fault tolerance. Fault-removal and fault-tolerant techniques are particularly relevant for process safety systems. Thus, reliability models of safety systems must consider the effect of these techniques if you want reliability predictions to reflect field performance. The distinction between natural aging failures and the other three failure categories (stress, intervention and input failures) is particularly useful when modelling the effect of fault-tolerant hardware techniques.

Although the VULCAN model is tailored for the assessment of computer-based process safety systems, the main features of the model should be applicable to other man-made systems as well.

The overall objective of the VULCAN model is to provide a means to achieve reliability. It is hoped that this model may stimulate further research in reliability modelling to defeat any tendency among system vendors and end-users to look upon quantitative reliability analyses only as a 'numbers game'.

ACKNOWLEDGEMENT

The paper is based on the research project 'Reliability and Availability of Computer-Based Process Safety Systems' and the author's doctoral work on the same topic. The research project has been sponsored by the Royal Norwegian Council for Scientific and Industrial Research (NTNF), BP Petroleum Development Norway, EB Industry and Offshore, GP-Elliott Electronic Systems, Honeywell, Norfass, Norsk Hydro, Norske Shell, Phillips Petroleum Company Norway, Saga Petroleum, Siemens, Simrad Albatross and Statoil. The doctoral work has been carried out with grants given by Conoco.

REFERENCES

1. Bodsberg, L., Comparative study of quantitative models for hardware, software and human reliability assessment. QRE Int., 9 (1993) 501-518.

2. Nasser, J., Computer-assisted design enhancement of a PWR digital control system. EPRI report NP-5819, EPRI, California, 1988.

3. Bodsberg, L. & Hokstad, P., A system approach to reliability and life-cycle cost of process safety systems. IEEE Trans. Reliab., 44 (1995) 179-186.

4. Hokstad, P., Fløtten, P., Holmstrøm, S., McKenna, F. & Onshus, T., A reliability model for optimization of test schemes for fire and gas detectors. Reliab. Engng System Safety, 47 (1995) 15-25.

5. Villemeur, A., Reliability, Availability, Maintainability and Safety Assessment, John Wiley and Sons, Chichester, England, 1992.

6. Regulation concerning safety and communication systems on installations in the petroleum activities, Norwegian Petroleum Directorate, Stavanger, 1992.

7. API RP 14C, Recommended Practice for Analysis, Design, Installation and Testing of Basic Surface Safety Systems for Offshore Production Platforms, American Petroleum Institute, Dallas, USA, 1986.

8. Mosleh, A. & Siu, N. O., A multi-parameter common cause failure model. In Proc. 9th Int. Conf. Struct. Mech. Reactor Tech., Lausanne, Switzerland, 17-21 August 1987.

9. Bodsberg, L., VULCAN, a vulnerability calculation method for process safety systems. Dr Ing. thesis, Norwegian Institute of Technology, Trondheim, 1993.

10. Hokstad, P., Common cause and dependent failure modeling. In New Trends in System Reliability Evaluation, (ed. K.B. Misra) Elsevier, Amsterdam, 1993, pp. 411-444.

11. Høyland, A. & Rausand, M., System Reliability Theory, Models and Statistical Methods, John Wiley and Sons, NY, 1994.

12. Mosleh, Ali, Common cause failures: an analysis methodology and examples. Reliab. Engng System Safety, 34 (1991) 249-292.

13. Apostolakis, G. & Moieni, P., The foundations of models of dependence in probabilistic safety assessment. Reliab. Engng, 18 (1987) 177-195.

14. MIL-HDBK-217F, Military Handbook - Reliability Prediction of Electronic Equipment, Rome Air Development Center, New York, 1992.

15. NPRD-3, Nonelectronic Parts Reliability Data, Rome Air Development Center, New York, 1985.

16. OREDA-92, Offshore Reliability Data, DnV, Norway, 1992.

17. Smith, D. J., Reliability and Maintainability in Perspective, Macmillan Education, London, 1988.

18. FMD-91, Failure Mode/Mechanism Distributions, Rome Air Development Center, New York, 1991.

APPENDIX

1 VULCAN model assumptions

1. The signal path in a computer-based process safety shutdown system may be decomposed into different subsystems of replaceable hardware components (e.g., detectors, input cards, central processing units, etc.). Each subsystem may consist of several redundant components which deliver an analogous service (e.g., two gas detectors).

2. Any fault in the system is related to failure of a specific hardware component. The hardware components may contain software.

3. Any hardware component failure may be classified into exactly one of the three disjoint failure modes 1) FTO-failure, 2) SO-failure or 3) NC-failure.

4. Any hardware component failure may be classified into exactly one of the four disjoint failure categories 1) natural aging, 2) stress, 3) intervention, or 4) input failure.

5. Input FTO-failure has a constant probability of occurrence. Time to failure of the other combinations of failure mode and category is exponentially distributed.

6. Basic errors of a redundant subsystem are symmetric; i.e., the rate of double errors involving components A and B is equal to the rate of double errors involving components B and C.

7. All hardware components may contain self-test. For each particular component and failure mode, there is a constant fraction of the failure rate which is detected by the self-test.

8. There is a periodic manual test of all hardware components. Redundant components are tested simultaneously (no staggered test).

9. During the manual test, all operational FTO-failures are revealed, but no input FTO-failures or SO-failures are revealed.

10. Hardware component failures are only removed if they are detected by self-test or revealed during manual test. Thus, the fact that any activation of the safety system represents a possibility for fault detection and consequently fault removal is ignored.

11. Any revealed failure is removed immediately. After restoration the component is 'as good as new' (having the same rates of failure as prior to failure).

12. All hardware components may be repaired on-line; i.e., a degraded operating mode is possible.

2 VULCAN model limitations

To make the calculation model more transparent and also suitable for hand calculations, several approximations have been made during derivation of the VULCAN formulae. The approximate formulae should provide adequate accuracy for most practical applications. It is difficult to give general guidance on the upper or lower limits of model parameters and variables beyond which the approximate formulae become insufficient. The upper limit of one parameter, such as failure rate, is very much dependent on the value of other parameters, such as error multiplicity probability. The following list provides some guidance on the boundary conditions for the VULCAN model parameters and variables. When all of the following requirements are met, the difference between the exact formulae and the VULCAN (approximate) formulae is less than 10%.

1. The total rate of undetected FTO- and SO-failures ($\lambda^{F}_{U}$, $\lambda^{S}_{U}$) of a component must be less than $10^{-6}$ per hour.

2. The total rate of physical failures ($\lambda_{P}$) of a component must be less than $10^{-4}$ per hour.

3. The fraction of undetected FTO-, SO-common cause failures of redundant subsystems must be greater than 1% of the total subsystem failures (i.e., $p^{F}_{1/n} < 0.99$ and $p^{S}_{1/n} < 0.99$).

4. The fraction of undetected FTO-, SO-common cause failures of components in different subsystems must be less than 0.1% of the total subsystem failures.

5. Manual test interval ($\tau$) must be less than one year.

6. Self-test interval ($\tau_{s}$) must be less than one day.

7. Mean time to repair (MTTR) must be less than one day.
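The boundary conditions above lend themselves to a simple pre-check of model input. The following Python sketch is illustrative only: the class and field names are hypothetical, item 1 is interpreted as applying to the FTO and SO rates separately, and item 3 is only checked for redundant subsystems.

```python
from dataclasses import dataclass
from typing import List

HOURS_PER_DAY = 24.0
HOURS_PER_YEAR = 8760.0


@dataclass
class SubsystemInputs:
    """Hypothetical input record; field names are not from the paper."""
    lambda_f_undetected: float    # undetected FTO-failure rate, per hour
    lambda_s_undetected: float    # undetected SO-failure rate, per hour
    lambda_physical: float        # physical failure rate, per hour
    p_single_f: float             # fraction of single FTO-errors
    p_single_s: float             # fraction of single SO-errors
    ccf_across_subsystems: float  # common cause fraction between subsystems
    test_interval_h: float        # manual test interval, hours
    self_test_interval_h: float   # self-test interval, hours
    mttr_h: float                 # mean time to repair, hours
    redundant: bool = True


def violated_limits(x: SubsystemInputs) -> List[int]:
    """Return the item numbers (1 to 7) of the limitations that x violates."""
    checks = {
        1: max(x.lambda_f_undetected, x.lambda_s_undetected) < 1e-6,
        2: x.lambda_physical < 1e-4,
        3: (not x.redundant) or (x.p_single_f < 0.99 and x.p_single_s < 0.99),
        4: x.ccf_across_subsystems < 0.001,
        5: x.test_interval_h < HOURS_PER_YEAR,
        6: x.self_test_interval_h < HOURS_PER_DAY,
        7: x.mttr_h < HOURS_PER_DAY,
    }
    return [item for item, ok in checks.items() if not ok]
```

An empty list means that, under these interpretations, the input lies inside the range for which the approximate formulae are stated to be within 10% of the exact ones.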

3 Rationale for VULCAN error model

The α-factor model 8,12 is a general model for dependent failures which allows any number of error multiplicities to be considered in the analysis. In the VULCAN error model, only four different error multiplicities are incorporated. Simultaneous common mode failure of more than three components due to a common cause is treated as one error category and denoted a complete error. Thus, $p^{F}_{c/n}$ and $p^{S}_{c/n}$ are defined as the fraction of FTO- and SO-errors, respectively, involving simultaneous common mode failure of more than 3 out of n redundant components due to a common cause. Today, it is difficult to get representative data for double and triple errors. Thus, it is not very useful to try to distinguish between error multiplicities larger than 3.

Although the rate of system errors can be estimated in a direct manner, which is the main philosophy of the Basic Parameter (BP) model, 13 an indirect approach has been chosen in the VULCAN model. The direct approach is difficult because there is today little published data on the rate of system errors in redundant systems. Most data handbooks 14-18 refer to component failures rather than system errors. Furthermore, it is reasonable to believe that quantitative system reliability analyses considering the effect of redundancy will have to be based on a large degree of expert judgement. Experts are, however, more successful in providing relative numbers than absolute numbers. Thus, it is easier for experts to estimate the occurrence of error multiplicities in terms of fractions rather than absolute numbers.

The models described in the literature for indirect estimation of system error rates may be grouped into 1) component based models, which define model parameters in terms of component failures, and 2) error based models, which define model parameters in terms of system errors. Well known models of the first category are the beta-factor model and the Multiple Greek Letter (MGL) model. 13 Well known models of the second category are the Binomial Failure Rate (BFR) model and the Multinomial Failure Rate (MFR) model. 13 An error based model is incorporated in the VULCAN model because the model parameters have a very direct physical representation, which simplifies expert judgement. It is easier to give statements on the fraction of single, double, triple and complete errors in a system of redundant components than to give statements on the conditional probability that the cause of a component failure will be shared by one or more additional components. Additionally, the estimation of the model parameters from field data is easier in the error based models than in the component based models. 13

Concerning error based models, the MFR model suits the VULCAN failure taxonomy well. This model explicitly distinguishes between independent failures (natural aging failures) and shocks that may fail more than one component (dependent failures: stress, intervention and input failures). Using the concept of this model, the error multiplicity probability will refer to the fraction of errors involving dependent failures only. In contrast, in the present VULCAN model, the error multiplicity refers to the fraction of all errors (independent and dependent). The MFR model is more demanding than the α-factor model concerning expert judgement. In the MFR model the expert must first split between the fraction of independent and dependent component failures before he can give the fraction of each error multiplicity. The α-factor model is incorporated in the VULCAN model because the expert can make a direct judgement on the error multiplicity probability without first splitting between the fraction of independent and dependent failures.
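The indirect approach, in which the expert supplies fractions and the model converts them into subsystem error rates, can be sketched as follows. The conversion used here is inferred from the form of eqn (5) and is an assumption; the function name is hypothetical and eqn (1) itself should be consulted for the authoritative definition.

```python
from typing import List, Sequence


def subsystem_error_rates(p: Sequence[float], lam: float) -> List[float]:
    """Convert error multiplicity fractions into subsystem error rates.

    p[j-1] is the fraction of system errors with multiplicity j (j = 1..n)
    and lam is the component failure rate. Assumed relation, inferred from
    eqn (5): rate_j = n * p_j * lam / sum_l(l * p_l).
    """
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("error multiplicity fractions must sum to 1")
    n = len(p)
    weighted = sum(j * pj for j, pj in enumerate(p, start=1))
    return [n * pj * lam / weighted for pj in p]


if __name__ == "__main__":
    # Illustrative expert judgement for three redundant components.
    rates = subsystem_error_rates([0.90, 0.07, 0.03], lam=2e-6)
    # An error of multiplicity j fails j components, so the component
    # failures implied by the error rates should add up to n * lam.
    implied = sum(j * r for j, r in enumerate(rates, start=1))
    print(rates, implied)
```

The check in the usage example reflects the design choice behind the normalisation: the expert only states relative fractions, and the weighting by multiplicity ensures that the total component failure rate observed in the field is preserved.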
