Analysis of failure and recovery rates in a wireless telecommunications system

Analysis of GSM Outage Data for Availability Parameters Draft Version 0.9 Steven M. Matz Larry Votta Mohammad Malkawi December 4, 2004


Analysis of GSM Outage Data for Availability Parameters

Draft Version 0.9

Steven M. Matz, Larry Votta, Mohammad Malkawi

December 4, 2004

Introduction

Modeling and predicting the availability of new platforms requires knowledge of certain key parameters, e.g., mean time to failure (MTTF) and mean time to recover/repair (MTTR). One way to provide reasonable estimates for these parameters is to examine field performance data from currently deployed systems. We are particularly concerned here with the observed rate of outages in different failure classes (which maps to MTTF), the mean outage duration (which is a measure of MTTR), and the probability of successful recovery (or coverage).

A critical additional question for modeling is: What are the actual failure and outage distributions? Simple Markov models assume that the outage durations and times-to-failure are distributed as pure exponentials; if this is not the case, modeling the system assuming exponentials and the derived MTTRs/MTTFs will produce unreliable results. It is arguable that characterizing the shapes of the distributions is actually more important than the parameters, reflecting something fundamental about the way systems behave.

The following analysis attempts to characterize system failure and recovery behavior for a Motorola GSM system with 1288 BTSs, based on outage logs provided by Brooks Foley covering August 1999 through January 2000. The number of BTSs was roughly constant (±10 per month) over this period. These logs provide detailed documentation of all system outages to the BTS level. Outage times (start, stop, and duration) are recorded with a resolution of one minute. The outages are divided into 6 “Problem” categories (Table 1): BSC Hardware, BSS Software, BTS Hardware, Environmental Related, TELCO Related, and Human Error. (Ten events logged in non-standard categories were combined into either the Environmental or TELCO classes, as appropriate.) No outages were directly attributed to overload.

Outages are all measured at the BTS level. A single high level failure will result in an outage report for each affected BTS. For example, the “BSC Hardware” category is dominated by a complex BSC failure that produced outages in 28 BTSs for 167 minutes each. This is a reasonable and consistent way to measure the system impact of an outage, but the number of outages should not be interpreted as the failure rate of the BSC hardware alone.

In addition to the problem class and outage times, for each outage the logs contain the equipment site and location, BSC and BTS numbers, whether the outage was scheduled or unscheduled, and remarks which may further describe the outage/recovery.

Duplicate events were removed and the individual monthly spreadsheets were put into a consistent format and merged. Results from preliminary analysis of these data are shown below.

Outage impact by failure type

The overall summary of the outages (8/1999 — 1/2000, outage times in minutes) is shown in Table 1.

Outage categories are taken directly from the outage logs; for BTS Hardware and BSS Software we have also shown the results further divided into scheduled and unscheduled outages. The quoted errors on mean outages are the sample variances. The relative contribution of each failure type to total outage time is shown in Figure 1.

DRAFT 10/23/2000

Figure 1. Outage time by cause, August 1999 to January 2000. [Bar chart of percent of total outage time (axis 0 to 50) for, in plotted order: Human error, BSC Hardware, BTS Hardware, TELCO Related, BSS Software, Environmental Related.]

Figure 2 displays the distributions of the outage times in each of the five main outage classes (excluding the 28 attributed to Human Error) as a boxplot of the durations. A boxplot is a compact summary of the prominent features of a distribution. Each subset of data is represented by a box divided by a line indicating the data median. The height of the box corresponds to the spread of the bulk of the data (the central 50%), with the upper and lower ends of the box being the upper and lower quartiles. The lengths of the vertical dashed lines indicate how stretched the tails of the distribution are; they extend to the standard range of the data, defined as 1.5 times the inter-quartile range. Points lying beyond this range (“outliers”) are plotted individually.
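As a minimal sketch of how the boxplot quantities described above can be obtained, the following Python computes the median, quartiles, whisker ends, and outliers for a list of durations. The function name `boxplot_stats`, the sample durations, and the quartile interpolation rule are illustrative assumptions; the sketch also assumes the common convention that each whisker ends at the most extreme data point within 1.5 inter-quartile ranges of the box, which may differ in detail from the software used for Figure 2.

```python
import statistics

def boxplot_stats(durations):
    """Return median, quartiles, whisker ends, and outliers for a sample."""
    data = sorted(durations)
    # Quartiles by linear interpolation (one of several common conventions).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - 1.5 * iqr
    hi_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = [x for x in data if x < lo_fence or x > hi_fence]
    return {
        "median": median, "q1": q1, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical outage durations in minutes (not taken from the outage logs).
sample = [1, 2, 2, 3, 5, 8, 12, 20, 35, 400]
stats = boxplot_stats(sample)
```

With strongly skewed durations like these, a single long outage is reported as an outlier while the box and whiskers still describe the bulk of the data.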

External causes dominate, with the Environmental and TELCO categories accounting for 68.4% of the total reported outage time. We can compare this to Kuhn’s analysis of reported failures in the PSTN [4] (see Table 2). Kuhn’s categories of “Human errors (others)”, “Acts of nature”, and “Vandalism” correspond to our Environmental and TELCO Related categories; in Kuhn these accounted for 59% of non-overload outage time, roughly comparable to what we see in the GSM data. Kuhn’s data are biased to long outages (> 30 min) of large switches and do not necessarily reflect the total behavior of the systems we are interested in here. A more direct comparison can be made by examining only the long (> 30 min) outages in the GSM data. The Environmental and TELCO categories account for about the same fraction (71.6%) of the total outage time in these longer outages as they do in the overall data set.
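The two headline percentages in this comparison can be checked directly from the outage-time fractions in Table 1 and Table 2. The short Python sketch below does that arithmetic; the variable names are illustrative, and the grouping of Kuhn's categories follows the correspondence stated in the text.

```python
# Fractions of total outage time, in percent, from Table 1 (GSM) and Table 2 (Kuhn).
gsm = {"Environmental Related": 50.9, "TELCO Related": 17.5}
kuhn = {
    "Human error (company)": 13.7, "Human error (others)": 14.1,
    "Acts of nature": 18.3, "Hardware failures": 7.1,
    "Software failures": 2.1, "Overloads": 44.0, "Vandalism": 0.7,
}

# External causes in the GSM data.
gsm_external = sum(gsm.values())

# Kuhn's external categories as a share of non-overload outage time.
kuhn_external = (kuhn["Human error (others)"] + kuhn["Acts of nature"]
                 + kuhn["Vandalism"])
kuhn_non_overload = sum(kuhn.values()) - kuhn["Overloads"]
kuhn_external_share = 100.0 * kuhn_external / kuhn_non_overload
```

Here `gsm_external` comes to about 68.4 and `kuhn_external_share` to about 59, matching the figures quoted above.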

On the other hand, the “Human error” rate reported here is extremely low compared to Kuhn and other industry studies of causes of telecom outages. Kuhn’s “human errors (company)” accounted for 25% of outages and 14% of customer minutes lost. Presumably in our data many of these types of errors have been placed in other categories.

Figure 2. Outage summary, GSM system data, August 1999 to January 2000. [Boxplots of outage time (min) for BSC HW, BSS SW, BTS HW, Environ, and TELCO.] Note: one 3453 min Environmental outage not shown.

The largest “internal” contributor to outage time is BSS Software, at 22.3%, followed by BTS Hardware (5.6%). These are the outages we will try to characterize in the following sections.

MTTF Estimates

Assuming a constant 1288 BTSs throughout this period (184 days, from Aug. 1, 1999 through Jan. 31, 2000), we can derive estimated MTTFs for each class of failure from the outage data (Table 3).
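A minimal sketch of this simple estimate, assuming it is total BTS operating hours divided by the number of outages (with calendar hours standing in for operating hours, since downtime is small by comparison). The function name `simple_mttf` is illustrative; the outage counts are those of Table 1.

```python
N_BTS = 1288
HOURS = 184 * 24  # Aug. 1, 1999 through Jan. 31, 2000

def simple_mttf(n_outages):
    """Total BTS-hours divided by outage count (constant-rate estimate)."""
    return N_BTS * HOURS / n_outages

# Outage counts from Table 1.
counts = {
    "BSS Software": 2553,
    "BTS Hardware": 179,
    "BSC Hardware": 224,
    "Human error": 28,
}
mttf = {cause: simple_mttf(n) for cause, n in counts.items()}
```

Under these assumptions the results round to about 2230, 31,800, 25,400, and 203,000 hours, which agrees with the overall MTTF column of Table 3.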

These derived numbers assume an exponential process with a constant failure rate throughout the interval. In order to examine the actual distribution of failures, we can perform survival analysis of the systems, taking into account the censoring effects of the finite observation interval.

Table 1. Summary of Outages in GSM system, Aug. 1999 to Jan. 2000

Cause                    Number of     Mean BTS outage   Median outage   Fraction of all
                         BTS outages   duration (min)    (min)           outage time
Environmental Related    1017          117.1 ± 6.0       52              50.9%
BSS Software             2553          20.4 ± 1.0        2               22.3%
    Scheduled            645           32.2 ± 2.5        6               8.9%
    Unscheduled          1908          16.4 ± 1.0        2               13.4%
TELCO Related            998           40.9 ± 3.2        3               17.5%
BTS Hardware             179           73.5 ± 7.6        28              5.6%
    Scheduled            72            76.1 ± 11.7       36.5            2.3%
    Unscheduled          107           71.7 ± 10.0       22              3.3%
BSC Hardware             224           36.2 ± 3.8        11              3.5%
Human error              28            21.9 ± 0.7        20              0.3%

Table 2. Outages in PSTN, Apr. 1992 to Mar. 1994 (Kuhn 1997 [4])

Cause                    Number of   Mean outage      Customer minutes   Fraction of
                         outages     duration (min)   (10^6)             outage time (%)
Human error (company)    77          149.4            2349.3             13.7
Human error (others)     73          360.1            2415.8             14.1
Acts of nature           32          828.2            3124.0             18.3
Hardware failures        56          159.8            1210.8             7.1
Software failures        44          119.3            355.5              2.1
Overloads                18          1123.7           7527.2             44.0
Vandalism                3           456.0            110.5              0.7

For the BTS Hardware failures, we determine for each BTS the time to first hardware failure (taking August 1, 1999 as the starting time). BTSs which do not fail before January 31, 2000 are assigned a right-censored interval of 184 days. From this list of intervals we can derive a survival curve based on the Kaplan-Meier estimator [3, 2, 5]. The survival curve (Figure 3) estimates the fraction of systems surviving (i.e., not failed) as a function of time.
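The Kaplan-Meier construction can be sketched in a few lines of Python. This is an illustrative product-limit implementation, not the authors' analysis code; the function name `kaplan_meier` and the toy intervals are assumptions, and confidence intervals are omitted.

```python
def kaplan_meier(times, observed):
    """Return [(t, S(t))] at each distinct failure time.

    times: time to first failure, or to the end of observation, per unit.
    observed: True if the unit failed at that time, False if right-censored.
    """
    events = sorted(zip(times, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0
        leaving = 0
        # Group all units that share this time.
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # failed and censored units both leave the risk set
    return curve

WINDOW_HR = 184 * 24
# Hypothetical toy data: three units fail, two survive the whole window.
times = [500.0, 1200.0, 1200.0, WINDOW_HR, WINDOW_HR]
observed = [True, True, True, False, False]
curve = kaplan_meier(times, observed)
```

For the toy data the estimate steps to 0.8 at 500 hr and to 0.4 at 1200 hr; the two censored units contribute to the risk set but never produce a step.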

Also shown in Figure 3 is a replotting of the estimated survival curve as log(−log(F(t))) vs. log(t); under this scaling a Weibull process with a time-dependent failure rate

$$\lambda(t) = \lambda p (\lambda t)^{p-1}$$

will produce a straight line with a slope equal to the shape parameter p and an intercept at log(t) = 0 equal to p log(λ) [2]. In this case we can see that the data are fairly well-described by a line, following a Weibull process with p = 0.841 and 1/λ = 62900 hrs. (p < 1 produces a failure rate decreasing with time, p > 1 an increasing failure rate, and p = 1 reduces to the exponential process.)
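The straight-line scaling suggests one simple way to recover the Weibull parameters: fit a line to log(−log S) against log t and read p from the slope and 1/λ from the intercept. The sketch below uses an unweighted least-squares fit, which is an assumption (the text does not say how the line was fit), and checks it on synthetic points generated from an exact Weibull curve rather than on the outage data.

```python
import math

def weibull_line_fit(times, survival):
    """Least-squares fit of log(-log S) against log t.

    Returns (p, scale), where p is the slope and scale = 1/lambda.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(s)) for s in survival]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    p = sxy / sxx
    intercept = my - p * mx           # equals p * log(lambda)
    scale = math.exp(-intercept / p)  # 1 / lambda
    return p, scale

# Synthetic check using the parameters quoted for first BTS hardware failures.
p_true, scale_true = 0.841, 62900.0
ts = [100.0, 500.0, 1000.0, 2000.0, 4000.0]
ss = [math.exp(-((t / scale_true) ** p_true)) for t in ts]
p_fit, scale_fit = weibull_line_fit(ts, ss)
```

On exact Weibull points the fit returns the input parameters; on real survival estimates the points near the start of the curve are noisy, so a weighted fit may be preferable.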

The mean of the Weibull distribution is given by (1/λ) Γ(1 + 1/p), so the MTBF estimated from the fitted parameters is 68,900 hrs. This is significantly higher than the simple estimate shown in Table 3. If we include the time to fail for all hardware failures instead of just the first for each BTS, we get a MTBF of 49,200 hrs.
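The mean formula is easy to evaluate with the standard-library gamma function. The helper name `weibull_mean` is illustrative; the inputs are the fitted values quoted in the text.

```python
import math

def weibull_mean(p, scale):
    """Mean of a Weibull distribution with shape p and scale 1/lambda."""
    return scale * math.gamma(1.0 + 1.0 / p)

# Fitted values for first BTS hardware failures (scale in hours).
mtbf_hw = weibull_mean(0.841, 62900.0)

# With p = 1 the Weibull reduces to an exponential and the mean is the scale.
mean_exponential = weibull_mean(1.0, 100.0)
```

The first call gives roughly 68,900 hr, in line with the MTBF stated above.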

Figure 3. Left: Kaplan-Meier estimator of the survival curve for first BTS Hardware failures, along with the 95% confidence intervals (dashed lines). The y-axis is on a log scale. Right: Same curve, scaled to show Weibull processes as straight lines, along with a line fit to the data. [Left panel: survival fraction F(t) vs. time (hr). Right panel: log(−log(F(t))) vs. log(time (hr)).]

For software failures we consider time-to-failure not only for the first failure on each BTS but for every failure (counting from the last software reset). This is justified since a software reset provides essentially a clean state and restarts the failure clock. (We have further assumed that any type of failure on the BTS implies a software reset.) Figure 4 shows the estimated survival curve for the unscheduled BSS Software failures. The right hand plot in Figure 4 is the Weibull scaling of the same curve, showing that a Weibull process is a good model of the unscheduled software failures. The fitted line has p = 0.443 and 1/λ = 2570 hrs, giving a MTBF of 6600 hrs for unscheduled software failures. This is a rapidly decreasing failure rate, implying software failures are more likely near the time of restart, possibly due to initialization problems. (Software aging, on the other hand, would produce an increasing failure rate during the operating interval.)
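To make the decreasing failure rate concrete, the sketch below evaluates the Weibull failure rate λp(λt)^(p−1) at a few times since restart, using the fitted software parameters, and also recomputes the MTBF from the Weibull mean (1/λ) Γ(1 + 1/p). The function name and the chosen evaluation times are illustrative.

```python
import math

def weibull_hazard(t, p, scale):
    """Failure rate lambda * p * (lambda * t)**(p - 1), with lambda = 1/scale."""
    lam = 1.0 / scale
    return lam * p * (lam * t) ** (p - 1.0)

P_SW, SCALE_SW = 0.443, 2570.0  # fitted values quoted above, scale in hours

# Failure rate (per hour) at 1, 10, 100, and 1000 hours after a restart.
rates = {t: weibull_hazard(t, P_SW, SCALE_SW) for t in (1.0, 10.0, 100.0, 1000.0)}

# MTBF implied by the same parameters.
mtbf_sw = SCALE_SW * math.gamma(1.0 + 1.0 / P_SW)
```

Each tenfold increase in time since restart lowers the failure rate by a factor of 10^(1−p), about 3.6 for these parameters, and the implied MTBF is close to the 6600 hr quoted.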

Note that we are considering the intervals between failures, independent of the location of the interval in the period August 1999 through January 2000. This should not be confused with reliability growth measures which would determine a systematic change in failure rate from the beginning to the end of the period covered by the logs.

Figure 4. Left: Kaplan-Meier estimator of the survival curve for all unscheduled BSS Software failures, along with the 95% confidence intervals (dashed lines). The y-axis is on a log scale. Right: Same curve, scaled to show Weibull processes as straight lines, along with a line fit to the data. [Left panel: survival fraction S(t) vs. time (hr). Right panel: log(−log(S(t))) vs. log(time (hr)).]

There is a significant variation from month to month in the software failure rates (Table 4 and Figure 5). This variation indicates that there are some changes either in the software or the system operating environment during this period which affect the software MTTF. Further investigation of possible changes in the system is necessary to understand the observed fluctuations. In contrast to the software MTTF, the month to month variation of the HW MTTF (Figure 6) is reasonably consistent with statistical fluctuation about the overall mean.
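The monthly values in Table 4 can be reproduced with the same simple ratio of BTS-hours to outage count. The sketch below assumes 1288 BTSs and a fixed 744-hour (31-day) month for every entry; that month length is an inference from the tabulated numbers, not something the text states.

```python
N_BTS = 1288
HOURS_PER_MONTH = 31 * 24  # assumption: the same 744-hour month for every entry

# Monthly BSS SW outage counts from Table 4: (all outages, unscheduled only).
outages = {
    "Aug": (821, 548), "Sep": (531, 435), "Oct": (255, 168),
    "Nov": (268, 191), "Dec": (413, 332), "Jan": (265, 234),
}

# Estimated MTTF in hours per month: (all outages, unscheduled only).
monthly_mttf = {
    month: tuple(round(N_BTS * HOURS_PER_MONTH / n) for n in counts)
    for month, counts in outages.items()
}

total_all = sum(all_count for all_count, _ in outages.values())
total_unsched = sum(unsched for _, unsched in outages.values())
```

The monthly counts sum to the 2553 total and 1908 unscheduled BSS Software outages of Table 1, and the August-to-October change in MTTF is roughly a factor of three.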

The BTS outages attributed to BSS SW are due to either BTS or BSC failures. However, there aren’t any cases in this class with a large number of simultaneous BTS outages as would be expected if the root cause were a BSC failure, so we have assumed (somewhat conservatively) that all BSS SW outages are BTS SW failures.

Figure 5. Estimated BSS SW MTTF for unscheduled outages, by month, August 1999 to January 2000. The dashed line indicates the derived SW MTTF for the entire period and the error bars are the 90% confidence intervals. The monthly data are not consistent with the assumption of a constant SW MTTF over this period. This implies that there is some time dependence in the operating environment or the software itself (e.g., new versions loaded) which affected the failure rate. [Axes: BSS SW MTTF (hours) vs. month, Aug through Jan.]

MTTR Estimates

The second critical set of parameters for availability modeling is the recovery/repair times. Here we examine the mean BSS Software and BTS Hardware outage times (Table 1) to estimate the mean time to recover (MTTR) for these classes of failures.

We treat the scheduled and unscheduled BTS Hardware outages together since the mean and variance of the two classes are similar (though the distributions do differ in detail).

Figure 6. Estimated BTS HW MTTF by month, August 1999 to January 2000. The dashed line indicates the derived HW MTTF for the entire period and the error bars are the 90% confidence intervals. In contrast to the BSS SW failure rates, the HW MTTF data are consistent with a constant BTS HW failure rate during this period. [Axes: BTS HW MTTF (hours) vs. month, Aug through Jan.]

One way of visualizing distributions is a quantile-quantile (Q-Q) plot [1], which pairs points of equal probability from two different distributions. A straight line on a Q-Q plot implies that the two distributions are from the same family. The distribution of all BTS HW outage durations is compared to an exponential distribution using a Q-Q plot in Figure 7. If the outage times were the result of an exponential process the data would fall along a straight line. Since this is not the case, a more complicated model of the process is necessary.
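As a rough illustration of how the points of an exponential Q-Q plot are formed, the Python below pairs each sorted duration with the quantile of a unit-mean exponential at the same probability. The plotting position (i − 0.5)/n is one common choice and an assumption here, as are the function name and the sample durations.

```python
import math

def exponential_qq(durations):
    """Pair sorted durations with standard exponential quantiles."""
    data = sorted(durations)
    n = len(data)
    points = []
    for i, value in enumerate(data, start=1):
        prob = (i - 0.5) / n                 # plotting position (one common choice)
        theoretical = -math.log(1.0 - prob)  # unit-mean exponential quantile
        points.append((theoretical, value))
    return points

# Hypothetical durations in minutes (not from the outage logs).
sample = [2, 5, 9, 14, 22, 35, 60, 110, 240]
qq = exponential_qq(sample)
```

For exponentially distributed durations the points would lie near a line through the origin with slope equal to the mean duration; curvature or a change in slope signals a departure from a single exponential.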

One possible model implied by Figure 7 is two exponential processes, each occurring for a fixed fraction of the BTSs. The parameters associated with this description are given in Table 5.

A more natural model can be found by applying the survival analysis process described in the preceding section to the hardware outage durations. Since all the hardware failures were repaired within the recorded interval, there is no censoring. Figure 8 shows the estimated survival curve and the scaled data. In this case it appears that a Weibull distribution adequately describes the repairs, with p = 0.645 and 1/λ = 54.8 min. This gives a MTTR of 75.5 min. Note that in this case we are modeling a repair rate, and p < 1 implies that the repair rate decreases with elapsed outage time, i.e., the expected remaining time to repair increases (gets worse) the longer an outage lasts.

Table 3. BTS MTTF by cause in GSM system, Aug. 1999 to Jan. 2000. Also shown are MTTF estimates for the sub-classes of scheduled and unscheduled outages.

                    Number of BTS outages    Estimated BTS MTTF (hr)
Cause               Sched.    Unsched.       Overall    Sched.       Unsched.
Environ. Related    195       822            5600       29,000       6900
BSS Software        645       1908           2230       8800         2980
TELCO Related       61        937            5700       93,000       6100
BTS Hardware        72        107            31,800     79,000       53,000
BSC Hardware        1         223            25,400     5,700,000    25,500
Human error         0         28             203,000    NA           203,000

All causes          974       4025           1140       5840         1410

Table 4. Monthly software outages and estimated MTTFs, Aug. 1999 to Jan. 2000

                                 Aug     Sep     Oct     Nov     Dec     Jan
No. of BSS SW outages            821     531     255     268     413     265
SW MTTF (hrs)                    1167    1805    3758    3576    2320    3616

Unsched. BSS SW outages          548     435     168     191     332     234
SW MTTF (unsched.) (hrs)         1749    2203    5704    5017    2886    4095

Analyzing the BSS Software outages using a Q-Q plot, we find a significant difference between recoveries from scheduled and unscheduled outages (Table 1). The duration distributions for both types of outages are plotted individually in Figure 9. The quantile-quantile plot in the same figure clearly demonstrates the difference in the two distributions in the deviation from the reference line, with the scheduled SW outage distribution being broader. This is also seen in the longest duration outages: approximately 33.5% (216/645) of the scheduled BSS SW outages were longer than 40 min., compared to only 7.2% (137/1908) of unscheduled outages.

Given the differences between the distributions, it is reasonable to treat the two subsets separately. Figure 10 compares the distributions for scheduled and unscheduled BSS Software outages with reference exponential distributions.

Figure 7. Exponential Q-Q plots for BTS HW outage durations, GSM system data, August 1999 to January 2000. We also show two exponential distributions which together roughly describe the data. [Axes: outage duration (min) vs. theoretical quantiles.]

These plots clearly show that neither type of outage is correctly modeled by a single exponential. However, as shown by the fitted lines, two exponentials do provide a reasonable qualitative description of the distributions in each case (though there are still significant residuals). The parameters characterizing the two exponentials are given in Table 5. This interpretation would imply two separate recovery paths selected in a fixed ratio, perhaps resulting from two different classes of software failures.
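A two-exponential description of this kind can be written as a mixture in which each outage follows one of two recovery paths with a fixed probability. The sketch below encodes the Table 5 parameters and evaluates the mixture survival function and mean; the dictionary keys and function names are illustrative, and since Table 5 is described as approximate, the implied means need not match the sample means in Table 1.

```python
import math

# (MTTR in minutes, fraction of events) for each component, from Table 5.
BI_EXP = {
    "BSS SW scheduled":   [(21.4, 0.74), (97.2, 0.26)],
    "BSS SW unscheduled": [(2.4, 0.82), (78.0, 0.18)],
    "BTS HW":             [(41.8, 0.65), (134.5, 0.35)],
}

def mixture_survival(t, components):
    """Fraction of outages still unrecovered after t minutes."""
    return sum(frac * math.exp(-t / mttr) for mttr, frac in components)

def mixture_mean(components):
    """Mean outage duration implied by the mixture."""
    return sum(frac * mttr for mttr, frac in components)

means = {name: mixture_mean(c) for name, c in BI_EXP.items()}
still_out_40 = mixture_survival(40.0, BI_EXP["BSS SW unscheduled"])
```

In a Markov availability model this corresponds to branching from the failed state into a fast or a slow repair state with the tabulated probabilities.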

Again, survival analysis provides an alternative approach to these data. In the case of the scheduled BSS Software outages, a Weibull distribution does describe the data (Figure 11), with a shape p = 0.516 and a MTTR of 31.6 min. The same treatment of the unscheduled outages, however, shows that these data do not follow a Weibull distribution (Figure 12).

Figure 8. Left: Kaplan-Meier estimator of the survival curve for BTS Hardware repair durations, along with the 95% confidence intervals (dashed lines). The y-axis is on a log scale. Right: Same curve, scaled to show Weibull processes as straight lines, along with a line fit to the data. [Left panel: survival fraction F(t) vs. repair time (min). Right panel: log(−log(F(t))) vs. log(Time (hr)), as labeled in the original plot.]

Figure 9. BSS SW scheduled (upper left) and unscheduled (upper right) outage durations < 40 min, GSM system data, August 1999 to January 2000. Below is a quantile-quantile plot comparing the two distributions. The line in the Q-Q plot shows the expected relation for equal distributions. The two sets of data are not drawn from the same distribution, a conclusion confirmed by the Kolmogorov-Smirnov test. [Upper panels: number of BSS SW outages vs. outage duration (min), 0 to 40. Lower panel: Q-Q plot, Sched. vs. Unsched. BSS SW Outages; unscheduled outage durations (min) vs. scheduled outage durations (min).]

Coverage

Another key parameter for predicting the availability of a system is the coverage, or probability of successful recovery (including repair or switchover) from a failure. In order to achieve high availability, coverages of well over 90% are usually required.

In order to estimate the coverage for automatic recoveries from software failures, we examined the 2553 outages attributed to BSS Software.

As Liang Yin pointed out [6], two failures close together within a short time interval on the same unit may be the result of an unsuccessful repair of the original failure. However, failed repairs are not the only possible cause of related outages. Some multiple failures may be the result of a persistent external fault which should not properly be counted against the BTS. Detailed root cause analysis of each failure could resolve this ambiguity. Since we do not have that information, we select as a proxy the subset of correlated outages which occur roughly on the recovery timescale, assuming that a failed recovery “promptly” produces a new outage.

Figure 10. Exponential Q-Q plots for BSS SW scheduled (left) and unscheduled (right) outage durations, GSM system data, August 1999 to January 2000. For each data set we also show two exponential distributions which together roughly describe the data. [Both panels: outage duration (min) vs. theoretical quantiles.]

To examine software failures and recoveries we select intervals of 10 and 20 minutes. Ten minutes is fairly long compared to the mean auto-recovery outage duration (6.6 min); only about 10% (84/845) of SW failures which “Resumed to normal automatically” had outages greater than 10 min. Twenty minutes is longer than about 95% of this class of outage.

Figure 11. Left: Kaplan-Meier estimator of the survival curve for recoveries from scheduled BSS Software failures, along with the 95% confidence intervals (dashed lines). The y-axis is on a log scale. Right: Same curve, scaled to show Weibull processes as straight lines, along with a line fit to the data. [Left panel: survival fraction F(t) vs. repair time (min). Right panel: log(−log(F(t))) vs. log(Time (hr)), as labeled in the original plot.]

Figure 12. Left: Kaplan-Meier estimator of the survival curve for recoveries from unscheduled BSS Software failures, along with the 95% confidence intervals (dashed lines). The y-axis is on a log scale. Right: Same curve, scaled to show Weibull processes as straight lines, along with a line fit to the data. In this case a Weibull process is not a good description of the data. [Left panel: survival fraction F(t) vs. repair time (min). Right panel: log(−log(F(t))) vs. log(Time (hr)), as labeled in the original plot.]

Table 5. Approximate bi-exponential descriptions of outage distributions

Outage Type                  Component   MTTR (min)   Fraction of events
BSS Software, Scheduled      Exp1        21.4         0.74
                             Exp2        97.2         0.26
BSS Software, Unscheduled    Exp1        2.4          0.82
                             Exp2        78.0         0.18
BTS Hardware                 Exp1        41.8         0.65
                             Exp2        134.5        0.35

Based on the overall failure rate, the random probability of a second but unrelated failure in the same BTS within 10 minutes is about 0.015%. Thus we would expect about 0.37 observed by chance among the 2553 SW related outages. In the 6 months of outage data, there are 9 cases where the original outage is attributed to “BSS Software” and a second outage of the same BTS (from any cause) follows within 10 minutes. If we assume all of these result from a failed initial repair (= restart), we get a probability of successful recovery from SW failure of ∼(2553 − 9)/2553 = 99.65%. For Poisson statistics, the 95% upper limit for the true mean given 9 observed counts is 15.71, giving a 95% lower limit of 99.38% for the coverage.
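The chance-coincidence estimate and the point estimate of coverage can be reproduced with a few lines of arithmetic. The sketch assumes that the "overall failure rate" is the all-causes BTS MTTF of 1140 hr from Table 3 and that the chance probability is simply the window length divided by that MTTF; both are assumptions that happen to reproduce the quoted figures. Names are illustrative.

```python
MTTF_ALL_HR = 1140        # overall BTS MTTF for all causes, from Table 3
N_SW_OUTAGES = 2553       # outages attributed to BSS Software
REPEATS_10_MIN = 9        # repeat outages within 10 minutes

def chance_repeat_probability(window_min, mttf_hr=MTTF_ALL_HR):
    """Approximate probability of an unrelated failure within the window."""
    return window_min / (mttf_hr * 60.0)

p_chance = chance_repeat_probability(10)
expected_by_chance = p_chance * N_SW_OUTAGES

# Point estimate: every repeat within the window counts as a failed recovery.
coverage = (N_SW_OUTAGES - REPEATS_10_MIN) / N_SW_OUTAGES
```

The expected number of chance coincidences (about 0.37) is small compared with the 9 observed, which is why the repeats are attributed to failed recoveries rather than independent failures.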

We can restrict consideration to automatic recoveries by trying to identify these from the “Remarks” included in the outage logs. There were 911 recoveries referred to as “auto,” “automatic,” etc., in the Remarks. In a few cases it is slightly ambiguous whether they are auto-recoveries or not; e.g., 14 refer to GPROC LCF swaps. 892 of the comments contain “Resumed to normal automatically” and 2 more refer to “auto-reset.” So between 894 and 911 of the 2553 SW related outages appear to be explicitly auto-repairs. With 9 failed recoveries this implies an auto-repair coverage of 99.0% (95% lower limit = 98.2%). Expanding the interval to 20 minutes produces a total of 25 repeated recoveries after attempted auto-repair, implying a coverage of ∼97.2%. (The 95% lower limit is 96.1%.) The chance probability of an unrelated failure in the same BTS in this longer interval is still low.
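The Poisson limits behind these coverage bounds can be computed without special libraries by solving for the mean at which observing the given count or fewer has probability 5%. The sketch below does this by bisection. Using 894 (the lower count of explicit auto-recoveries) as the denominator is an assumption that appears to reproduce the quoted values; function names are illustrative.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def poisson_upper_limit(k, cl=0.95):
    """Mean mu at which P(X <= k) equals 1 - cl, found by bisection."""
    lo, hi = float(k), float(k) + 10.0 * math.sqrt(k + 1.0) + 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > 1.0 - cl:
            lo = mid   # mean still too small
        else:
            hi = mid
    return 0.5 * (lo + hi)

def coverage_lower_limit(failed, attempts, cl=0.95):
    """Lower limit on coverage from the upper limit on failed recoveries."""
    return 1.0 - poisson_upper_limit(failed, cl) / attempts

upper_9 = poisson_upper_limit(9)
all_sw = coverage_lower_limit(9, 2553)    # all BSS SW outages, 10 min window
auto_10 = coverage_lower_limit(9, 894)    # explicit auto-repairs, 10 min window
auto_20 = coverage_lower_limit(25, 894)   # explicit auto-repairs, 20 min window
```

The upper limit for 9 counts comes out near 15.7, and the three lower limits are close to 99.38%, 98.2%, and 96.1%.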

This is the probability of success once an auto-repair is attempted. It is also relevant to consider the fraction of SW failures that were recovered automatically. If we assume that all recoveries that were not explicitly called automatic required manual intervention, the auto/man repair fractions are about 35.7%/64.3% respectively. The mean outage duration for auto-recoveries was 6.6 min; for the remaining outages the mean duration was 28.1 min.
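As a consistency check on these fractions, the sketch below recomputes the auto-repair share from the 911 explicit auto-recoveries and combines the two mean durations into an overall mean. Variable names are illustrative.

```python
N_SW_OUTAGES = 2553
N_AUTO = 911             # recoveries described as automatic in the Remarks
MEAN_AUTO_MIN = 6.6      # mean outage duration, auto-recoveries
MEAN_OTHER_MIN = 28.1    # mean outage duration, remaining outages

auto_fraction = N_AUTO / N_SW_OUTAGES
manual_fraction = 1.0 - auto_fraction

# Weighted mean duration over both recovery paths.
overall_mean = auto_fraction * MEAN_AUTO_MIN + manual_fraction * MEAN_OTHER_MIN
```

The weighted mean comes to about 20.4 min, which agrees with the mean BSS Software outage duration in Table 1.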

Conclusions

In this system we found a coverage of ∼98% for auto-recovery from unscheduled BSS Software failures with an auto-repair fraction of ∼36%.

None of the processes we have examined (hardware and software failures, software recoveries, and hardware repairs) is adequately described by a single exponential process (and a single MTTR/MTTF). Empirical models approximating the HW and SW repairs could be constructed with two repair processes (with the fitted rates/MTTRs) for each class of failure; the repair processes would be selected with the probabilities shown in Table 5. Most of the processes can be characterized by a Weibull distribution (Table 6); the exception is the durations of recoveries from unscheduled BSS Software outages.

Table 6. Weibull parameters for Tier 1 GSM failure and recovery processes

Process                                p        1/λ          Mean
BTS Hardware Failures                  0.841    62,900 hr    MTBF 68,900 hr
BTS Hardware Repairs                   0.645    54.8 min     MTTR 75.5 min
BSS Software Failures (unscheduled)    0.443    2570 hr      MTBF 6600 hr

At this point we cannot identify a class of failure or type of repair that decomposes the outages into simpler distributions. The observed shapes of the distributions suggest there might be an orthogonal decomposition of the outages into single exponential processes. Such a decomposition should be based on categories in the data (e.g., scheduled vs. unscheduled, automatic vs. manual, time of day, etc.) and should produce subsets of the data with measurably different characteristics. Unfortunately no such decomposition of the GSM data is obvious. For example, one natural subset, unscheduled auto-recovered SW outages, still requires 2 exponentials.

There are a number of caveats that apply to the numbers we have derived. For example, durations are given only to the minute and some outages are given as 0 minutes. Higher resolution reporting (one second or better) is required for proper outage analysis, particularly for the rapid software recoveries. Further, in this analysis we have assumed all BSS Software failures are attributable to the BTS; it would be preferable for BTS and BSC software failures to be separate categories, as they are for hardware. In addition, the sample size and duration are limited, some problems or behavior may be peculiar to the system involved, and the system configuration may not be stable throughout the period. Other uncertainties associated with particular values (e.g., the observed variation in BSS SW MTTF) have already been discussed and should be kept in mind when interpreting the reported numbers.

An even more fundamental concern, though, is the process of assigning outages to one of the six specific categories. Current systems lack the instrumentation and reporting needed to track failures down to the HW/SW component level. Even the coarse assignment of outages to hardware or software is often problematic and requires significant manual analysis. However, even with the data available some of the categorizations raise questions. By reviewing the Remarks associated with the outages we see that, for example, many of the scheduled BSS Software outages do not appear to be directly related to software. These apparent conflicts reflect not only the state of instrumentation but also 1) the lack of consistent definitions/standards for outage categories and 2) the inherent complexity of the systems. These issues need to be addressed in order to provide meaningful and consistent availability accounting.

References

[1] J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey. Graphical Methods for Data Analysis. Chapman and Hall, 1983.

[2] J. D. Kalbfleisch and R. L. Prentice. The Statistical Analysis of Failure Time Data. John Wiley and Sons, 1980.

[3] E. L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc., 53:457–481, 1958.

[4] D. Kuhn. Sources of failure in the public switched telephone network. IEEE Computer, 30(4):31–36, April 1997.

[5] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus. Springer-Verlag, 1994.

[6] L. Yin. 4NINES Availability Modeling for the Radio Subsystem of the GSM Network — Data Analysis. Motorola internal memo, June 28 2000.