
Universiteit van Amsterdam
IAS intelligent autonomous systems

Submitted to: IJAR

IAS technical report IAS-UVA-06-03

Fault Localization in Bayesian Networks

Jan Nunnink and Gregor Pavlin
Intelligent Systems Laboratory Amsterdam, University of Amsterdam
The Netherlands

This paper considers the accuracy of classification using Bayesian networks (BNs). It presents a method to localize network parts that are (i) in a given (rare) case responsible for a potential misclassification, or (ii) modeling errors that consistently cause misclassifications, even in common cases. We analyze how inaccuracies introduced by such network parts are propagated through a network and derive a method to localize the source of the inaccuracy. The method is based on monitoring the BN's 'behavior' at runtime, specifically the correlation among a set of observations. Finally, when bad network parts are found, they can be repaired or their effects mitigated.

Keywords: Bayesian networks, Fault localization, Classification.

Contents

1 Introduction
2 Bayesian networks and classification
  2.1 Factorization
3 Classification Accuracy
4 Fault Probability
  4.1 Fault Causes
    4.1.1 Cause 1: Rare Cases
    4.1.2 Cause 2: Modeling Inaccuracies
    4.1.3 Cause 3: Erroneous Evidence
5 Reinforcement Propagation
  5.1 Reinforcement Accuracy
6 Fault Monitoring
  6.1 Factor Consistency
  6.2 Consistency Measure
  6.3 Estimation of the Summary Accuracy
7 Fault Localization Algorithm
  7.1 Determining the Cause Type
8 Experiments
  8.1 Synthetic Networks
  8.2 Real World Experiment
9 Applications
  9.1 Localizing Faulty Model Components
  9.2 Deactivating Inadequate Model Components
10 Discussion
  10.1 Non-Tree Networks
  10.2 Related Work
  10.3 Conclusion

Intelligent Autonomous Systems
Informatics Institute, Faculty of Science
University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
The Netherlands

Tel (fax): +31 20 525 7461 (7490)
http://www.science.uva.nl/research/ias/

Corresponding author:
Jan Nunnink
tel: +31 20 525 7517
[email protected]
http://www.science.uva.nl/~jnunnink/

Copyright IAS, 2006


1 Introduction

Discrete Bayesian networks (BNs) are rigorous and powerful probabilistic models for reasoning about domains which contain a significant amount of uncertainty. They are often used as classifiers for decision making or state estimation in a filtering context. BNs are especially suited for modeling causal relationships [15]. Using such a causal model, we propagate the values of observed variables to probability distributions over unobserved variables [14, 7]. These posterior distributions are often the basis for a classification process. Moreover, in many applications the classification result is mission critical (e.g. situation assessment in crisis circumstances).

In this context, we emphasize the difference between generalization accuracy and classification accuracy. In general, classification is based on models that are generalizations over many different situations. Causal BNs capture such generalizations through conditional probability distributions over related events. However, accurate generalizations do not guarantee accurate classification in a particular case. In a rare situation, a set of observations could result in an erroneous classification, even if the model precisely described the true distributions. Namely, in this particular situation a certain portion of the conditional distributions does not 'support' accurate inference, since it does not model the rare case. By using such inadequate relations for inference, the posterior probability of the true (unobserved) state would be reduced. In addition, the probability of encountering situations for which a modeled relation is inadequate increases with the divergence between the true and the modeled probability distributions.

Inadequate relations influence the inference process and subsequent classification. This is reflected in the way the model 'behaves' under different circumstances. We can monitor a BN and from its behavior draw conclusions about the existence of inadequate parts. This is based on the following principle: a given classification node splits the network into several independent fragments. These fragments can be seen as different experts giving independent 'votes' about the state of that node. The degree to which they 'agree' on a state is a measure of the accuracy of the classification. The data conflict approach by Jensen et al. [8] (as well as [10] and [9]) uses a roughly similar principle. It is based on the assumption that, given an adequate model, all evidence should be correlated and hence the joint probability of all evidence should be greater than the product of the individual evidence probabilities. More discussion of related approaches can be found in Section 10.2.

Using our measure, we present a method for localizing possible inaccuracies in general, and show how to use this method to determine the type of cause. The advantage of this method over previous work is that we can estimate a lower bound on its effectiveness, and that this lower bound has asymptotic properties given the network topology. This allows one to determine for which networks localization works best. Furthermore, the localization procedure is more straightforward than the existing methods.

Possible applications for the proposed method are presented in Section 9. Among its uses are:

• Localized model errors can be manually corrected or relearned.

• Localized inaccurate information sources can be deactivated at run-time to improve the classification.

2 Bayesian networks and classification

A Bayesian network BN is defined as a tuple ⟨D, p⟩, where D = ⟨V, E⟩ is a directed acyclic graph (DAG) consisting of a set of nodes V = {V_1, ..., V_n} and a set of directed edges ⟨V_i, V_j⟩ ∈ E between pairs of nodes. Each node corresponds to a variable, and p is the joint probability distribution over all variables, defined as

p(\mathbf{V}) = \prod_{V_i \in \mathbf{V}} p(V_i \mid \pi(V_i)),

where p(V_i | π(V_i)) is the conditional probability table (CPT) for node V_i given its parents π(V_i) in the graph. In this paper we use \hat{p} (with a hat) to refer specifically to estimated values, such as modeling parameters (CPTs) and posterior probabilities, while we use p (without a hat) for the true probabilities in the modeled world. We assume that all probabilities that we can estimate have a corresponding true value in the real world.

[Figure 1: An example Bayesian network. Nodes {A, B, C, E} represent fragment F^H_1, which is rooted in node H. H has a branching factor of 3; A has a branching factor of 2.]

Each variable has a finite number of discrete states (or values), denoted by lower-case letters. The DAG represents the causal structure of the domain, and the conditional probabilities encode the causal strength.

Furthermore, we will denote a set of observations or evidence about the state of variables by \mathcal{E}, the main classification or hypothesis node by H, and the states of that node by h_i. The (hidden) true state of H is denoted by h^*.

A variable H is classified as the state h_i for which

h_i = \arg\max_{h_j} \hat{p}(H = h_j, \mathcal{E}).   (1)

\hat{p}(H = h_j, \mathcal{E}) is obtained from the joint probability distribution by marginalizing over all variables except H and those in \mathcal{E}, and factorizing the joint probability distribution, as follows:

\hat{p}(H = h_j, \mathcal{E}) = \sum_{\mathbf{V} \setminus H} \prod_{V_i \in \mathbf{V}} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{\mathbf{e} \in \mathcal{E}} \mathbf{e}.   (2)

We distinguish between evidence variables and non-evidence variables. Where a non-evidence variable (except H) appears in the equation, we marginalize it out by summing over its states. Where an evidence variable appears, it is replaced by its observed state. This is done through multiplication with the vector \mathbf{e}, which contains 1s at the entries corresponding to the observed states and 0s elsewhere. Note that this involves the multiplication of potentials; see also [7], Section 1.4.6.
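As an illustration of (1) and (2), the following minimal Python sketch classifies a two-node model H → E by enumeration; the model, its numbers, and all names are hypothetical and ours, not taken from the paper.

```python
# A minimal sketch of classification by enumeration, equations (1)-(2).
p_H = {"h1": 0.5, "h2": 0.5}                      # prior p(H)
p_E_given_H = {"h1": {"e1": 0.8, "e2": 0.2},      # CPT p(E | H)
               "h2": {"e1": 0.3, "e2": 0.7}}
e = {"e1": 1, "e2": 0}                            # evidence vector: E = e1 observed

def joint(h):
    # p(H = h, E): sum over the states of E, with the evidence vector inserted
    return p_H[h] * sum(p_E_given_H[h][ev] * e[ev] for ev in e)

classification = max(p_H, key=joint)              # equation (1)
assert classification == "h1"                     # p(h1, e1) = 0.4 > p(h2, e1) = 0.15
```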

2.1 Factorization

For the analysis in the next sections we require a notion of dependence between different parts of a network given the classification variable. We can use d-separation [14] to identify fragments of the DAG which are conditionally independent given the classification node.

Definition 1 Given a DAG and classification node H, we identify a set of fragments F^H_i (i = 1, ..., k) which are all pairwise d-separated given H. A fragment is defined as a set of nodes that includes H. Hence, the nodes in a fragment are conditionally independent of the nodes in the other fragments given H. Node H is called the root of all fragments F^H_i. The number of fragments rooted in H, k, is called the branching factor.


See Figure 1 for an example (we will use this example throughout the paper). Given node H, 3 fragments can be seen, namely (i) nodes {A, B, C, E, H}, (ii) nodes {F, H}, and (iii) nodes {G, H}. If A were the classification variable, we could identify 2 fragments, namely (i) nodes {A, C, E}, and (ii) the rest of the nodes plus A.

This fragmentation has the useful property that the classification equation (2) can be factorized such that each factor corresponds one-to-one with a fragment. By splitting the sum and product and regrouping them per fragment, we rewrite (2) as

\hat{p}(H = h_i, \mathcal{E}) = \sum_{\mathbf{V} \setminus H} \prod_{V_i \in \mathbf{V}} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{\mathbf{e} \in \mathcal{E}} \mathbf{e}
 = \underbrace{\hat{p}(h_i \mid \pi(H)) \sum_{\mathbf{V}_1 \setminus H} \prod_{V_i \in \mathbf{V}_1 \setminus H} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{\mathbf{e} \in \mathcal{E}_1} \mathbf{e}}_{\phi_1(h_i)} \cdots \underbrace{\sum_{\mathbf{V}_k \setminus H} \prod_{V_i \in \mathbf{V}_k \setminus H} \hat{p}(V_i \mid \pi(V_i)) \cdot \prod_{\mathbf{e} \in \mathcal{E}_k} \mathbf{e}}_{\phi_k(h_i)},   (3)

where we partition the complete set of variables \mathbf{V} into k subsets \mathbf{V}_i, each of which consists of the nodes in the corresponding fragment F^H_i. \mathcal{E}_i denotes the subset of evidence in fragment F^H_i. We can identify factors φ_j(h_i) (j = 1, ..., k) whose product is the joint probability for each h_i. Φ^H denotes the set of all factors associated with node H, and φ_j(h_i) denotes the value of factor φ_j for h_i. The d-separation between the fragments directly implies the following:

Proposition 1 The factors φj in (3) are mutually independent, given classification variable H.

A factor is independent in the sense that a change in its value does not change the value of another factor.

For example, consider the BN shown in Figure 1. H is the classification variable and the evidence is \mathcal{E} = {e_1, f_1, g_2}. The fragments are F^H_1 = {A, B, C, E, H}, F^H_2 = {F, H} and F^H_3 = {G, H}, and the evidence sets for the fragments are \mathcal{E}_1 = {e_1}, \mathcal{E}_2 = {f_1} and \mathcal{E}_3 = {g_2}. The factorization becomes

\hat{p}(H = h_i, \mathcal{E}) = \underbrace{\sum_A \hat{p}(A) \sum_C \hat{p}(C \mid A)\,\hat{p}(e_1 \mid C) \sum_B \hat{p}(B)\,\hat{p}(h_i \mid A, B)}_{\phi_1(h_i)} \; \underbrace{\hat{p}(f_1 \mid h_i)}_{\phi_2(h_i)} \; \underbrace{\hat{p}(g_2 \mid h_i)}_{\phi_3(h_i)}.   (4)
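As a small illustration, the sketch below computes the two single-edge factors of (4) for hypothetical CPT values; the first factor φ_1 would additionally carry the sums over A, B and C shown in the equation. All numbers and names are ours.

```python
# A sketch of the single-edge factors phi_2 and phi_3 in (4); CPT numbers hypothetical.
p_F_given_H = {"h1": {"f1": 0.6, "f2": 0.4},   # p(F | H)
               "h2": {"f1": 0.1, "f2": 0.9}}
p_G_given_H = {"h1": {"g1": 0.7, "g2": 0.3},   # p(G | H)
               "h2": {"g1": 0.2, "g2": 0.8}}

phi_2 = {h: p_F_given_H[h]["f1"] for h in ("h1", "h2")}   # phi_2(h) = p(f1 | h)
phi_3 = {h: p_G_given_H[h]["g2"] for h in ("h1", "h2")}   # phi_3(h) = p(g2 | h)

# The joint p(H = h, E) is the product phi_1(h) * phi_2(h) * phi_3(h); the factors
# are independent: changing p(G | H) changes phi_3 but leaves phi_1 and phi_2 intact.
```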

3 Classification Accuracy

Recall that h^* denotes the true (hidden) state of variable H and that we classify H according to (1). We define the accuracy of a classification as follows:

Definition 2 A classification of variable H, given evidence \mathcal{E}, is accurate iff

h^* = \arg\max_{h_i} \hat{p}(H = h_i, \mathcal{E}).   (5)

In other words, the true state should have the greatest estimated probability.

It is difficult to directly analyze the (in)accuracy and its causes from this definition. Therefore, we plug the factorization from the previous section into (5). Recall that φ_i(h_j) is the value of factor φ_i for state h_j of the classification node H, given evidence \mathcal{E}_i. Since factors φ_1(h_j), ..., φ_k(h_j) are mutually conditionally independent, we can define the following accuracy condition for factors:


Definition 3 Factor φ_i supports an accurate classification iff

h^* = \arg\max_{h_j} \phi_i(h_j).   (6)

We say that φ_i reinforces the state \arg\max_{h_j} \phi_i(h_j) of H.

In other words, we call a factor accurate if it gives the true state a greater value than all other states of H. The intuition behind this is that a factor is accurate if it contributes to an accurate classification, which is made clear through the following observation:

Consider a BN and a factorized probability distribution \hat{p}(H, \mathcal{E}). Suppose we augment its DAG by adding a new fragment F^H_i, which corresponds to a new factor φ_i in the factorization. If φ_i satisfies Definition 3, then it will contribute towards obtaining an accurate classification by increasing the probability of the true state h^* relative to all other states h_k ≠ h^*. Let \mathcal{E}' denote the union of the original evidence \mathcal{E} and the evidence from the new fragment. The relative probability for H can be expressed as

\forall h_k \neq h^*: \quad \frac{\hat{p}(h^*, \mathcal{E}')}{\hat{p}(h_k, \mathcal{E}')} = \frac{\phi_i(h^*)}{\phi_i(h_k)} \cdot \frac{\hat{p}(h^*, \mathcal{E})}{\hat{p}(h_k, \mathcal{E})} > \frac{\hat{p}(h^*, \mathcal{E})}{\hat{p}(h_k, \mathcal{E})}.   (7)

This inequality is obvious, since φ_i(h^*)/φ_i(h_k) > 1 for all h_k ≠ h^* if φ_i satisfies Definition 3.

Summarizing, a classification is accurate as long as the joint probability for the true state is the greatest. This happens if a sufficient number of factors contribute to an accurate classification by satisfying Definition 3.

4 Fault Probability

In Definition 3, we say that a factor does not support accurate classification if, given the evidence, it does not satisfy Condition (6). In that case, we call the factor or the corresponding fragment inadequate for the current classification task. We will also use the term fault to denote the same violation of (6).

A factor φ_i(H) is obtained by combining parameters from one or more CPTs from the corresponding network fragment F^H_i. This combination depends on the evidence, which in turn depends on the true distributions over the modeled events. Thus, with a certain probability we encounter a situation in which factor φ_i(H) is adequate, i.e. it satisfies condition (6). We can show that this probability depends on the true distributions and on simple relations between the true distributions and the CPT parameters.

We can facilitate further analysis by using the concept of factor reinforcements to characterize the influence of a single CPT. For the sake of clarity, we focus on diagnostic inference only. For example, consider a CPT p(E|H) relating variables H and E. If we assume that one of the two variables is instantiated, we can compute a reinforcement at the other, related variable. For the instantiation E = e^*, one of the factors associated with H can be expressed as φ_i(H) = \hat{p}(e^* | H). In this situation factors are identical to CPT parameters, and the adequacy of CPTs can be defined.

If all CPTs from F^H_i were adequate in a given situation, then φ_i would also be adequate. This is often not the case, however. Whether a factor φ_i is adequate depends on which CPTs from the corresponding fragment are inadequate. Obviously, the higher the probability that any CPT from a fragment F^H_i is adequate, the higher the probability that F^H_i is adequate as well.

We need to assume a lower bound p_re on the probability that, in a certain case, a single CPT is adequate and supports an accurate reinforcement. In this section we argue that p_re > 0.5 is a plausible assumption. Consider a fragment F^H_i consisting of only two adjacent nodes H and E, whose relation is defined by a single CPT. Definition 3 implies that F^H_i is adequate if the corresponding factor φ_i(h_j) is greatest for the true state h^*. Thus, the CPT is adequate if the state h^* causes such evidence e_i that, after instantiation of the corresponding variable in F^H_i, factor φ_i(h^*) is greatest.

For each state h_i of H we first define the set of states of E for which the CPT parameters satisfy condition (6):

B_{h_i} = \{e_k \mid \forall h_j \neq h_i : \hat{p}(e_k \mid h_i) > \hat{p}(e_k \mid h_j)\}.

In addition, for each possible state h_i we can express the probability p_{h_i} that a state from B_{h_i} will take place:

p_{h_i} = \sum_{e_j \in B_{h_i}} p(e_j \mid h_i),   (8)

where p(e_j | h_i) describes the true distributions. In other words, p_{h_i} is the probability that cause h_i will result in an effect for which the CPT parameters satisfy (6). p_re is defined as the lower bound on the probability that a CPT p(E|H) will be adequate: p_re = min_i(p_{h_i}).

      h1    h2
b1    0.7   0.4
b2    0.2   0.3
b3    0.1   0.3

Table 1: Example p(B|H).

For example, let us assume a simple model consisting of two nodes B and H related through the CPT shown in Table 1, which is identical to the true probabilities in the domain. Suppose that h_2 is the (hidden) true state of H, so h^* = h_2. For the corresponding factor φ(H) = p(B|H) to be adequate, h_2 should cause evidence b_k such that the factor reinforces h_2 (see Definition 3). We can see that the evidence states for which h_2 = arg max φ(h_i) are b_2 and b_3 (in this case arg max φ(h_i) returns the maximum value on a row of the CPT). The probability that either of these states is caused by h_2 is p(b_2 ∨ b_3 | h_2) = 0.6. Similarly, if h_1 were the true state of H, then we would get p(b_1 | h_1) = 0.7. Thus, whichever the true state of H, for this example CPT the probability that factor φ(H) is adequate, p_re, is at least 0.6.
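The worked example can be reproduced with a short sketch; the dictionary layout and function name are our assumptions, and, as in the example, the CPT is taken to equal the true distribution.

```python
# A sketch computing p_re for a two-node fragment from its CPT, following the
# definitions of B_{h_i} and p_{h_i} above. CPT values taken from Table 1.
cpt = {"h1": {"b1": 0.7, "b2": 0.2, "b3": 0.1},
       "h2": {"b1": 0.4, "b2": 0.3, "b3": 0.3}}

def p_re(cpt):
    evidence_states = next(iter(cpt.values())).keys()
    p_h = {}
    for h in cpt:
        # B_h: evidence states whose CPT value is strictly maximal for state h
        B_h = [e for e in evidence_states
               if all(cpt[h][e] > cpt[h2][e] for h2 in cpt if h2 != h)]
        p_h[h] = sum(cpt[h][e] for e in B_h)   # probability h causes evidence in B_h
    return min(p_h.values())

assert abs(p_re(cpt) - 0.6) < 1e-9   # matches the worked example: min(0.7, 0.6)
```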

A consequence of Definition 3 is that p_re does not change even if the values in the CPT change to \hat{p}(b_k | h_j) ≠ p(b_k | h_j), as long as simple inequality relations between the CPT values and the true distributions are satisfied:

\forall b_k: \quad \arg\max_{h_j} \hat{p}(b_k \mid h_j) = \arg\max_{h_j} p(b_k \mid h_j),   (9)

where p denotes the true distribution in the problem domain. Note that this relation is very coarse, and we can assume that it can easily be identified by model builders or learning algorithms. For a more thorough discussion see [13].

4.1 Fault Causes

A CPT does not support accurate classification in a given situation, i.e. it is inadequate, if it does not satisfy Condition (6). We identify three types of faults that cause inadequacies.

4.1.1 Cause 1: Rare Cases

Suppose a CPT is correct in the sense that the CPT parameters are sufficiently close to the true probabilities in order to satisfy (9). In a rare case, however, the true state is not the most likely state given the effect b^* that materialized: h^* ≠ arg max_{h_j} p(b^* | h_j). Then, the case can get misclassified, since the model reinforces the most likely state.

As an example, consider a simple domain where the distribution over the binary variables F (fire) and S (smoke) is given by p(s|f) = 0.7, p(¬s|¬f) = 0.7 and p(f) = 0.5. If the world were in the rare case {f, ¬s}, where we observe S = ¬s, inference would decrease the probability of the true state f, violating Condition (6).

4.1.2 Cause 2: Modeling Inaccuracies

Alternatively, the CPT parameters might not satisfy (9). Then, even if a case is common, the true state of H is not reinforced. Considering the rationale from the previous section, we assume that this fault type is not frequent.

In other words, consider a fragment containing evidence b_i. If the true probabilities satisfy p(b_i | h^*) > p(b_i | h_i) for all i, but the model parameters satisfy \hat{p}(b_i | h_i) > \hat{p}(b_i | h^*) for some i, the CPT does not support accurate classification. We call this a model inaccuracy.

4.1.3 Cause 3: Erroneous Evidence

The evidence inserted into a BN is typically provided by other systems, such as sensors, databases or humans. Observation and interpretation of signals from the world can, however, be influenced by noise or system failures, possibly leading to wrong classifications.

5 Reinforcement Propagation

We introduce a coarse inference algorithm which propagates factor reinforcements through a tree-structured DAG. It only propagates reinforcements from leaves to roots, i.e. it only 'collects' evidence for diagnostic inference. As we show later, with the help of this algorithm we can monitor a model's runtime 'behavior', which can give clues about the adequacy of CPTs. The algorithm is based on the concept of a factor reinforcement, which was already mentioned in Definition 3 but is made more formal here.

Definition 4 (Factor Reinforcement) Given a classification variable H, a fragment F^H_i and some instantiation of the evidence variables in F^H_i, we define the corresponding factor reinforcement R_H(φ_i):

R_H(\phi_i) = \arg\max_{h_j} \phi_i(h_j).   (10)

In other words, the reinforcement R_H(φ_i) is a function that returns the state h_j of variable H whose probability is increased the most (i.e. is reinforced) by instantiating the nodes of fragment F^H_i. For example, given factorization (4), we obtain three reinforcements for H. If a factor φ_i is accurate (see Definition 3), then R_H(φ_i) = h^*.

Moreover, for any node H we can count how many of its fragments reinforced each of its states. Let n_i be the number of factors reinforcing state h_i. We call \mathcal{N} = {n_1, ..., n_m} the set of reinforcement counters, where m is the number of states of H. n_i is defined as

n_i = \| \{\phi_j \in \Phi^H \mid h_i = R_H(\phi_j)\} \|,   (11)

where ‖·‖ denotes the size of a set. Suppose that in our running example the reinforcements were h_1, h_2 and h_1. If H has 3 states, then \mathcal{N} = {2, 1, 0}.

Next, classification chooses the state h_i which got reinforced by the most factors, i.e. which had the greatest reinforcement counter:


Definition 5 (Reinforcement Summary) The reinforcement summary S_H of a node H is defined as:

S_H = h_i \text{ such that } \forall j \neq i: n_i \geq n_j.   (12)

If H is an evidence node, then S_H is defined as the observed state of H.

In our example, where \mathcal{N} = {2, 1, 0}, we get S_H = h_1.

For BNs with tree-structured DAGs we can summarize the definitions presented above into a coarse inference process. We assume that the evidence nodes are the tree's leaves and that the classification node is the tree's root. Consider a set V consisting of all leaf nodes. We define a set P = {N_i ∉ V | children(N_i) ⊆ V} consisting of all nodes not in V whose children are all elements of V. For each parent Y ∈ P we determine the reinforcement summary S_Y resulting from the propagation from its children. Every parent node Y is then instantiated as if the reinforcement summary state returned by S_Y were observed. We then set V ← V ∪ P, and the procedure is repeated until the reinforcement summary is determined at the root node H. This implies that at all times all nodes in the set V are leaves and/or instantiated nodes, so the reinforcement summaries can be computed. The procedure is summarized in Algorithm 1.

Algorithm 1: Reinforcement Propagation Algorithm
Collect all leaf nodes in the set V;
Find P = {N_i ∉ V | children(N_i) ⊆ V}, the set of nodes whose children are all in V;
if P ≠ ∅ then
    for each node Y ∈ P do
        Find the set σ(Y) of all instantiated children of Y;
        for each node X_i ∈ σ(Y) do
            Compute the reinforcement R_Y(φ_i) at node Y caused by the instantiation of X_i;
        end
        Compute the reinforcement summary S_Y at node Y;
        Instantiate node Y as if S_Y were observed (hard evidence);
    end
    Make the parent nodes elements of V: V ← V ∪ P;
    Go to step 2;
else
    Stop;
end

With this algorithm, we obtain S_X for all unobserved variables X by recursively applying Definitions 4 and 5.
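The Python sketch below is one possible rendering of Algorithm 1 for a tree given as parent-to-children dictionaries, written recursively rather than with the explicit sets V and P; the data layout and all names are our assumptions, not the paper's.

```python
# A sketch of Algorithm 1 (reinforcement propagation) on a tree-structured BN.
from collections import Counter

def reinforcement(cpt, child_state):
    """R_Y(phi_i): the parent state maximizing p(child_state | parent).
    cpt[parent_state][child_state] = p(child_state | parent_state)."""
    return max(cpt, key=lambda ps: cpt[ps][child_state])

def summary(node, children, cpts, observed):
    """Reinforcement summary S at `node` (Definitions 4 and 5).
    children[n]: child nodes of n; cpts[c]: CPT p(c | parent); observed[n]: leaf state."""
    if not children[node]:                 # evidence node: S is the observed state
        return observed[node]
    counters = Counter()                   # reinforcement counters N at `node`
    for c in children[node]:
        s_c = summary(c, children, cpts, observed)     # summary at the child...
        counters[reinforcement(cpts[c], s_c)] += 1     # ...treated as hard evidence
    return counters.most_common(1)[0][0]   # state with greatest counter (ties arbitrary)
```

Calling `summary` on the root H returns S_H; ties are broken arbitrarily, as permitted by (12).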

In BNs with tree-like DAGs and binary nodes, Algorithm 1 corresponds to a system of hierarchical decoders that implement the repetition coding technique known from information theory (see for example [11], Chapter 1). This implies asymptotic properties. Namely, given p_re > 0.5, as the branching factors increase, the probability that S_H = h^* at any node approaches 1. This property can be explained through binomial distributions, as we will show in the next section.

5.1 Reinforcement Accuracy

While p_re > 0.5 is a lower bound on the probability that a particular CPT provides an accurate reinforcement, p_f denotes the probability that a factor reinforcement resulting from Algorithm 1 is accurate. This lower bound will be necessary for the analysis in the upcoming sections.

First we need to make the following assumptions:


Assumption 1 The BN graph contains a high number of nodes with many conditionally independent fragments. Hence, a high number of independent factors can be identified.

Assumption 2 The probability p_re that any CPT supports an accurate classification in a given situation is greater than 0.5 (Section 4 provides a rationale for this assumption).

Proposition 2 Given Assumptions 1 and 2 and sufficiently high branching factors, the probability p_f that the true state of a classification node will be reinforced by a factor is greater than 0.5.

Proof (Sketch) The factor reinforcement is calculated recursively using Definitions 4 and 5, beginning at the leaf nodes and ending at the classification node. We show that Proposition 2 holds at each recursion step.

Let H be a classification node with k factors, let φ_i be one of the factors associated with H, and let G be the child of H from the fragment corresponding to φ_i (see for example Figure 1). We can write p_f for factor φ_i as:

p_f = p_{re}\, p_{sum} + \alpha (1 - p_{re})(1 - p_{sum}),   (13)

where p_sum is the probability that the reinforcement summary at node G equals the true state of G. If the reinforcement summary is accurate and the fragment between H and G is adequate, then the reinforcement at H is accurate. The second term represents the situation where the reinforcement summary at G is inaccurate and the fragment between G and H contains a fault; these two errors can cancel each other out, which results in an accurate reinforcement at H. The scalar α ∈ (0, 1] represents the probability that such a situation occurs. Note that for binary variables α = 1.

Next, let p_f be the minimum p_f over all factors associated with H. From Definition 5 we can give a lower bound on p_sum for node H:

p_{sum} \geq \sum_{m = \lceil k/2 \rceil}^{k} \binom{k}{m} p_f^m (1 - p_f)^{k-m}.   (14)

This is a lower bound because the reinforcement summary is defined as the state with the maximum reinforcement counter, which is less restrictive than the absolute majority (⌈k/2⌉) used in (14).

Assumption 2 states that p_re > 0.5, and therefore (13) implies that there exists a sufficiently high p_sum for which p_f > 0.5. (14) implies that a sufficiently high p_sum can be obtained if p_f > 0.5 and k is sufficiently high. The recursion starts with the leaf nodes, for which p_sum = 1 since they are instantiated.

Thus, if a network contains enough fragments (Assumption 1) and p_re > 0.5 (Assumption 2), then p_f > 0.5 for all classification nodes. □

For the complete proof see [13]. Additionally, from the above analysis we can observe the following property:

Corollary 3 p_f will increase and approach 1 as the branching factors increase.
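A small sketch of the bound (14) illustrates Corollary 3 numerically; the function name is ours and the example values of p_f and k are arbitrary.

```python
# Lower bound (14) on p_sum: the probability that an absolute majority of k
# independent reinforcements, each accurate with probability p_f, is accurate.
from math import comb, ceil

def p_sum_bound(p_f: float, k: int) -> float:
    return sum(comb(k, m) * p_f**m * (1 - p_f)**(k - m)
               for m in range(ceil(k / 2), k + 1))

# With p_f = 0.6: k = 3 gives 0.648 and k = 7 gives 0.710, rising toward 1 with k.
```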

6 Fault Monitoring

We want to estimate the adequacy of a particular model fragment for a particular case. It is clear that we cannot directly apply Definition 3, because we do not know the true state of hidden variables and thus cannot evaluate Condition (6). We will show in this section, however, that given certain (in)accuracies a model will 'behave' in a certain way. We will call this behavior the model response, describe it in terms of the reinforcements from the previous section, and show that it can give clues to the existence of inaccuracies.

6.1 Factor Consistency

Since the true state of a hidden variable is unknown, it is impossible to directly determine whether or not R_H(φ_i) = h^* holds. We can, however, use the following definition, whose condition is directly observable and which describes the relationship between multiple factors:

Definition 6 (Factor Consistency) Given any node H, a set of factors Φ^H is consistent iff

\forall \phi_i, \phi_j \in \Phi^H: \quad R_H(\phi_i) = R_H(\phi_j).

The factors are thus consistent if they reinforce the same state of H. Given the fact that there can be only one true state h^* at a given moment, we observe that if each element of a set of factors Φ^H satisfies the condition in Definition 3, then that set must be consistent.

If a set of factors is not consistent, then there exist elements of that set that do not satisfy the condition in Definition 3. Obviously, through various faults we will observe inconsistent factor sets in most situations. In that case we should determine which of the factors in an inconsistent set violate Definition 3. We will next show how this can be achieved, using the result from Proposition 2 and by introducing a consistency measure.

6.2 Consistency Measure

We define a measure for the degree of consistency of any factor φ_i with respect to the observed factor reinforcements of all factors in a set Φ^H.

Definition 7 (Consistency Measure) Given a node H, a set of factors Φ^H, and a reinforcement counter n_i for each state h_i (see Section 5), the consistency measure for a factor φ_i ∈ Φ^H is defined as:

C_H(\phi_i) = n_j - \max_{k \neq j} n_k,

where h_j is the state of H that was reinforced by factor φ_i.

In other words, the consistency measure for a factor φ_i is equal to the number of factors 'agreeing' with φ_i (including φ_i itself), minus the maximum number of reinforcements that any other state of H got.

For the running example, where the reinforcements were R_H(φ_1) = R_H(φ_3) = h_1 and R_H(φ_2) = h_2, we get C_H(φ_1) = 1 and C_H(φ_2) = −1. Using this definition we can describe certain relations between the value of the consistency measure and the estimated factor accuracy.
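Definition 7 translates directly into a few lines of Python; the list-based representation of the reinforcements is our assumption.

```python
# A sketch of the consistency measure C_H (Definition 7).
from collections import Counter

def consistency(reinforcements, i):
    """C_H(phi_i), where reinforcements[j] is the state R_H(phi_j)."""
    counts = Counter(reinforcements)             # reinforcement counters n_i
    h_j = reinforcements[i]                      # state reinforced by phi_i
    rest = [n for h, n in counts.items() if h != h_j]
    return counts[h_j] - max(rest, default=0)

# Running example: R_H(phi_1) = R_H(phi_3) = h1, R_H(phi_2) = h2
assert consistency(["h1", "h2", "h1"], 0) == 1    # C_H(phi_1)
assert consistency(["h1", "h2", "h1"], 1) == -1   # C_H(phi_2)
```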

6.3 Estimation of the Summary Accuracy

We use p = pf > 0.5 as the a priori probability that a reinforcement equals the true state,RH(φi) = h∗ (recall Proposition 2). Consider a node H = {h1, . . . , hm} and associated rein-forcement counters N = {n1, . . . , nm}. N is the sum over N , and thus the total number offactors. The conditional probability that any particular state hi equals the true state h∗, giventhat we observed the reinforcement set N and assuming uniform priors over h∗, can be expressedas

p(hi = h∗|N ) =pni(1− p)N−ni

∑j pnj (1− p)N−nj

.

10 Fault Localization in Bayesian Networks

The numerator consists of the probability of a correct reinforcement to the power of the numberof reinforcements supporting hi, times the probability that a reinforcement is inaccurate to thepower of the number of reinforcements not supporting hi. The denominator normalizes thedistribution.
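This posterior is easy to evaluate; the sketch below is a direct transcription, with a hypothetical example using the running counters N = {2, 1, 0}.

```python
# A sketch of the posterior p(h_i = h* | N) over which state is the true one,
# given reinforcement counters N and per-reinforcement accuracy p (> 0.5).
def state_posterior(counters, p):
    N = sum(counters)
    weights = [p**n * (1 - p)**(N - n) for n in counters]
    z = sum(weights)                 # normalizing denominator
    return [w / z for w in weights]

# e.g. counters [2, 1, 0] with p = 0.7 give roughly [0.62, 0.27, 0.11]
```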

We want to determine exactly for which degree of consistency C_H this conditional probability is greater than 0.5. This is the case if

p^{n_i} (1-p)^{N - n_i} > \sum_{j \neq i} p^{n_j} (1-p)^{N - n_j}.

Because of the sum term in the equation, this is difficult to express in terms of C_H. We take an upper bound of the right-hand side of the inequality; if this new inequality is satisfied, the original is satisfied as well. The upper bound we use is n \max_x x \geq \sum_x x, i.e. a sum of n terms is at most n times its maximum term. If we define c = p/(1-p), then this becomes:

c^{n_i} > (m-1)\, c^{\max_{j \neq i} n_j},

which is equivalent to:

n_i - \max_{j \neq i} n_j > \frac{\log(m-1)}{\log c}.   (15)

The left-hand side is now equal to the consistency measure C_H.

We also want to determine exactly when p(h_i = h^* | \mathcal{N}) is smaller than 0.5. This is true if

p^{n_i} (1-p)^{N - n_i} < \sum_{j \neq i} p^{n_j} (1-p)^{N - n_j}.

We now take a lower bound of the right-hand side of the inequality, so if this new inequality is satisfied, the original is satisfied as well. Using the lower bound \max_x x \leq \sum_x x:

c^{n_i} < c^{\max_{j \neq i} n_j},

which is equivalent to:

n_i - \max_{j \neq i} n_j < 0,   (16)

and we derive the following implications:

C_H < 0 \;\Rightarrow\; p(h_i = h^* \mid \mathcal{N}) < 0.5   (17)

C_H > \frac{\log(m-1)}{\log c} \;\Rightarrow\; p(h_i = h^* \mid \mathcal{N}) > 0.5   (18)

Here C_H denotes C_H(φ), h_i = R_H(φ), and m is the number of states of H. These implications give the probability of an accurate factor reinforcement, given its consistency measure. This allows us to use an observable quantity (the reinforcement counters) to derive the probability that a particular fragment is adequate. Thus, if a factor has a negative consistency measure, the corresponding fragment probably introduces a fault.

Implication (18) is not trivial to interpret, since the condition depends on the unknown factor c = p/(1 − p). It turns out, however, that without knowing the exact value of c we can often specify an adequate C_H which makes implication (18) valid. It is important to note that the consistency measure can only take integer values. For example, any value of log(m−1)/log c < 1 requires C_H to be at least 1 in order to satisfy (18). The condition log(m−1)/log c < 1 is satisfied if p ∈ ⟨p_min, 1]. Table 2 shows the lower bound of the interval ⟨p_min, 1] for which different values of C_H are adequate. These bounds also depend on m. If m = 2, then C_H = 1 is adequate for any p ∈ ⟨0.5, 1]. Recall that we already assumed p to be greater than 0.5.


m    C_H = 1   C_H = 2
2    0.50      0.50
3    0.66      0.58
4    0.75      0.63
5    0.80      0.66

Table 2: Minimum value of p that is sufficient to satisfy (18) given a certain value of C_H and m.
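Table 2 follows from solving log(m−1)/log c < C_H for p, i.e. c = p/(1−p) > (m−1)^(1/C_H); a small sketch (our function name) reproduces the values.

```python
# A sketch reproducing Table 2: the smallest p for which a consistency measure
# of C_H already guarantees p(h_i = h* | N) > 0.5 via implication (18).
def p_min(m: int, C_H: int) -> float:
    r = (m - 1) ** (1.0 / C_H)      # need c = p/(1-p) > (m-1)^(1/C_H)
    return r / (1 + r)

for m in (2, 3, 4, 5):
    print(m, round(p_min(m, 1), 2), round(p_min(m, 2), 2))
# 2 0.5 0.5 / 3 0.67 0.59 / 4 0.75 0.63 / 5 0.8 0.67 (Table 2 truncates the decimals)
```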

[Figure 2: (a) Network section. (b) Comparison at node Y. (c) Comparison at node X.]

7 Fault Localization Algorithm

Depending on which CPTs from a network fragment F^H_i are inadequate in a given situation, the resulting factor φ_i might be inaccurate as well. We can often localize inadequate CPTs by using (17) and (18).

Consider a network section consisting of two adjacent nodes, X and Y (see Figure 2). First we consider one particular fragment F^Y_i rooted in Y. At run-time, the consistency measure C_Y(φ_i) can be obtained at node Y for the factor corresponding to F^Y_i (see Figure 2b). This measure, combined with (17) and (18), indicates whether F^Y_i up until node Y is adequate.

Let F′ be fragment F^Y_i plus the edge ⟨X, Y⟩. F′ would be a fragment of X if we removed all fragments of Y except F^Y_i from the graph (see Figure 2c). Let φ′_i be its corresponding factor. We can observe the consistency measure C_X(φ′_i) at node X for fragment F′. To compute this consistency, we need to know the reinforcement R_X(φ′_i). This can be obtained using the reinforcement propagation algorithm by ignoring the reinforcements from all fragments rooted in Y, except for F^Y_i. We then compare the reinforcement of φ′_i on node X with the reinforcements of all other factors of X, and obtain the consistency (see Figure 2c). Again, this gives an indication of the adequacy of the fragment F^Y_i, this time including the edge ⟨X, Y⟩.

These two consistency measures combined indicate the adequacy of the CPT parameters p(Y|X) corresponding to the edge ⟨X, Y⟩. We use the following rule:

Rule 1 Let θ_t and θ_f be thresholds on the consistency measure. If, for any node X, we observe C_X(φ) > θ_t, then we assume x^* = R_X(φ). If we observe C_X(φ) < θ_f, then we assume x^* ≠ R_X(φ).

Given this rule, we can determine the adequacy of the CPT parameters p(Y|X) based on the following intuition: if a fragment is adequate up to Y, but the extended fragment is inadequate up to X, then the fault lies with the edge ⟨X, Y⟩. All such localization rules are shown in Table 3.

x^* = R_X(φ'_i)   y^* = R_Y(φ_i)   edge ⟨X, Y⟩
true              true             ⇒ ok
true              false            ⇒ inadequate
false             true             ⇒ inadequate
false             false            ⇒ ok

Table 3: Localization rules. The values in the first two columns correspond to the truth of the equality in the column header.

In other words, we compare the consistency at two adjacent nodes and classify the edge between the nodes as adequate or inadequate. We can show that the use of Rule 1 in conjunction with appropriate thresholds will guarantee that in most cases the (in)adequacy of the CPT corresponding to ⟨X, Y⟩ is correctly determined.

Proposition 4 (Fault Localization) Given a network with binary nodes containing a sufficient number of fragments and p_f > 0.5 (see Section 5.1), fault localization based on Rule 1 and Table 3 with thresholds θ_t = log(m−1)/log c and θ_f = 0 will correctly determine whether a particular CPT p(Y|X) is adequate or inadequate with more than 50% chance.

Proof If for any node A and factor φ we observe C_A(φ) > θ_t = log(m−1)/log c, then Rule 1 tells us to assume that a^* = R_A(φ). (18) implies that, given p_f > 0.5, the probability p^*_A that this assumption is correct, namely that a^* truly equals R_A(φ), is p^*_A = p(R_A(φ) = a^* | \mathcal{N}) > 0.5.

Analogously, if for any node A and factor φ we observe C_A(φ) < 0, then Rule 1 tells us to assume that a^* ≠ R_A(φ). (17) implies that, given p_f > 0.5, the probability p^*_A that this assumption is correct, namely that a^* truly does not equal R_A(φ), is p^*_A = 1 − p(R_A(φ) = a^* | \mathcal{N}) > 0.5.

This holds for nodes X and Y from Table 3, and thus the probability that we choose the right states in the first two columns of Table 3, and thereby draw the right conclusion about the edge ⟨X, Y⟩, is p^*_X · p^*_Y, where both p^*_X > 0.5 and p^*_Y > 0.5 as shown above. If we wrongly choose the states of both columns, we will draw the same conclusion. The total probability of drawing the correct conclusion is therefore p_correct = p^*_X · p^*_Y + (1 − p^*_X)(1 − p^*_Y). Since p_correct = 1/2 + 2(p^*_X − 1/2)(p^*_Y − 1/2), it follows that if p^*_X > 0.5 and p^*_Y > 0.5 then p_correct > 0.5, and therefore fragment classification using Rule 1, Table 3 and the appropriate thresholds is correct with more than 50% chance. □

We can apply such analysis to all non-terminal nodes by running Algorithm 2.

Algorithm 2: Localization Algorithm
Execute Algorithm 1, and store all factor reinforcements;
for each node X do
    for each fragment F^X_i of X do
        Let Y be the child of X within F^X_i;
        for each fragment F^Y_j of Y do
            Compute C_X(φ′) and C_Y(φ) for F^Y_j;
            Using thresholds θ_t and θ_f, and Table 3, classify CPT p(Y|X);
        end
        Use majority voting over all classifications of CPT p(Y|X) based on the different F^Y_j;
    end
end
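The inner classification step of Algorithm 2 can be sketched as follows; the representation of the consistency pairs and all names are our assumptions.

```python
# A sketch of the per-edge localization step (Rule 1 + Table 3) with majority voting.
from collections import Counter

def edge_vote(c_x, c_y, theta_t, theta_f=0):
    """One vote about CPT p(Y|X) from a single fragment pair."""
    def verdict(c):              # is the reinforced state assumed to be the true one?
        if c > theta_t:
            return True
        if c < theta_f:
            return False
        return None              # consistency between the thresholds: no decision
    vx, vy = verdict(c_x), verdict(c_y)
    if vx is None or vy is None:
        return None
    return "ok" if vx == vy else "inadequate"   # Table 3: equal verdicts -> ok

def classify_edge(pairs, theta_t):
    """Majority vote over (C_X(phi'), C_Y(phi)) pairs for the fragments F^Y_j of Y."""
    votes = Counter()
    for c_x, c_y in pairs:
        v = edge_vote(c_x, c_y, theta_t)
        if v is not None:
            votes[v] += 1
    return votes.most_common(1)[0][0] if votes else None
```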

We observe the following property of Algorithm 2:

Corollary 5 The majority voting at the end of Algorithm 2 improves with higher branching factors. Higher branching factors imply more votes about the state of a fragment and therefore a higher expected localization accuracy. This accuracy converges asymptotically to 1 as the branching factors increase.

Note that while the proof is given for networks with binary nodes, the algorithm is likely to be effective for multi-state nodes as well. In that case, the implications on the second and fourth lines of Table 3 are not necessarily valid. For example, there are rare circumstances where x^* = R_X(φ′_i) and y^* ≠ R_Y(φ_i), but where the CPT is nonetheless adequate. This is possible because multiple states of Y (including those not equal to y^*) could all reinforce the same true state x^*. See for example Table 1, where both b_2 and b_3 reinforce h_2. If h^* = h_2 and b^* = b_2, but b_3 were instantiated, then the CPT would be deemed inadequate while it was in fact not introducing a fault. If these circumstances do not occur often, then the majority voting in Algorithm 2 will mitigate their effects, and the localization will work correctly, especially if the branching factors are high. The experiments in Section 8.1 support this.

7.1 Determining the Cause Type

We can distinguish between rare cases (type 1) and model errors (type 2) by their frequency of occurrence. For this we need to perform fault localization on a BN for a set of cases. If certain fragments are diagnosed as inadequate for a large number of cases, then this is an indication that the fragment might contain erroneous parameters.

Alternatively, it might be possible to find model errors by localizing faults on a case from the domain which one knows is not rare, in other words, a case for which we know that the true state of every node is the most likely state given the evidence (see Section 4.1, cause 1). This excludes the possibility of faults due to a rare case. Any faults found are then probably caused by model errors.

8 Experiments

To verify our claims and illustrate some of the properties of Algorithm 2, we applied it to several synthetic networks in which we artificially introduced faults. We also applied it to a real network, which we adapted such that it represents an oversimplification of the problem domain, thus introducing faults.

8.1 Synthetic Networks

We generated BNs with random CPTs, using a simple tree DAG with a fixed branching factor k and 4 levels. We initialized all CPTs such that the probability p_re of a CPT being adequate (see Section 4) could be controlled. We let p_re take the values 1, 0.95, ..., 0.4. Then we generated 1000 data samples for each particular network, applied Algorithm 2 to each sample case, and observed its output. We used 0 as the positive threshold θ_t, which meant that the consistency measure had to be at least 1 for assuming that a CPT is adequate. Even though the algorithm does not know the value of c in (18), it turned out that Algorithm 2 is quite insensitive to the precise positive threshold value, which confirms the rationale at the end of Section 6.2.

The algorithm's output was compared with the ground truth, i.e. which fragments really were inadequate for the given data case. This ground truth can be obtained from the complete case, which was known to us. Given the inadequate CPTs that were present for a given case, we recorded the percentage of CPTs that the algorithm could detect and the percentage of detected inadequacies that turned out to be false positives.

We applied the algorithm to networks with varying branching factors (but the same general structure). The percentages are plotted in Figure 3. For Figure 4 we varied the number of states per network variable. Figure 3 confirms the analysis that for any value of p_re > 0.5, higher branching factors increase the algorithm's effectiveness.

[Figure 3: The effect of branching factors on a network with 4-state nodes, for different values of p_re: 0.9 (dash-dotted), 0.7 (solid), 0.5 (dashed). Top curves show the percentage found, bottom curves show the percentage of false positives.]

[Figure 4: The effect of the number of node states on a network with branching factor 5, for different values of p_re (horizontal axis). Number of states: 2 (dashed), 3 (solid), 4 (dash-dotted). Top curves show the percentage found, bottom curves show the percentage of false positives. The dotted line shows the worst-case scenario for 3 and 4 states.]

Figure 4 also shows that the algorithm performs better on networks with more node states, which can be explained by the fact that in such cases inadequate sample values are spread over more states. For example, suppose that in a certain situation a node is in state 1, but an inadequate fragment has caused a higher belief in a different state. If a node has more states, inaccurate classifications will be spread among more alternatives. Thus, on average, the difference between the counter of the correct state and the other counters increases, making the correct state still stand out. For example, given \mathcal{N} = {3, 2, 0}, state 1 would have a consistency measure of 1, while for \mathcal{N} = {3, 1, 1} it would be 2. Note that the degree of this spread also influences the quality, as can be seen from the dotted line in Figure 4. This line shows the effectiveness if we enforce only one alternative state (i.e. if a fragment is inadequate it will always cause the same inaccurate state), which on average decreases the consistency measure. This worst-case scenario is equivalent to localization in binary BNs. We expect real networks to lie somewhere between this worst and best case scenario.

8.2 Real World Experiment

Next, we tested the algorithm on a real network, namely a subtree of the Munin medical diagnosis network [1] (see Figure 5 for the subtree structure). This tree BN is a significant simplification of the problem domain. It was constructed by first manually setting the (simple) network structure and then using the EM algorithm [6] to learn the parameters from a data set sampled from the complete Munin network. Obviously, when we attempt to classify cases using this simple BN, misclassifications will occur. The question is whether our algorithm can detect these misclassifications and localize their causes.

[Figure 5: Network structure used for the experiment with the real network.]

We applied the algorithm to the tree BN for a set of sample cases generated by the complete network. Since the state of all (hidden) variables in all cases was known, we knew which CPTs were inadequate. On the tree network, the algorithm found 75.7% of all inadequate CPTs, while producing 20.9% false positives, which confirms that the algorithm can be effective in a real-world setting.

9 Applications

In Section 4.1 we identified three types of causes for classification faults. The presented approach to localization can be used to detect inadequate CPTs and mitigate their impact.

9.1 Localizing Faulty Model Components

The localization algorithm can discover faults of type 2, where a CPT does not accurately capture the general tendencies in the modeled domain (see (9)). By applying the localization algorithm to many different samples obtained in different situations, we can localize CPTs which are found to be inadequate in the majority of the samples. Such CPTs represent modeling errors. These errors cannot be avoided if the model is used in changing domains and the learning examples or expertise used for the generation of the model do not capture the characteristics of the new domain. Fault localization can be especially useful in domains which change sufficiently slowly, allowing us to discover local inadequacies and adapt the model gradually to the new domain.


9.2 Deactivating Inadequate Model Components

In the case of faults of type 1, we can use the localization algorithm (Algorithm 2) to localize CPTs that are inadequate in a particular situation corresponding to a certain set of observations. A CPT considered inadequate can be set to a uniform distribution, which effectively deactivates the fragment connected to the rest of the network via this CPT. Since a fragment related to the rest of the network via an inadequate CPT does not support accurate classification in the given situation, its deactivation at runtime can improve the overall inference accuracy. In principle, by deactivating an inadequate CPT, the divergence between the estimated distribution over the hypothesis variable and the true point-mass distribution can be reduced. This is useful if the classification considers decision thresholds that are greater than 0.5. If for a given observation set the estimated distribution does not approach the true point-mass distribution sufficiently closely, then the case cannot be classified. By deactivating a fragment, the percentage of such cases can be reduced without any loss of performance.
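Deactivation itself is a one-line operation; a minimal sketch, with the dictionary layout as our assumption:

```python
# A sketch of fragment deactivation: replacing a CPT diagnosed as inadequate with a
# uniform distribution, so the attached fragment no longer moves the root posterior.
def deactivate(cpt):
    """Return a copy of `cpt` with every row uniform.
    cpt[parent][child] = p(child | parent)."""
    return {parent: {child: 1.0 / len(row) for child in row}
            for parent, row in cpt.items()}

cpt = {"h1": {"b1": 0.7, "b2": 0.2, "b3": 0.1},   # Table 1
       "h2": {"b1": 0.4, "b2": 0.3, "b3": 0.3}}
uniform = deactivate(cpt)    # every entry 1/3: p(b|h) is equal for all h, so the
                             # corresponding factor is constant over the states of H
```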

Since the fault localization algorithm can fail, occasionally adequate CPTs could be considered inadequate, which can reduce the classification quality. However, by considering the properties of the localization algorithm, we can show that it is more likely to encounter cases (i.e. sets of observations) for which the classification quality improves. This is especially the case if the fragments rooted in the hypothesis node have identical topologies and CPTs, which corresponds to models of conditionally independent processes of the same type running in parallel.

Models that support improved classification through fragment deactivation are relevant for a significant class of applications where the states of hidden variables are inferred through the interpretation (i.e. fusion) of information obtained from large numbers of different sources, such as sensors. As was shown in [13], such fusion can be based on BNs where each sensor is associated with a conditionally independent fragment given the monitored phenomenon.

The improvement of the estimation through the deactivation of fragments is illustrated with the help of an experiment. We used a BN with a tree topology, branching factor 5 and 4 levels of nodes, corresponding to 125 leaf nodes. The CPTs at every level were identical, such that p_re = 0.75. This network was used for data generation through sampling. The sampled data sets, consisting of 5000 cases, were fed to two classifiers, both based on a BN identical to the generative model. For one classifier we used fault localization and deactivated inadequate fragments. Compared to the classifier using the unaltered BN, the average posterior probability of the true hypothesis was significantly higher (0.81 instead of 0.76). Furthermore, the divergence between the estimated and the true distribution over the hypothesis variable was reduced for 67% of the data cases. Finally, of the cases that were misclassified by the unaltered BN, 11% were correctly classified after the deactivation of inadequate CPTs. In contrast, a correct classification became a misclassification after deactivation in only 2% of the cases.

Furthermore, we assume that a sensor failure is a rare event. Consequently, if a sensor is broken, the CPT relating the monitored phenomenon and the fragment corresponding to the sensor is inadequate. If a few of the existing sensors are broken, we can localize the corresponding CPTs and mitigate the impact of the broken sensors by deactivating the corresponding network fragments.

10 Discussion

10.1 Non-Tree Networks

The analyses in the sections above are based on tree-structured DAGs. For example, p_re denotes the probability that a single CPT, corresponding to a single edge in the graph, is accurate. Obviously, real domains are often not represented by pure trees. It is possible, however, to convert an arbitrary DAG to a tree structure by compounding multiple nodes into hyper nodes and marginalizing out certain nodes. The states of the hyper nodes are the cartesian products of the states of the original nodes. Note that this will increase the size of the CPTs, and thus the assumption of p_re > 0.5 becomes more difficult to justify.

We give a simple example to illustrate this claim. Suppose the structure of our model is the DAG shown in Figure 6(a), where all leaf nodes are the evidence nodes. Given node A, nodes C and D are not independent and therefore must be part of the same fragment, together with E. Since we want fragments to consist of only one node, C and D either have to be compounded or marginalized out. To avoid unnecessarily large CPTs, we choose marginalization. Now all the fragments to the left of A consist of only one node. On the right side of the DAG, given H, F is an independent fragment. The fragment consisting of G, J and L cannot be split into multiple fragments given H, and it is not tree-structured. It can be converted to a tree by compounding J and L and marginalizing out G. The resulting DAG is shown in Figure 6(b). Note that the structure of this DAG is equal to that of the running example (see Figure 1).

[Figure 6: Example network structures: (a) original DAG, (b) tree DAG.]

10.2 Related Work

Several authors have addressed the problems of reliable inference and modeling robustness. Sensitivity-based approaches focus on the determination of modeling components that have a significant impact on the inference process [3, 2]. We must take special care of such components, since any modeling faults in them will have a great impact as well. Sensitivity analysis is carried out prior to operation and deals with the accuracy of the generalizations.

Another class of approaches, including ours, focuses on the determination of the model quality or performance in a given situation at runtime. The central idea of our approach is the observation of the consistency of the model's runtime reinforcements, which is different from common approaches to the runtime analysis of BNs, such as data conflict [8] and straw models [10, 9]. The data conflict approach is based on the assumption that, given an adequate model, all observations should be correlated and p(e_1, ..., e_n) > p(e_1) · · · p(e_n). If this inequality is not satisfied, then this is an indication that the model does not 'fit' the current set of observations [7]. A generalization of this method, [9], is based on the use of straw models. Simpler (straw) models are constructed through partial marginalization; in a coherent situation, the evidence should be less probable under the straw model than under the original model. Situations in which the evidence observations are very unlikely under the original model and more probable under the straw model indicate a data conflict. While these approaches can handle more general BNs than our method, their disadvantage is that the conflict scores are difficult to interpret: at which score should an action be undertaken, and what is the probability that a positive score indicates an error?

Another approach, proposed in [5], is the surprise index of an evidence set. This index is defined as the joint probability of the evidence set plus the sum of the probabilities of all possible evidence sets that are less probable. If the index has a value below a certain threshold ([5] proposes a threshold of 0.1 or lower), the evidence set is deemed surprising, indicating a possibly erroneous model. Clearly, this approach requires computing the probabilities of an exponentially large number of possible evidence sets, making it intractable for most models.

In addition, most of the common approaches to model checking focus on the net performance of models and do not directly support the detection of inaccurate parts of a model [16]. An exception is the approach of [4], based on logarithmic penalty scoring rules. However, in this case the scores can be determined only for the nodes corresponding to observable events, while we reason about the nodes modeling hidden events.

10.3 Conclusion

We have presented an approach to fault detection in BNs that are used for classification. This was done through the following steps:

1. We identified a partitioning of a BN such that each fragment has an independent influence on a classification node.

2. We identified three different fault causes which can be present in a CPT, and argued that 0.5 is a plausible lower bound on the probability that a CPT is adequate in a given case.

3. We presented a coarse view on the inference process and showed how faults can be propagated through different network fragments.

4. We introduced a measure to monitor the consistency among the influences of multiple network fragments on a node, and showed that we can find thresholds on this measure such that we can deduce the probability of a fault existing in a fragment.

5. We presented an algorithm that can combine the consistency measures at different nodes in the network in order to determine whether the fragment between the nodes contains a fault.

One might question the assumption about large branching factors. However, there exist applications where this assumption holds, as for example Distributed Perception Networks [12], which deal with hundreds or thousands of observations, each corresponding to a fragment in a BN.

We showed that the results of fault localization can be used in several ways, such as the localization of erroneous modeling parameters, faulty information sources, and modeling components that do not support accurate inference in a particular situation due to rare cases. Furthermore, we established a lower bound on the algorithm's effectiveness, which we showed to converge asymptotically to 1 for network topologies with increasing branching factors.

References

[1] S. Andreassen, F. V. Jensen, S. K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A. R. Sørensen, A. Rosenfalck, and F. Jensen. MUNIN — an expert EMG assistant. In Computer-Aided Electromyography and Expert Systems, chapter 21. Elsevier Science Publishers, Amsterdam, 1989.

[2] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 27:412–423, 1997.

[3] V. M. H. Coupe and L. C. van der Gaag. Practicable sensitivity analysis of Bayesian belief networks. In Joint Session of the 6th Prague Symposium of Asymptotic Statistics and the 13th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, pages 81–86, Prague, 1998.

[4] R. G. Cowell, A. P. Dawid, and D. J. Spiegelhalter. Sequential model criticism in probabilistic expert systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(3):209–219, 1993.

[5] J. D. F. Habbema. Models for diagnosis and detection of combinations of diseases. In F. de Dombal et al., editors, Proc. IFIP Conf. on Decision Making and Medical Care, pages 399–411, 1976.

[6] D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.

[7] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag, New York, 2001.

[8] F. V. Jensen, B. Chamberlain, T. Nordahl, and F. Jensen. Analysis in HUGIN of data conflict. In Proc. Sixth International Conference on Uncertainty in Artificial Intelligence, pages 519–528, 1990.

[9] Y.-G. Kim and M. Valtorta. On the detection of conflicts in diagnostic Bayesian networks using abstraction. In Proc. Eleventh International Conference on Uncertainty in Artificial Intelligence, pages 362–367, 1995.

[10] K. Laskey. Conflict and surprise: Heuristics for model revision. In Proc. Seventh International Conference on Uncertainty in Artificial Intelligence, pages 197–204, 1991.

[11] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. Available from http://www.inference.phy.cam.ac.uk/mackay/itila/.

[12] G. Pavlin, M. Maris, and J. Nunnink. An agent-based approach to distributed data and information fusion. In Proc. IEEE/WIC/ACM Joint Conference on Intelligent Agent Technology, pages 466–470, 2004.

[13] G. Pavlin and J. Nunnink. Inference meta models: A new perspective on inference with Bayesian networks. Technical Report IAS-UVA-06-01, Informatics Institute, University of Amsterdam, The Netherlands, 2006.

[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[15] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[16] L. C. van der Gaag and S. Renooij. Evaluation scores for probabilistic networks. In Proceedings of the 13th Belgium-Netherlands Conference on Artificial Intelligence, pages 109–116, 2001.
