
Journal of Classification 30:428-452 (2013)

Model Similarity and Rank-Order Based Classification of Bayesian Networks

Sung-Ho Kim

Korea Advanced Institute of Science and Technology, South Korea

Geonyoup Noh

Korea Advanced Institute of Science and Technology, South Korea

Abstract: Suppose that we rank-order the conditional probabilities for a group of subjects that are provided from a Bayesian network (BN) model of binary variables. The conditional probability is the probability that a subject has a certain attribute given an outcome of some other variables, and the classification is based on the rank-order. Under the condition that the class sizes are equal across the class levels and that all the variables in the model are positively associated with each other, we compared the classification results between models of binary variables which share the same model structure. In the comparison, we used a BN model, called a similar BN model, which was constructed under some rule based on a set of BN models satisfying certain conditions. Simulation results indicate that the agreement level of the classification between a set of BN models and their corresponding similar BN model is considerably high, with exact agreement for about half of the subjects or more and agreement up to one-class-level difference for about 90% or more.

Keywords: Agreement level; Bayesian network; Conditional probability; Model similarity; Positive association.

The authors are grateful to the editor and the three referees for their careful reading of the paper and for their insightful comments and suggestions on it. Our special thanks go to a referee whose suggestion led us to derive the two theorems in Section 3. This work was supported by a grant from the Korea Research Foundation (KRF-2010-0023739).

Authors' Addresses: S-H. Kim, Department of Mathematical Sciences, KAIST, Daejeon, 305-701, S. Korea, e-mail: [email protected]; G. Noh, Department of Mathematical Sciences, KAIST, Daejeon, 305-701, S. Korea, e-mail: [email protected].

Published online: 20 October 2013

DOI: 10.1007/s00357-013-9140-9

1. Introduction and Problem

Consider a problem of rank-ordering, or arranging in the order of rank, a group of subjects according to the conditional probability that a subject possesses a certain attribute given the outcome of a set of covariates. If we need to classify them into one of L classes according to the rank-order, it becomes a classification problem. This is an example of discretizing continuous values by rank-ordering, which takes place in a variety of survey and performance assessments in medicine, education, sports, finance, etc. We often express our opinions or feelings using rank-order based categorized terms such as 'very little,' 'little,' 'medium,' 'much.'

The rank-orders may be discrepant between raters due to, among other things, different standards of thresholds. We will consider a rank-order based classification problem using the probability values which are produced from a Bayesian network (BN) model (Pearl 1988; Jensen 1996). The BN model is one of the most popular decision aid tools for nominal or finitely discrete random variables.

Literature abounds on agreement measures between raters or classifiers (Cohen 1960; Spitzer, Cohen, Fleiss, and Endicott 1967; Fleiss 1981; Messatfa 1992; Pires and Branco 1997; Albatineh, Niewiadomska-Bugaj, and Mihalko 2006; Brusco and Steinley 2008), where the agreement measure accounts for the level of agreement relative to the level of agreement by chance. The value of the measure is between 0 and 1, with 1 indicating a perfect agreement between raters. Agreement measures of this kind are very useful when we compare raters in a relative sense. In this paper, we are interested in a more detailed description of agreement, such as the proportions of perfect agreement, of agreement up to one-class-level difference, etc.

BNs are useful in representing graphically the relationships among variables which are interpretable in generic terms as causal. In this paper, we will consider BN models of binary variables, each taking on 0 or 1. When all the variables in the model are categorical with a finite number of category levels, we can compute conditional probabilities of a set of variables given another set of variables in the model by applying the method called evidence propagation (Lauritzen and Spiegelhalter 1988). In this method, we compute conditional probabilities in a certain order of random variables which is determined by the model structure of the BN model, starting from a set of conditioning variables toward each of the variables of interest. At each step of the computing process, we compute the conditional probability of a set of variables given another set of variables in accordance with the given order of random variables.

Suppose that we rank-order a group of subjects according to the conditional probabilities that are provided from a BN model. The conditional probability is given in the form of P(U = 1 | X = x), where X is a random vector. If we classify the subjects in such a way that the class sizes are equal across the classes, the classification is analogous to assigning rank-order based grade scores to students. If two BN models provide the conditional probabilities for the same group of subjects and the rank-orders of the probability values are similar to each other between the two models, then the agreement level between the two classification results will be very high. We will elaborate on the agreement level between BN models which satisfy some conditions, as described in Sections 4 and 5.

We will introduce a notion of similarity between BN models whose model structures are the same. The only difference between them is in the marginal probability or the conditional probability of a variable, X say, given the variables at its parent nodes, i.e., the nodes at the arrow tails of X, in the BN (see Figure 1). Assuming that the variables in a model are positively associated (Holland and Rosenbaum 1986; Junker and Ellis 1997) with each other, we will investigate agreement levels in classification between a pair of BN models, where one model is a similar model of the other, as will be explained in detail in Sections 3 and 4. The investigation will be carried out by defining a similarity measure between BN models and suggesting, under some conditions, a construction method of a best similar model with regard to the similarity measure. As for a BN model, the positive association between variables can be interpreted as a positively causal relationship. In cognitive science, we assume that knowledge states and task performances are positively causally related (Mislevy 1994), and we can find many such examples in the biological, medical, and behavioral sciences, among others.

When a BN model contains unobservable variables, we usually use an EM method (Dempster, Laird, and Rubin 1977) for estimating the conditional probabilities of the variables in the BN model. If many variables are involved in the BN model, then the parameter estimation is time-consuming and the estimates tend to be contaminated with larger errors than when the model is of a moderate size (Kim 2002). In this respect, when the BN model is of a large size and involves unobservable binary variables and the classification is given in terms of rank-order based class levels of probabilities, the similar BN model could be recommended as a reasonable alternative. The similar model would also be a reasonable alternative within some limit of classification tolerance even when no data are observed or when classification is to be made on a real-time basis, provided the family of distributions for the data is known. This motivates the work of this paper.

This paper is organized in six sections. Section 2 gives a brief review of positive association among binary variables. Section 3 then introduces the notion of similarity between BN models and proposes a BN model which is most similar to a given BN model under some condition. In Section 4, we describe a method of computing agreement levels of classifications between a pair of BN models, and results of a simulation experiment are then presented. In Section 5, the simulation experiment is extended to the case where the comparison is made between a set of BN models and a BN model which satisfy certain conditions. Section 6 concludes the paper with some summarizing remarks.

Figure 1. A BN where U variables are unobservable and X variables observable.

2. A Brief Review on Positive Association

Intuitively speaking, we say that two variables are positively associated when an increment in one of the variables brings about an increment in the other variable with a higher probability than a decrement. This phenomenon is commonplace in a variety of research fields, including education and medicine. If a student knows more than another in the subject domain of a test, then he or she has a higher probability of a perfect score than the latter. If an employee of a company is healthier than another employee in every aspect of biological measurements, he or she has a higher probability of survival at the age of 80 than the latter. We can think of many other examples of this kind where each variable is positively associated with the others.

In this section, we will review the notion of positive association along with some basic results that are instrumental to it.

Let $U$ and $X$ be binary variables, taking on the values 0 or 1. If

$$0 < P(U = 1) < 1 \quad\text{and}\quad 0 < P(X = 1) < 1,$$

then the following two inequalities are equivalent (Kim 2005):

$$P(X = 1 \mid U = 0) < P(X = 1 \mid U = 1) \tag{1}$$

$$P(U = 1 \mid X = 0) < P(U = 1 \mid X = 1). \tag{2}$$


Expression (1) can be re-expressed as

$$\frac{P(X = 1 \mid U = 1)\, P(X = 0 \mid U = 0)}{P(X = 0 \mid U = 1)\, P(X = 1 \mid U = 0)} > 1,$$

which means that $U$ and $X$ are positively associated.

For a pair of $n$-vectors $\mathbf{u}$ and $\mathbf{v}$ of the same length, we write $\mathbf{u} \preceq \mathbf{v}$ when $u_i \le v_i$ for $i = 1, \cdots, n$, and write $\mathbf{u} \prec \mathbf{v}$ if $\mathbf{u} \preceq \mathbf{v}$ and $u_i < v_i$ for at least one $i = 1, \cdots, n$. Inequalities (1) and (2) can be extended, as in inequalities (3) and (4) below respectively, to random vectors $\mathbf{X} = (X_1, \cdots, X_I)$ and $\mathbf{U} = (U_1, \cdots, U_K)$, where all the $X_i$'s and $U_k$'s are binary, taking on 0 or 1. For $i = 1, 2, \cdots, I$,

$$P(X_i = 1 \mid \mathbf{u}) < P(X_i = 1 \mid \mathbf{v}), \quad\text{when } \mathbf{u} \prec \mathbf{v}. \tag{3}$$

For $k = 1, 2, \cdots, K$,

$$P(U_k = 1 \mid \mathbf{x}) < P(U_k = 1 \mid \mathbf{y}), \quad\text{when } \mathbf{x} \prec \mathbf{y}. \tag{4}$$

The strict inequality (<) in inequalities (3) and (4) may be replaced by the non-strict inequality (≤). The equivalence of (3) and (4) is shown in Kim (2005).

Inequality (3) is equivalent to

$$\frac{P(\mathbf{v} \mid X_i = 1)}{P(\mathbf{u} \mid X_i = 1)} > \frac{P(\mathbf{v})}{P(\mathbf{u})} > \frac{P(\mathbf{v} \mid X_i = 0)}{P(\mathbf{u} \mid X_i = 0)}.$$

This inequality says that, for $\mathbf{u}$ and $\mathbf{v}$ with $\mathbf{u} \prec \mathbf{v}$, $\mathbf{U} = \mathbf{v}$ is more likely relative to $\mathbf{U} = \mathbf{u}$ when $X_i = 1$ than when $X_i = 0$. Furthermore, we can compare the likeliness of $U_k = 1$ for individual $U_k$'s as in (4).

When the $<$ and $\prec$ in expression (3) are replaced by $\le$ and $\preceq$, respectively, we have

$$P(X_i = 1 \mid \mathbf{u}) \le P(X_i = 1 \mid \mathbf{v}), \quad\text{when } \mathbf{u} \preceq \mathbf{v}. \tag{5}$$

This expression actually means that the conditional distribution of $X_i$ given $\mathbf{U} = \mathbf{v}$ is stochastically larger than that of $X_i$ given $\mathbf{U} = \mathbf{u}$. Holland and Rosenbaum (1986) discuss properties concerning positive association among the X variables when the X variables are conditionally stochastically ordered given $\mathbf{U}$. Junker and Ellis (1997) characterize such X variables in more generic terms such as conditional association and vanishing conditional dependence. We will call expression (5) the condition of positive association between the sets $\{X_1, \cdots, X_I\}$ and $\{U_1, \cdots, U_K\}$.
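As a quick numerical illustration (ours, not from the paper), the following Python sketch draws random joint distributions of two binary variables and checks that inequalities (1) and (2) indeed hold or fail together:

```python
# Sketch: check numerically that (1) and (2) are equivalent for binary U, X.
import random

random.seed(0)
for _ in range(1000):
    # random joint table p[u][x], u, x in {0, 1}
    w = [random.random() for _ in range(4)]
    s = sum(w)
    p = [[w[0] / s, w[1] / s], [w[2] / s, w[3] / s]]

    p_u = [p[0][0] + p[0][1], p[1][0] + p[1][1]]  # P(U = u)
    p_x = [p[0][0] + p[1][0], p[0][1] + p[1][1]]  # P(X = x)

    ineq1 = p[0][1] / p_u[0] < p[1][1] / p_u[1]   # P(X=1|U=0) < P(X=1|U=1)
    ineq2 = p[1][0] / p_x[0] < p[1][1] / p_x[1]   # P(U=1|X=0) < P(U=1|X=1)
    assert ineq1 == ineq2
```

Both inequalities reduce to the same odds-ratio condition $P(0,0)P(1,1) > P(0,1)P(1,0)$, which is why the assertion never fails.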

3. Similarity Between BN Models

Let a graph be denoted by $G = (V, E)$, where $V$ and $E$ are, respectively, the set of nodes and the set of arrows between nodes. If two nodes $u$ and $v$ are connected by an arrow from $u$ to $v$, the arrow is represented by $(u, v)$. So, $(v, u)$ cannot be in $E$ if $(u, v) \in E$. Such a graph is called a directed acyclic graph (DAG) when there is no sequence of arrows $(u_i, u_{i+1})$, $i = 1, 2, \cdots, n$, such that $u_1 = u_{n+1}$. The model structure of a BN model is a DAG. As for a DAG $G$, we define a set $pa(v)$ for a node $v$ as $pa(v) = \{u;\ (u, v) \in E\}$, and a set $fa(v)$ as $fa(v) = pa(v) \cup \{v\}$.

Consider two BN models which are the same in model structure but not necessarily in distribution. For two BN models, $BN_1$ and $BN_2$, of $X_V$, assume that they have the same model structure, say $G$, and denote their probability distributions by $D_1$ and $D_2$, respectively. In other words, the BN models $BN_m$, $m = 1, 2$, are defined as $BN_m = (G, D_m)$. The probability function $P^m$ of $D_m$, $m = 1, 2$, is expressed as

$$P^m(x_V) = \prod_{v \in V} P^m_{v|pa(v)}(x_v \mid x_{pa(v)}).$$

We can compare the two BN models, $BN_1$ and $BN_2$, by using the Kullback-Leibler (KL) divergence measure, $KL(BN_1, BN_2)$, defined by

$$KL(BN_1, BN_2) = \sum_{x_V} P^1(x_V) \log \frac{P^1(x_V)}{P^2(x_V)},$$

which is non-negative and equals zero if and only if $P^1 = P^2$. As the two joint probabilities $P^1$ and $P^2$ get closer to each other, $KL(BN_1, BN_2)$ decreases toward zero. In this sense, we may use the KL divergence as a measure of similarity via an appropriate transformation of it, such as $\exp(-KL(BN_1, BN_2))$ or $1/(1 + KL(BN_1, BN_2))$, so that its values lie in the interval [0, 1].
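For concreteness, here is a minimal Python sketch (our illustration, not the authors' code; the representation of a model as a node-to-(parents, CPT) dictionary is our assumption) of the factorized joint probability above and of $KL(BN_1, BN_2)$ by brute-force enumeration, which is feasible only for small models:

```python
# Sketch: KL divergence between two BN models of binary variables that
# share the same DAG.  A model maps each node to (parents, cpt), where
# cpt maps a tuple of parent values to P(X_v = 1 | x_pa(v)).
from itertools import product
from math import log

def joint_prob(model, x):
    """P(x_V) as the product of the conditional probabilities."""
    p = 1.0
    for v, (parents, cpt) in model.items():
        p1 = cpt[tuple(x[u] for u in parents)]
        p *= p1 if x[v] == 1 else 1.0 - p1
    return p

def kl_divergence(bn1, bn2):
    """KL(BN1, BN2) by enumerating all joint states."""
    nodes = list(bn1)
    kl = 0.0
    for values in product((0, 1), repeat=len(nodes)):
        x = dict(zip(nodes, values))
        p1, p2 = joint_prob(bn1, x), joint_prob(bn2, x)
        if p1 > 0:
            kl += p1 * log(p1 / p2)
    return kl

# Example: U -> X with two different conditional probability tables.
bn1 = {"U": ((), {(): 0.4}), "X": (("U",), {(0,): 0.2, (1,): 0.7})}
bn2 = {"U": ((), {(): 0.4}), "X": (("U",), {(0,): 0.3, (1,): 0.6})}
print(kl_divergence(bn1, bn2))  # a small positive number
```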

We denote the support of $X_v$ by $\mathcal{X}_v$ and let, for $A \subseteq V$, $\mathcal{X}_A = \times_{v \in A}\, \mathcal{X}_v$, where "×" stands for the Cartesian product. For $X_v$, we denote by $\delta_v(x)$ the number of 1's in the vector $x \in \mathcal{X}_{pa(v)}$. Suppose that the joint probability function $P^1$ is given while $P^2$ is not known, and that we want to find the $P^2$ for which $KL(BN_1, BN_2)$ is minimized under the condition that

$$P^2_{v|pa(v)}(x_v \mid y) = P^2_{v|pa(v)}(x_v \mid z) \quad\text{for all } y, z \in \mathcal{X}_{pa(v)} \text{ such that } \delta_v(y) = \delta_v(z). \tag{6}$$

Under this condition, we will define a function $g_v$ as

$$g_v(x_v, \delta_v(y)) = P^2_{v|pa(v)}(x_v \mid y), \quad y \in \mathcal{X}_{pa(v)}.$$

In the expression of $g_v$, we need the subscript $v$ for $\delta$ since nodes in a BN may have different numbers of parent nodes.


Theorem 1. Suppose that $P^1$ is given for $BN_1$ and that $P^2$ is not known while the $P^2_{v|pa(v)}$'s satisfy condition (6) for $BN_2$. Then $KL(BN_1, BN_2)$ is minimized when, for every $l \in \{0, 1, \cdots, |pa(v)|\}$,

$$g_v(x_v, l) = P^1(X_v = x_v \mid \delta_v(\mathbf{X}) = l), \tag{7}$$

where $P^1(\delta_v(\mathbf{X}) = l)$ is the probability that $\delta_v(\mathbf{X}) = l$ under the probability model $P^1$.

Proof: We may consider the more general case that $\mathcal{X}_v = \{0, 1, 2, \cdots, I_v\}$ in the proof, although all the variables are assumed binary in the paper.

$$
\begin{aligned}
KL(BN_1, BN_2)
&= \sum_{x_V \in \mathcal{X}_V} P^1(x_V) \log \frac{P^1(x_V)}{P^2(x_V)} \\
&= \sum_{x_V \in \mathcal{X}_V} P^1(x_V) \sum_{v \in V} \log \frac{P^1_{v|pa(v)}(x_v \mid x_{pa(v)})}{g_v(x_v, \delta_v(x))} \\
&= \sum_{v} \sum_{x_V} P^1(x_V) \log \frac{P^1_{v|pa(v)}(x_v \mid x_{pa(v)})}{g_v(x_v, \delta_v(x))} \\
&= \sum_{v} \sum_{x_{fa(v)}} P^1(x_{fa(v)}) \log \frac{P^1_{v|pa(v)}(x_v \mid x_{pa(v)})}{g_v(x_v, \delta_v(x))} \\
&= \sum_{v} \sum_{k=0}^{|pa(v)|} \;\sum_{x_{fa(v)};\ \delta_v(x_{pa(v)}) = k} P^1(x_{fa(v)}) \log \frac{P^1_{v|pa(v)}(x_v \mid x_{pa(v)})}{g_v(x_v, k)}.
\end{aligned}
$$

Differentiating $KL(BN_1, BN_2)$ with respect to $g_v(x_v, l)$ for $x_v$ and $l$, $x_v \in \mathcal{X}_v \setminus \{I_v\}$ and $0 \le l \le |pa(v)|$, yields

$$\frac{d\, KL(BN_1, BN_2)}{d\, g_v(x_v, l)} = \sum_{x_{pa(v)};\ \delta_v(x_{pa(v)}) = l} \left[ -\frac{P^1(x_v, x_{pa(v)})}{g_v(x_v, l)} + \frac{P^1(I_v, x_{pa(v)})}{g_v(I_v, l)} \right]. \tag{8}$$

After a little algebra on $d\, KL(BN_1, BN_2)/d\, g_v(x_v, l) = 0$, we obtain its solution, $\tilde g_v(x_v, l)$, as

$$\tilde g_v(x_v, l) = P^1(X_v = x_v \mid \delta_v(\mathbf{X}) = l).$$

We can easily check, from (8), that $KL(BN_1, BN_2)$ is minimized when $g_v(x_v, l) = \tilde g_v(x_v, l)$, $l \in \{0, 1, \cdots, |pa(v)|\}$. □
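The minimizer (7) is straightforward to compute once the joint distribution over $fa(v)$ is available; a small sketch (ours, under the binary setting) follows:

```python
# Sketch: the Theorem 1 minimizer g_v(x_v, l) = P1(X_v = x_v | delta_v = l),
# computed from a joint table over fa(v) = {v} U pa(v).
from itertools import product

def theorem1_g(p_fa, n_parents):
    """p_fa maps (x_v, x_pa) to P1(x_v, x_pa); returns {(x_v, l): g value}."""
    num, den = {}, {}
    for x_v in (0, 1):
        for x_pa in product((0, 1), repeat=n_parents):
            l, p = sum(x_pa), p_fa[(x_v, x_pa)]
            num[(x_v, l)] = num.get((x_v, l), 0.0) + p  # P1(X_v = x_v, delta = l)
            den[l] = den.get(l, 0.0) + p                # P1(delta = l)
    return {(x_v, l): num[(x_v, l)] / den[l] for (x_v, l) in num}
```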


We can think of a probability distribution for $P^1_{v|pa(v)}(1|y)$, $y \in \mathcal{X}_{pa(v)}$, regarding $P^1_{v|pa(v)}(1|y)$ itself as a random variable, under the conditions that the positive association condition (5) is satisfied among $X_v$, $v \in V$, and that the random variables $P^1_{v|pa(v)}$ are independent for different $v$'s in $V$. We will call the latter condition the variation-independence condition. We will denote the joint distribution of $\{P^1_{v|pa(v)}(1|y),\ y \in \mathcal{X}_{pa(v)}\}$ by $\pi^v_1$, where the $P^1_{v|pa(v)}$'s satisfy the positive association condition (5), and let $\pi_1 = \{\pi^v_1, v \in V\}$. Under the condition (5),

$$P^1_{v|pa(v)}(1 \mid y) \le P^1_{v|pa(v)}(1 \mid z) \quad\text{if } y \preceq z. \tag{9}$$

And so, the joint probability of $(P^1_{v|pa(v)}(1|y),\ y \in \mathcal{X}_{pa(v)})$ is zero when (9) is violated. For each $l$ ($0 \le l \le |pa(v)|$), we assume a condition on $\pi_1$ that all the elements in the set $\Gamma_v(l) = \{P^1_{v|pa(v)}(x_v|y) : y \in \mathcal{X}_{pa(v)} \text{ such that } \delta_v(y) = l\}$ are exchangeable (de Finetti 1972), i.e., all the subsets of $\Gamma_v(l)$ of the same size have the same distribution. We will call this condition the exchangeability condition.

Example 1. We will show a simple example of $\pi^v_1$ for an $X_v$ for which $|pa(v)| = 2$. In this case, $\mathcal{X}_{pa(v)} = \{(0, 0), (0, 1), (1, 0), (1, 1)\}$. Let $R_1, R_2, R_3, R_4$ be a random sample from the uniform distribution $U(0, 1)$ and $R_{(i)}$ be the $i$th smallest of the sample. Assign the $R_{(i)}$'s to $P^1_{v|pa(v)}$ so that the positive association condition (5) is satisfied, as follows:

$$P^1_{v|pa(v)}(1 \mid (0,0)) = R_{(1)}, \quad P^1_{v|pa(v)}(1 \mid (0,1)) = R_{(2)}, \quad P^1_{v|pa(v)}(1 \mid (1,0)) = R_{(3)}, \quad P^1_{v|pa(v)}(1 \mid (1,1)) = R_{(4)}$$

or

$$P^1_{v|pa(v)}(1 \mid (0,0)) = R_{(1)}, \quad P^1_{v|pa(v)}(1 \mid (0,1)) = R_{(3)}, \quad P^1_{v|pa(v)}(1 \mid (1,0)) = R_{(2)}, \quad P^1_{v|pa(v)}(1 \mid (1,1)) = R_{(4)}.$$

The two assignments must be made equally likely. The probability distribution of the $P^1_{v|pa(v)}(1|y)$, $y \in \mathcal{X}_{pa(v)}$, is denoted by $\pi^v_1$. As for the distribution of the $R_i$'s, we may consider distributions other than $U(0, 1)$, including Beta distributions.
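The construction in Example 1 amounts to sorting a random sample and assigning it blockwise by the number of 1's in the parent configuration, with random tie-breaking within a block. A Python sketch (our generalization to any $|pa(v)|$; the base distribution $U(0,1)$ can be swapped for a Beta sampler):

```python
# Sketch of the Example 1 construction: order statistics of a sample from
# the base distribution are assigned to parent configurations so that the
# positive association condition (5) holds; incomparable configurations
# with the same number of 1's get their values in a random order.
import random
from itertools import product

def sample_cpt(n_parents, draw=random.random):
    configs = list(product((0, 1), repeat=n_parents))
    values = sorted(draw() for _ in configs)       # R_(1) <= ... <= R_(2^n)
    cpt, i = {}, 0
    for l in range(n_parents + 1):
        group = [c for c in configs if sum(c) == l]
        random.shuffle(group)                      # equally likely assignments
        for c, r in zip(group, values[i:i + len(group)]):
            cpt[c] = r                             # P(X_v = 1 | x_pa = c)
        i += len(group)
    return cpt

random.seed(1)
print(sample_cpt(2))  # one of the two equally likely assignments in Example 1
```

Because the blocks of sorted values follow the δ-levels, any comparable pair of configurations automatically satisfies (9).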

We will call a distribution such as the $U(0, 1)$ in this example a base distribution of $\pi^v_1$. Once its base distribution is given, $\pi^v_1$ is defined in terms of the order statistics of a random sample from the base distribution. The variation-independence condition means that $P^1_{v|pa(v)}$ is independent of $\{P^1_{w|pa(w)}, w \in V \setminus \{v\}\}$ or $\{P^1_w, w \in V \setminus \{v\}\}$. The outcome of $X_v$ is influenced by the value of $X_{pa(v)}$ in a BN model, but the probability $P^1_{v|pa(v)}$ is in general idiosyncratic to the parent-child relationship between $X_v$ and its parents. The variation-independence condition is a reflection of this phenomenon.

We note that the $g_v(x_v, l)$ in the above theorem is obtained for a given set of $P^1_{v|pa(v)}$ values. Suppose now that the $P^1_{v|pa(v)}$ values are not known but $\pi_1$ is known instead. In this case, we need to find the $g_v(x_v, l)$ which minimizes $E_{\pi_1} KL(BN_1, BN_2)$, where $E_{\pi_1}$ denotes expectation under the distribution $\pi_1$. We will write $g^{\pi_1}_v$ for this case instead of $g_v$ for notational distinction.

Theorem 2. Suppose that $P^1$ is not known for $BN_1$ but $\pi_1$ is known instead, and that $P^2$ is not known while the $P^2_{v|pa(v)}$'s satisfy condition (6) for $BN_2$. Then, under the variation-independence and exchangeability conditions, $E_{\pi_1} KL(BN_1, BN_2)$ is minimized when, for every $l \in \{0, 1, \cdots, |pa(v)|\}$,

$$g^{\pi_1}_v(x_v, l) = E_{\pi_1} P^1_{v|pa(v)}(x_v \mid y) \quad \forall\, y \in \mathcal{X}_{pa(v)} \text{ such that } \delta_v(y) = l. \tag{10}$$

Proof: The proof is analogous to that of Theorem 1 and so is given in Appendix B.

Suppose that $|pa(v)| = 3$. Then

$$\mathcal{X}_{pa(v)} = \{\underbrace{(0,0,0)}_{0},\ \underbrace{(0,0,1), (0,1,0), (1,0,0)}_{1},\ \underbrace{(0,1,1), (1,0,1), (1,1,0)}_{2},\ \underbrace{(1,1,1)}_{3}\}, \tag{11}$$

where the elements ($\mathbf{x}$'s) are grouped according to $\delta(\mathbf{x})$. For four vectors $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ in (11), with $\delta(\mathbf{x}_i) = i$, we have that

$$\mathbf{x}_0 \prec \mathbf{x}_i \prec \mathbf{x}_3, \quad i = 1, 2. \tag{12}$$

But not every pair in $\mathcal{X}_{pa(v)}$ is ordered. For instance, (0, 1, 0) and (1, 0, 1) are not ordered. According to the exchangeability condition on $\pi_1$, $P^1_{v|pa(v)}(x_v|(0,0,1))$, $P^1_{v|pa(v)}(x_v|(0,1,0))$, and $P^1_{v|pa(v)}(x_v|(1,0,0))$ all have the same distribution. For any $\mathbf{x}_1$ vector, there always exists at least one $\mathbf{x}_2$ vector such that $\mathbf{x}_1 \prec \mathbf{x}_2$. Thus, under the positive association (5) and exchangeability conditions, it follows that

$$E_{\pi_1} P^1_{v|pa(v)}(1 \mid \mathbf{x}_0) \le E_{\pi_1} P^1_{v|pa(v)}(1 \mid \mathbf{x}_1) \le E_{\pi_1} P^1_{v|pa(v)}(1 \mid \mathbf{x}_2) \le E_{\pi_1} P^1_{v|pa(v)}(1 \mid \mathbf{x}_3).$$

In this respect, when $\pi_1$ is known instead of $P^1$, we take $g^{\pi_1}_v(1, l)$ as $E_{\pi^v_1} \Theta_{(l+1)}$, where $\Theta_{(i)}$ is the $i$th smallest of the random sample $\Theta_1, \cdots, \Theta_{|pa(v)|+1}$ from the distribution $\pi^v_1$ under the condition (6). These $E_{\pi^v_1} \Theta_{(l+1)}$ values are used for $BN_2$ in the simulation experiments in later sections.

In constructing a BN model for $X_V$ under the positive association condition (5) on $P^1_{v|pa(v)}$, we assign probability values as illustrated in Example 1. In the case that $|pa(v)| = 3$, we generate eight random quantities from a base distribution of $\pi^v_1$ and assign them to $P_{v|pa(v)}(1|x)$, $x \in \mathcal{X}_{pa(v)}$, in the order of the elements in (11) from small to large, allowing order distortions between some of the $P_{v|pa(v)}(1|x)$ values for which the $x$'s are not comparable in the sense of $\preceq$.

Suppose we are given a BN model $BN_1 = (G, D_1)$. A BN model for which the $g^{\pi_1}_v$ values are used in place of $P^1_{v|pa(v)}(1|x_{pa(v)})$ will be called the similar BN model of $BN_1$ and be denoted by $\overline{BN}_1$. Since $g^{\pi_1}_v$ minimizes $E_{\pi_1} KL(BN_1, BN_2)$ as shown in Theorem 2, the similar BN model is the BN model whose Kullback-Leibler divergence from $BN_1$ is, on average under the distribution $\pi_1$, the smallest among all the BN models which share the same model structure as $BN_1$. In this sense, it is of interest to compare a BN model with its similar version in the context of their classification performance.

A BN model $BN_1$ itself is not ready for classification computation, since the conditional probability $P^1_{v|pa(v)}(1|x_{pa(v)})$ is still a random variable therein. When the conditional probability is replaced with a list of numeric values corresponding to the values of $x_{pa(v)}$ for every node $v$, we will call the BN model an actual BN model of $BN_1$. We will compare the performance of $\overline{BN}_1$ with actual BN models of $BN_1$ in the following two sections.
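In the dictionary representation used in the sketch of Section 3, constructing the similar BN model then amounts to replacing every CPT entry by the $g$ value for its δ-level (a hypothetical helper of ours, not the authors' code):

```python
# Sketch: build the similar BN model from the graph and the g values;
# P(X_v = 1 | x_pa(v)) depends only on l = number of 1's in x_pa(v).
from itertools import product

def similar_model(structure, g):
    """structure: node -> tuple of parents; g: node -> list where entry l
    holds g(1, l) for l = 0..|pa(v)|.  Returns node -> (parents, cpt)."""
    model = {}
    for v, parents in structure.items():
        cpt = {x: g[v][sum(x)] for x in product((0, 1), repeat=len(parents))}
        model[v] = (parents, cpt)
    return model
```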

4. An Example of Similar BN Models

The agreement level of the classifications between the two types of models, an actual model and the similar model of $BN_1$, is defined as follows. Suppose we compare predictions for subject $j$ with regard to a random variable $X_v$, and denote the class levels from an actual BN model and from the similar BN model of $BN_1$, respectively, by $Y^a_{jv}$ and $Y^s_{jv}$, whose production process is described in detail below. We define $e_{jv} = Y^s_{jv} - Y^a_{jv}$. So, if the class levels range from 1 through $L$, then $-L + 1 \le e_{jv} \le L - 1$. Suppose that there are $N$ subjects for whom classifications are to be made. Then we can define the agreement level up to difference $d$ ($d > 0$) by

$$\alpha^v_d = \sum_{h:\ |h| \le d} r^v_h, \tag{13}$$

where

$$r^v_h = \frac{\text{the number of cases with } e_{jv} = h}{N}.$$

Figure 2. A BN where classifications are made for U variables based on X variables.

We will use $\alpha^v_0$ and $\alpha^v_1$ as the two primary indices of agreement. As for $BN_1$, we will consider the Beta distribution with parameters $a$ and $b$ (denoted by $B(a, b)$) and some of its variations as a base distribution of $\pi^v_1$.

Let $V$ be partitioned into $\{V_1, V_2\}$ and suppose that we are interested in classifying for $X_{V_2}$ given an outcome of $X_{V_1}$. In Figure 2, $X_{V_2} = (U_1, \cdots, U_6)$; we use $U$ for unobservable variables. In the rest of this section, we obtain the $\alpha$ values for $X_v$, $v \in V_2$, by a simulation experiment for a given distribution $D_1$. Denote by $\mathbf{x}_i$ the outcome of $\mathbf{X}$ for the $i$th case of the data and by $x_{iv}$ the $x_v$ component of the vector $\mathbf{x}_i$. For a set $A$, let $\mathbf{x}_{i,A} = (x_{iv}, v \in A)$.

In the simulation, we obtained the $\alpha$ values as follows:

(i) For a given $BN_1 = (G, D_1)$, construct a similar BN model, $\overline{BN}_1$.

(ii) Generate a set of numeric values of the conditional probabilities $\{P^1_{v|pa(v)}(1|x), v \in V\}$ from $\pi^v_1$ and construct an actual BN model using the generated values.

(iii) Generate $N$ vector values $\mathbf{x}_i$, $i = 1, \cdots, N$, of $X_V$ from the actual model.

(iv) For each $v \in V_2$, obtain $Y^a_{iv}$ and $Y^s_{iv}$ for $i = 1, 2, \cdots, N$: $Y^a_{iv} = h$ if the rank of $P(X_{iv} = 1 \mid \mathbf{x}_{i,V_1})$ is larger than $N(h-1)/L$ and not larger than $Nh/L$, where the $P(X_{iv} = 1 \mid \mathbf{x}_{i,V_1})$'s are computed in the actual model and the rank is assigned in the reverse order of the magnitude of $P(X_{iv} = 1 \mid \mathbf{x}_{i,V_1})$, $i = 1, 2, \cdots, N$. The $Y^s_{iv}$'s are obtained analogously, except that the $P(X_{iv} = 1 \mid \mathbf{x}_{i,V_1})$'s are computed in $\overline{BN}_1$ (see the sketch after this list).

(v) For each $v \in V_2$, obtain $r^v_h$, $-L + 1 \le h \le L - 1$.

(vi) Repeat steps (ii)-(v) $B$ times and compute the average of the $B$ $r^v_h$ values for each $v \in V_2$.
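Steps (iv) and (v) can be sketched as follows (our reconstruction in Python; the paper used R with Grappa for the evidence propagation itself, which is not reproduced here, so the probability vectors below are stand-ins):

```python
# Sketch of steps (iv)-(v): equal-sized rank-based class levels from the
# conditional probabilities, and the agreement proportions r_h of (13).
import numpy as np

def class_levels(probs, L):
    """Level h in 1..L if the rank (1 = largest probability) lies in
    (N(h-1)/L, Nh/L]."""
    probs = np.asarray(probs)
    N = len(probs)
    ranks = np.empty(N, dtype=int)
    ranks[np.argsort(-probs)] = np.arange(1, N + 1)
    return np.ceil(ranks * L / N).astype(int)

def agreement(y_actual, y_similar, L, d):
    """r_h for h = -L+1..L-1 and the agreement level alpha_d of (13)."""
    e = y_similar - y_actual
    r = {h: float(np.mean(e == h)) for h in range(-L + 1, L)}
    return r, sum(r[h] for h in r if abs(h) <= d)

rng = np.random.default_rng(0)
p_act, p_sim = rng.random(1000), rng.random(1000)  # stand-ins for P(X_iv=1|x_i,V1)
r, alpha1 = agreement(class_levels(p_act, 5), class_levels(p_sim, 5), L=5, d=1)
```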


It is important to note in the above process that $B$ different actual BN models are constructed, that the values of $X_V$ are generated from each actual model, and that the comparison of class levels between actual and similar models is based on the conditional probabilities $P(X_{iv} = 1 \mid \mathbf{x}_{i,V_1})$, $v \in V_2$, which can be obtained by evidence propagation algorithms such as HUGIN (Andersen, Jensen, Olesen, and Jensen 1989), ERGO (Noetic Systems 1991), and MSBNx (MSBNx 2001). We used the R function "Grappa.r" (http://www.stats.bris.ac.uk/~peter/Grappa/) for the evidence propagation.

We denote by $\bar r^v_h$ the average of the $r^v_h$ values and replace the $r^v_h$ in (13) by $\bar r^v_h$, as in

$$\bar\alpha^v_d = \sum_{h:\ |h| \le d} \bar r^v_h. \tag{14}$$

Since $B$ different sets of numeric values of the conditional probabilities $\{P^1_{v|pa(v)}(1|x), v \in V\}$ are used for actual BN models from $BN_1$, the $\bar\alpha$ values as in (14) can be regarded as an estimate of the mean agreement level of the classifications between the actual BN model and the similar BN model when $\pi^v_1$ is the true distribution of $\{P^1_{v|pa(v)}(1|x), x \in \mathcal{X}_{pa(v)}\}$ for $v \in V$. In the simulation experiment below, we set $N = 1{,}000$ and $B = 100$.

We will take as an example $B(0.2, 0.4)$ for the base distribution of $\pi^v_1$, $v \in V$. When the conditional probabilities $\{P^1_{v|pa(v)}(1|x), v \in V\}$ are given as order statistics from $U(0, 1) = B(1, 1)$, the $P^1_{v|pa(v)}$ values tend to be evenly dispersed over the interval between 0 and 1. But the $P^1_{v|pa(v)}(1|x)$ values are more likely to appear near the end points of the interval when they are from $B(0.2, 0.4)$. In the case of educational testing, if $X_v$ is an item score (1 for a correct answer, 0 otherwise) and $\mathcal{X}_{pa(v)}$ is a set of vectors of the knowledge states (1 for good enough, 0 otherwise) of the knowledge units that are required for the test item, then the $P^1_{v|pa(v)}$ values tend to be small when $\delta(x_{pa(v)}) < |pa(v)|$ and near 1 otherwise (see, for example, Mislevy 1994). In this respect, the Beta distribution is a reasonable choice for $\pi^v_1$, $v \in V$.

The mean of the $j$th smallest among a random sample of size $n$ from a Beta distribution is not available in closed form, so the $g^{\pi_1}_v$ values are computed by a numerical approach. Table 1 lists the $g^{\pi_1}_v$ values for $|pa(v)| \le 6$, which are used for constructing $\overline{BN}_1$ corresponding to $B(0.2, 0.4)$ as the base distribution of $\pi^v_1$, $v \in V$.

Table 2 indicates, for $L = 5$, that on average the classifications from the two types of models, actual and similar, exactly agree for about 57% of the cases (see the column of $\bar\alpha_0$ in the table), and that the average of the six values in the last column is about 0.93.
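The order-statistic means behind Table 1 are straightforward to approximate by Monte Carlo; a sketch (ours, not necessarily the authors' numerical method):

```python
# Sketch: Monte Carlo estimates of g(1, l) = E[Theta_(l+1)], the mean of
# the (l+1)th smallest of a sample of size |pa(v)| + 1 from B(0.2, 0.4).
import numpy as np

def g_values(n_parents, a=0.2, b=0.4, reps=200_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.sort(rng.beta(a, b, size=(reps, n_parents + 1)), axis=1)
    return theta.mean(axis=0)  # entry l estimates g(1, l)

for n in range(3):
    print(n, np.round(g_values(n), 4))
# n = 0 gives the Beta mean 0.2/(0.2 + 0.4) = 0.3333, as in Table 1; the
# n = 1 estimates can be compared with the values 0.1328 and 0.5338 there.
```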


Table 1. The $g^{\pi_1}_v$ values of (10) for $|pa(v)| \le 6$ when the base distribution of $\pi^v_1$ is $B(0.2, 0.4)$

|pa(v)| = 0 : $g^{\pi_1}_v(0)$ = 0.3333
|pa(v)| = 1 : $g^{\pi_1}_v(i)$ = 0.1328 (i = 0), 0.5338 (i = 1)
|pa(v)| = 2 : $g^{\pi_1}_v(i)$ = 0.0299 (i = 0), 0.2783 (i = 1), 0.7567 (i = 2)
|pa(v)| = 3 : $g^{\pi_1}_v(i)$ = 0.0035 (i = 0), 0.0801 (i = 1), 0.5072 (i = 2), 0.9012 (i = 3)
|pa(v)| = 4 : $g^{\pi_1}_v(i)$ = 0.0002 (i = 0), 0.0116 (i = 1), 0.2185 (i = 2), 0.7508 (i = 3), 0.9724 (i = 4)
|pa(v)| = 5 : $g^{\pi_1}_v(i)$ = 0.0000 (i = 0), 0.0010 (i = 1), 0.0537 (i = 2), 0.4580 (i = 3), 0.9100 (i = 4), 0.9940 (i = 5)
|pa(v)| = 6 : $g^{\pi_1}_v(i)$ = 0.0000 (i = 0), 0.0000 (i = 1), 0.0082 (i = 2), 0.1829 (i = 3), 0.7136 (i = 4), 0.9749 (i = 5), 0.9988 (i = 6)


Table 2. $\bar\alpha$ and $\bar r^v_h$ values with L = 5 and B = 50 for the model in Figure 2 when the base distribution of $\pi^v_1$ is $B(0.2, 0.4)$

U_k      r(-4)   r(-3)   r(-2)   r(-1)   α_0     r(1)    r(2)    r(3)    r(4)    α_1
U1       0.000   0.002   0.023   0.186   0.544   0.182   0.050   0.012   0.001   0.912
        (0.000) (0.007) (0.025) (0.056) (0.100) (0.053) (0.050) (0.031) (0.004) (0.083)
U2       0.000   0.001   0.024   0.190   0.565   0.180   0.035   0.006   0.001   0.934
        (0.000) (0.002) (0.026) (0.060) (0.097) (0.051) (0.030) (0.016) (0.003) (0.056)
U3       0.000   0.002   0.017   0.173   0.611   0.166   0.026   0.004   0.000   0.950
        (0.000) (0.008) (0.026) (0.057) (0.138) (0.061) (0.035) (0.009) (0.001) (0.058)
U4       0.001   0.001   0.021   0.165   0.607   0.179   0.026   0.001   0.000   0.951
        (0.004) (0.005) (0.025) (0.059) (0.113) (0.066) (0.037) (0.003) (0.001) (0.054)
U5       0.000   0.002   0.020   0.159   0.624   0.173   0.020   0.002   0.000   0.957
        (0.000) (0.005) (0.022) (0.071) (0.117) (0.058) (0.026) (0.005) (0.000) (0.043)
U6       0.000   0.005   0.045   0.193   0.486   0.180   0.061   0.020   0.009   0.860
        (0.001) (0.011) (0.040) (0.075) (0.126) (0.055) (0.050) (0.048) (0.037) (0.127)
Average  0.000   0.002   0.025   0.178   0.573   0.177   0.036   0.007   0.002   0.927

NOTE: The values in the parentheses are standard deviations.

It is worthwhile to note in Table 1 that the $g^{\pi_1}_v(0)$ values become very small as $|pa(v)|$ increases. If we consider educational testing with multiple-choice tests, these values are unreasonably small, since multiple-choice tests allow guessing in selecting response options. An example of the base distribution of $\pi^v_1$ for this situation is

$$0.85 \cdot B(a, b) + 0.1, \tag{15}$$

which means that guessing shrinks the range of the conditional probabilities from that of $B(a, b)$ to that of $0.85 \cdot B(a, b) + 0.1$.

Given the $g^{\pi_1}_v$ values of (10) corresponding to $B(0.2, 0.4)$ as the base distribution of $\pi^v_1$, we can easily obtain the $g^{\pi_1}_v$ values corresponding to the base distribution $aB(0.2, 0.4) + (0.95 - a)$. In expression (10), $P^1_{v|pa(v)}(1|y)$ is regarded as a random variable from a base distribution of $\pi^v_1$. So if we denote by $g^{\pi_1}_{v,a}$ the $g^{\pi_1}_v$ values corresponding to the base distribution $aB(0.2, 0.4) + (0.95 - a)$, and by $g^*_v$ those corresponding to the base distribution $B(0.2, 0.4)$, then it follows from expression (10) that

$$g^{\pi_1}_{v,a} = a\, g^*_v + (0.95 - a). \tag{16}$$

We can see in Table A.1 in Appendix A that the $\bar\alpha_0$ and $\bar\alpha_1$ values for the six U variables are, when the base distribution of $\pi^v_1$ is given by $0.85B(0.2, 0.4) + 0.1$, on average about 60% and 95%, respectively.

In this section, we described a computation procedure for agreement levels (or $\bar\alpha$ values) between actual and similar BN models under the condition that the $P^1_{v|pa(v)}$'s are from a given distribution. We ran simulation experiments as if we knew the true distribution $\pi^v_1$ for $P^1_{v|pa(v)}$ and computed the agreement levels between the two types of BN models.

We will extend the simulation experiment to a more general version of $\pi^v_1$ in the next section and compare the agreement levels between two situations: in one situation $\pi^v_1$ is known, and in the other it is not known but given as a mixture distribution.

5. Agreement Levels Between Two Types of BN Models

In the preceding section we compared classifications between an actual model of $BN_1$ and its similar model $\overline{BN}_1$ when the base distribution of $\pi^v_1$ is one of $B(0.2, 0.4)$ and $0.85B(0.2, 0.4) + 0.1$. In this section we will consider a mixture distribution for the base distribution of $\pi^v_1$, given by

$$C + (0.95 - C) \times B(A, B), \tag{17}$$

where $A, B, C$ are independent, $A$ and $B$ are Uniform random variables over the interval (0.1, 0.9), and $C$ is Uniform over (0.1, 0.3). We selected the Uniform distribution over the interval (0.1, 0.9) to avoid too small or too large values of $P^1_{v|pa(v)}(1|x_{pa(v)})$.

The mixture distribution is equivalent to the family of distributions

$$\mathcal{F} = \{\, c + (0.95 - c) \times B(a, b);\ a, b \in (0.1, 0.9),\ c \in (0.1, 0.3) \,\}, \tag{18}$$

where each distribution in $\mathcal{F}$ is equally likely. Suppose that, for each node $v$ in a BN model, $\pi^v_1$ has its base distribution in $\mathcal{F}$ and that the distributions may be different across the nodes of the BN model. Also suppose that we have an actual BN model of $BN_1$ with the base distribution of $\pi^v_1$ given by (17). Its similar BN model is constructed, in light of Theorem 2, under the condition that $g^{\pi_1}_v$ is obtained from a $\pi^v_1$ whose base distribution is a Beta distribution, $B^*$ say, over the interval (0.1, 0.95) whose mean and variance are the same as those of the mixture distribution in (17). As for the similar BN model, we will write $\bar\pi^v_1$ instead of $\pi^v_1$; we will use the Beta distribution $B^*$ as the base distribution of $\bar\pi^v_1$. It is important to observe that $P^1_{v|pa(v)}$ and $g^{\bar\pi_1}_v$ are from the same family of distributions, e.g., the family in (18).

Since the base distribution of $\bar\pi^v_1$ is confined to be a Beta distribution over the interval (0.1, 0.95), the base distribution is given in the form

$$0.1 + 0.85\, B(\lambda_1, \lambda_2), \tag{19}$$

and $\lambda_1$ and $\lambda_2$ can be obtained so that the mean and variance of the distribution in (17) equal the mean and variance of the base distribution of $\bar\pi^v_1$, respectively.
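The moment matching for (19) can be done in a few lines; a sketch (our reconstruction) that estimates the mean and variance of (17) by Monte Carlo and solves the Beta moment equations on the interval (0.1, 0.95):

```python
# Sketch: choose lambda1, lambda2 in (19) so that 0.1 + 0.85*B(l1, l2)
# matches the mean and variance of the mixture (17).
import numpy as np

rng = np.random.default_rng(0)
reps = 1_000_000
A = rng.uniform(0.1, 0.9, reps)          # Beta shape a
B = rng.uniform(0.1, 0.9, reps)          # Beta shape b
C = rng.uniform(0.1, 0.3, reps)          # random lower end point
y = C + (0.95 - C) * rng.beta(A, B)      # draws from (17)

mu, var = y.mean(), y.var()

# For Y = 0.1 + 0.85*Z, Z ~ B(l1, l2): E[Y] = 0.1 + 0.85*m and
# Var[Y] = 0.85**2 * m*(1 - m)/(l1 + l2 + 1), with m = l1/(l1 + l2).
m = (mu - 0.1) / 0.85
s = m * (1 - m) / (var / 0.85**2) - 1    # s = l1 + l2
lam1, lam2 = m * s, (1 - m) * s
print(lam1, lam2)
```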


Table 3. $\alpha_0$ and $\alpha_1$ values for the model in Figure 2.

(a) The base distribution of $\pi^v_1$ as given by (17) is used for actual models and the base distribution of $\bar\pi^v_1$ as in (19) for the similar model.

       α_0                                       α_1
       U1    U2    U3    U4    U5    U6          U1    U2    U3    U4    U5    U6
min    0.356 0.339 0.385 0.273 0.329 0.274       0.797 0.752 0.826 0.668 0.774 0.632
Q1     0.496 0.472 0.506 0.565 0.554 0.466       0.909 0.882 0.935 0.944 0.946 0.868
Q2     0.573 0.580 0.600 0.656 0.645 0.560       0.961 0.952 0.976 0.974 0.982 0.945
Q3     0.667 0.680 0.687 0.732 0.724 0.626       0.988 0.984 0.996 0.994 0.992 0.978
max    0.865 0.850 0.864 0.856 0.909 0.829       1.000 1.000 1.000 1.000 1.000 1.000

(b) The base distribution of $\bar\pi^v_1$ as in (19) is used for both the actual and similar models.

       α_0                                       α_1
       U1    U2    U3    U4    U5    U6          U1    U2    U3    U4    U5    U6
min    0.338 0.366 0.416 0.411 0.380 0.263       0.754 0.814 0.821 0.859 0.871 0.603
Q1     0.549 0.510 0.588 0.583 0.571 0.490       0.954 0.942 0.957 0.958 0.953 0.879
Q2     0.630 0.637 0.661 0.670 0.683 0.563       0.976 0.972 0.982 0.988 0.990 0.946
Q3     0.730 0.716 0.734 0.771 0.746 0.655       0.995 0.989 0.998 0.999 0.999 0.987
max    0.851 0.867 0.939 0.912 0.904 0.797       1.000 1.000 1.000 1.000 1.000 1.000

We computed the agreement levels between the two types of BN models, the actual and the similar model, and summarized the levels in terms of three quantities: the first (Q1), the second (Q2), and the third (Q3) quartiles of the $B(= 100)$ $\alpha_i$ values, $i = 0, 1$, which are obtained by iterating 100 times the comparison of the classifications between the two types of models. At each iteration, we construct an actual model of $BN_1$ and use the $g^{\pi_1}_v$'s from the base distribution of $\bar\pi^v_1$ in (19) for the similar model $\overline{BN}_1$. The result is summarized in Table 3(a) for the BN model in Figure 2. The $\alpha$ values in Table 3(a) are displayed as boxplots in Figure 3 as a visual aid. The boxplots for the other tables show essentially the same pattern, except that the $\alpha$ values in panel (a) of the table are a bit smaller than those in panel (b), which are obtained under the condition that $\bar\pi^v_1$ is used for both the actual and similar models.

The mixture distribution for $\pi^v_1$ can be regarded as representing our uncertainty about the true values of the conditional probabilities $P^1_{v|pa(v)}(1|x_{pa(v)})$. In Table 3(b), we summarize the comparison result between the actual and similar models which share the same distribution $\bar\pi^v_1$. In other words, the comparison of the classifications is made between two models for which the $P^1_{v|pa(v)}(1|x_{pa(v)})$'s are obtained from the same distribution, $\bar\pi^v_1$. The comparison thus shows us the agreement level which is affected by the chance fluctuation of the rank-orders between the two BN models.


Figure 3. Boxplots for the $\alpha_0$ and $\alpha_1$ values in panel (a) of Table 3. The boxplot labeled '1A0' is the plot of 100 $\alpha_0$ values for U1 and the boxplot labeled '6A1' is of 100 $\alpha_1$ values for U6. The two vertical lines at 0.5 and 0.9 are for visual aid.

For convenience, we will call the five BN models in Figures 2, A.1, A.2, A.3, and A.4 models 1, 2, 3, 4, and 5, respectively, in the order of their figures, where the last four figures are in Appendix A. The five BN models differ as follows. Model 2 is obtained by adding to model 1 six arrows between U variables and X variables and one arrow between U variables. Model 3 is obtained by removing arrows from model 2 so that U1, U4, and U6 become root nodes, i.e., do not have any parent nodes, and the three sets {U1, U2, U3, U5}, {U4}, and {U6} are marginally independent among themselves. By removing all the arrows between the U variables in model 3, we obtain model 4, in which all the U variables are marginally independent among themselves. All the U variables of model 4 are connected completely in model 5.

The main difference between the first two of the five BN models is that in model 2 each of the X variables has multiple parent nodes, while this is not the case in model 1. As for models 2, 3, and 4, we see more independencies among the U variables in the order of the model label. We considered this variation among the model structures to see whether it affects the agreement level; Figure 4 suggests that it does not. The Q2 values (or medians) of the $\alpha$ values as listed in Tables 3 and A.2 are depicted in the figure. The first graph in each of panels (a) and (b) is of the medians (Q2) of the $\alpha_0$ values and the second of the $\alpha_1$ values. The $\alpha$ values in the panels are obtained under the conditions described in the caption of each panel. We can see in both of the panels that models 4 and 5 are farthest apart on average among all the possible pairs of the models. However, we can check in Tables 3 and A.2 that the mid-50% interval of the $\alpha_i$ ($i = 0, 1$) values from model 4 overlaps the mid-50% interval from model 5, and vice versa, for all of the U variables. This indicates that the model structure, or the marginal independencies among the U variables, may not be very influential on the agreement level between the two types of models, the actual and similar models.

Figure 4. Medians (Q2) of the $\alpha$ values from the five BN models for the six U variables, U1, U2, ..., U6, as listed in Table 3. Panel (a): $\pi^v_1$ for the actual model and $\bar\pi^v_1$ for the similar model. Panel (b): $\bar\pi^v_1$ for both of the actual and similar models. In each panel the first graph plots the medians of the $\alpha_0$ values and the second those of the $\alpha_1$ values against the U variables, with curves labeled M1 through M5 for models 1 through 5.

To sum up, in our experimental study $\pi^v_1$ is based on a mixture distribution, and $\bar\pi^v_1$ on a Beta distribution with the same mean and variance as those of the base distribution of $\pi^v_1$. We considered two situations: in one situation (corresponding to panel (a) of Tables 3 and A.2) the $P^1_{v|pa(v)}$'s for the actual BN model are from $\pi^v_1$, and in the other situation (corresponding to panel (b)) they are from $\bar\pi^v_1$. We examined the empirical distributions of the $\alpha_0$ and $\alpha_1$ values under the two situations for five BN models. The medians of the distributions are displayed in Figure 4, which indicates that the $\alpha$ values in panel (b) are slightly larger than those in panel (a). On average across the six U variables and the five BN models, the $\alpha_0$ values are larger by about 0.04 (= 0.618 − 0.579) and the $\alpha_1$ values by about 0.015 (= 0.968 − 0.953). The boxplots in Figure 3 indicate that the distribution of the $\alpha_0$ values is more or less symmetric about its median, while the distribution of the $\alpha_1$ values is skewed to the left, with a considerably longer tail on that side. In consideration of this variation of distribution shapes, we used the median rather than the mean for the comparison of the $\alpha$ values.

6. Concluding Remarks

The notion of similarity between models is defined under the premise that the models are of the same model structure. In the definition, we assume that the underlying distribution, $D_1$, is known. But, in reality, we have data for observable variables only, and the classification is usually made for unobservable variables. It may be desirable, when a timely service of classification is required and an acceptable development of a BN model is still in progress, that we use some alternative model for the classification which is made within some pre-specified tolerable limits. The simulation experiment strongly suggests, under the positive association condition, that we may use a similar model, $\overline{BN}_1$, for classification, along with the agreement levels which are computed by comparing the two types of models, actual and similar. We considered only $\alpha_0$ (perfect agreement) and $\alpha_1$ (agreement up to one-class-level difference) in this paper. But this is just a matter of tolerance in classification; one could use other $\alpha_i$ values as long as the classification tolerance permits.


A main idea behind the proposed similar-model approach for classification is analogous to the Bayesian approach to statistical inference. We use a $\bar\pi^v_1$ distribution for the similar model, which is obtained under a prior assumption on the conditional probabilities $P^1_{v|pa(v)}(1|x_{pa(v)})$. Suppose that we have a data set of the observable variables, $X_{V_1}$, and that we are not sure of the probability model of $X_V$ except for its model structure, which is given in a BN. Then we assume a set of prior distributions for the $P^1_{v|pa(v)}(1|x_{pa(v)})$'s and regard the data as obtained from a distribution $D_1$ which is given as a product of $P^1_{v|pa(v)}(1|x_{pa(v)})$'s that are considered as random quantities from distributions in $\mathcal{F}$ in expression (18). Instead of doing a time-consuming Bayesian inference for the $P^1_{v|pa(v)}(1|x_{pa(v)})$'s, we propose using a similar model, $\overline{BN}_1$, which is obtained from a given prior on the $P^1_{v|pa(v)}(1|x_{pa(v)})$'s, for classification.

The similar-model approach for classification, along with the $\alpha$ values obtained in the same way as for Table 3(a), can be used as a surrogate classifier of a group of subjects when consistency of classification is required, provided that we can reasonably assume a mixture distribution as a base distribution of $\pi^v_1$, as illustrated in expression (17). The $\alpha$ values as in Table 3(b) may be used as a reference for the range of the $\alpha$ values that we can get when we are sure of the true distribution of $P^1_{v|pa(v)}$. In this sense, these $\alpha$ values may be regarded as the agreement level under the chance fluctuation of the data.

Appendix A: α Values for Other BN Models

Four more results, in addition to the result of Section 5, follow. These results are given below in Tables A.1 and A.2, together with the corresponding Figures A.1 to A.4.

Appendix B: Proof of Theorem 2

As in the proof of Theorem 1, we may consider that $\mathcal{X}_v = \{0, 1, 2, \cdots, I_v\}$ in the proof.

$$
\begin{aligned}
E_{\pi_1} KL(BN_1, BN_2)
&= \sum_{x_V \in \mathcal{X}_V} E_{\pi_1}\!\left( P^1(x_V) \log \frac{P^1(x_V)}{P^2(x_V)} \right) \\
&= \sum_{x_V \in \mathcal{X}_V} E_{\pi_1}\!\left( P^1(x_V) \sum_{v \in V} \log \frac{P^1_{v|pa(v)}(x_v \mid x_{pa(v)})}{g^{\pi_1}_v(x_v, \delta_v(x))} \right) \\
&= \sum_{v} \sum_{k=0}^{|pa(v)|} \;\sum_{x_{fa(v)};\ \delta_v(x_{pa(v)}) = k} E_{\pi_1}\!\left( P^1(x_{fa(v)}) \log \frac{P^1_{v|pa(v)}(x_v \mid x_{pa(v)})}{g^{\pi_1}_v(x_v, k)} \right).
\end{aligned}
$$


Table A.1. $\bar\alpha$ and $\bar r^v_h$ values with L = 5 and B = 50 for the model in Figure 2 when the base distribution of $\pi^v_1$ is $0.85 \cdot B(0.2, 0.4) + 0.1$

U_k      r(-4)   r(-3)   r(-2)   r(-1)   α_0     r(1)    r(2)    r(3)    r(4)    α_1
U1       0.000   0.004   0.029   0.178   0.594   0.176   0.018   0.001   0.000   0.948
        (0.001) (0.008) (0.024) (0.042) (0.110) (0.048) (0.025) (0.003) (0.001) (0.051)
U2       0.000   0.003   0.030   0.172   0.602   0.172   0.020   0.001   0.000   0.947
        (0.000) (0.004) (0.029) (0.040) (0.111) (0.053) (0.028) (0.003) (0.000) (0.057)
U3       0.000   0.003   0.029   0.190   0.564   0.189   0.025   0.000   0.000   0.942
        (0.001) (0.008) (0.029) (0.046) (0.117) (0.062) (0.038) (0.001) (0.000) (0.066)
U4       0.000   0.002   0.022   0.163   0.635   0.168   0.010   0.000   0.000   0.966
        (0.000) (0.004) (0.027) (0.046) (0.121) (0.066) (0.016) (0.000) (0.000) (0.038)
U5       0.000   0.002   0.021   0.162   0.639   0.165   0.011   0.001   0.000   0.966
        (0.002) (0.005) (0.024) (0.053) (0.127) (0.071) (0.015) (0.004) (0.000) (0.034)
U6       0.002   0.007   0.040   0.185   0.550   0.171   0.036   0.007   0.002   0.906
        (0.007) (0.013) (0.040) (0.052) (0.138) (0.059) (0.039) (0.015) (0.006) (0.092)
Average  0.000   0.004   0.029   0.175   0.597   0.173   0.020   0.002   0.000   0.946

NOTE: The values in the parentheses are standard deviations.

Table A.2. $\alpha_0$ and $\alpha_1$ values for the models in Figures A.1, A.2, A.3, and A.4.

For model 2 in Figure A.1:
(a) The base distribution of $\pi^v_1$ given by (17) is used for actual models and the base distribution of $\bar\pi^v_1$ in (19) for the similar model.

       α_0                                       α_1
       U1    U2    U3    U4    U5    U6          U1    U2    U3    U4    U5    U6
min    0.354 0.333 0.372 0.354 0.415 0.328       0.778 0.832 0.761 0.709 0.805 0.756
Q1     0.518 0.518 0.514 0.520 0.516 0.488       0.933 0.929 0.941 0.932 0.952 0.899
Q2     0.566 0.571 0.605 0.617 0.591 0.548       0.961 0.960 0.972 0.974 0.974 0.942
Q3     0.626 0.647 0.690 0.683 0.702 0.619       0.983 0.981 0.994 0.996 0.992 0.974
max    0.827 0.864 0.897 0.881 0.903 0.767       1.000 1.000 1.000 1.000 1.000 1.000

(b) The base distribution of $\bar\pi^v_1$ as in (19) is used for both the actual and the similar models.

min    0.266 0.353 0.392 0.365 0.361 0.236       0.811 0.820 0.674 0.867 0.789 0.550
Q1     0.541 0.553 0.575 0.616 0.570 0.484       0.939 0.941 0.962 0.969 0.973 0.897
Q2     0.628 0.621 0.638 0.670 0.663 0.536       0.974 0.975 0.987 0.988 0.992 0.940
Q3     0.701 0.680 0.744 0.767 0.736 0.593       0.992 0.993 0.998 0.999 0.999 0.971
max    0.811 0.818 0.957 0.907 0.910 0.860       1.000 1.000 1.000 1.000 1.000 1.000

For model 3 in Figure A.2:
(a) $\pi^v_1$ and $\bar\pi^v_1$ are the same as for model 2.

       α_0                                       α_1
       U1    U2    U3    U4    U5    U6          U1    U2    U3    U4    U5    U6
min    0.307 0.244 0.219 0.239 0.252 0.260       0.691 0.690 0.691 0.706 0.663 0.730
Q1     0.455 0.492 0.472 0.542 0.492 0.442       0.868 0.901 0.906 0.918 0.904 0.866
Q2     0.506 0.565 0.554 0.608 0.587 0.501       0.907 0.946 0.952 0.954 0.967 0.909
Q3     0.568 0.644 0.643 0.682 0.710 0.577       0.951 0.979 0.975 0.983 0.996 0.934
max    0.791 0.893 0.847 0.917 0.912 0.764       1.000 1.000 1.000 1.000 1.000 0.996


Table A.2 continued.

For model 3 in Figure A.2 (continued):
(b) The base distribution of $\bar\pi^v_1$ as in (19) is used for both the actual and similar models.

       α_0                                       α_1
       U1    U2    U3    U4    U5    U6          U1    U2    U3    U4    U5    U6
min    0.289 0.250 0.229 0.313 0.287 0.348       0.713 0.738 0.713 0.779 0.699 0.638
Q1     0.472 0.530 0.478 0.578 0.504 0.493       0.896 0.922 0.931 0.946 0.955 0.905
Q2     0.568 0.604 0.570 0.674 0.639 0.553       0.948 0.972 0.969 0.977 0.986 0.933
Q3     0.665 0.676 0.674 0.757 0.738 0.636       0.975 0.990 0.990 0.996 1.000 0.964
max    0.931 0.903 0.888 0.881 0.950 0.781       1.000 1.000 1.000 1.000 1.000 0.999

For model 4 in Figure A.3:
(a) $\pi^v_1$ and $\bar\pi^v_1$ are the same as for model 2.

min    0.154 0.142 0.156 0.156 0.153 0.216       0.563 0.636 0.265 0.289 0.460 0.683
Q1     0.436 0.467 0.411 0.399 0.389 0.420       0.901 0.885 0.812 0.819 0.807 0.883
Q2     0.598 0.558 0.542 0.543 0.554 0.484       0.940 0.930 0.925 0.931 0.922 0.922
Q3     0.709 0.651 0.694 0.698 0.684 0.586       0.977 0.957 0.976 0.976 0.980 0.945
max    0.893 0.939 0.891 0.890 0.895 0.943       1.000 1.000 1.000 1.000 1.000 0.995

(b) The base distribution of $\bar\pi^v_1$ as in (19) is used for both the actual and similar models.

min    0.230 0.276 0.179 0.128 0.188 0.265       0.718 0.705 0.293 0.214 0.400 0.663
Q1     0.504 0.511 0.426 0.434 0.440 0.471       0.917 0.911 0.834 0.829 0.886 0.885
Q2     0.582 0.580 0.550 0.592 0.574 0.536       0.962 0.950 0.933 0.934 0.951 0.927
Q3     0.721 0.675 0.676 0.738 0.745 0.606       0.986 0.975 0.978 0.985 0.985 0.964
max    0.891 0.933 0.899 0.945 0.948 0.887       1.000 1.000 1.000 1.000 1.000 1.000

For model 5 in Figure A.4:
(a) $\pi^v_1$ given by (17) is used for actual models and $\bar\pi^v_1$ in (19) for the similar model.

min    0.260 0.388 0.341 0.280 0.360 0.362       0.670 0.840 0.798 0.699 0.813 0.766
Q1     0.499 0.561 0.525 0.530 0.586 0.546       0.916 0.937 0.942 0.938 0.957 0.936
Q2     0.568 0.614 0.611 0.610 0.657 0.603       0.952 0.969 0.970 0.974 0.976 0.969
Q3     0.643 0.662 0.680 0.686 0.706 0.678       0.981 0.987 0.992 0.990 0.990 0.986
max    0.820 0.816 0.841 0.844 0.883 0.787       1.000 1.000 1.000 1.000 1.000 1.000

(b) The base distribution of $\bar\pi^v_1$ as in (19) is used for both the actual and similar models.

min    0.293 0.426 0.229 0.345 0.314 0.249       0.692 0.792 0.527 0.686 0.768 0.701
Q1     0.502 0.554 0.579 0.628 0.599 0.553       0.926 0.955 0.975 0.976 0.968 0.957
Q2     0.597 0.623 0.702 0.695 0.674 0.633       0.968 0.979 0.990 0.992 0.986 0.980
Q3     0.656 0.694 0.745 0.762 0.743 0.701       0.986 0.993 0.998 0.999 0.998 0.990
max    0.807 0.858 0.851 0.897 0.937 0.842       1.000 1.000 1.000 1.000 1.000 1.000

Differentiating $E_{\pi_1} KL(BN_1, BN_2)$ with respect to $g^{\pi_1}_v(x_v, l)$ for $x_v$ and $l$, $x_v \in \mathcal{X}_v \setminus \{I_v\}$ and $0 \le l \le |pa(v)|$, yields

$$\frac{d\, E_{\pi_1} KL(BN_1, BN_2)}{d\, g^{\pi_1}_v(x_v, l)} = \sum_{x_{pa(v)};\ \delta_v(x_{pa(v)}) = l} \left[ -\frac{E_{\pi_1} P^1(x_v, x_{pa(v)})}{g^{\pi_1}_v(x_v, l)} + \frac{E_{\pi_1} P^1(I_v, x_{pa(v)})}{g^{\pi_1}_v(I_v, l)} \right]. \tag{B.1}$$


Figure A.1. A BN which is obtained by adding dashed arrows to the BN in Figure 2.

Figure A.2. A BN which is obtained by removing some arrows from the BN in Figure A.1.

Figure A.3. A BN which is obtained by removing all the arrows between the U nodes from the BN in Figure A.2.

Figure A.4. A BN which is obtained by connecting all the U nodes in the order of node-labels in the BN in Figure A.2.

Setting the left-hand side of (B.1) equal to zero yields the solution, $\tilde g_v(x_v, l)$, as

$$
\begin{aligned}
\tilde g_v(x_v, l)
&= \frac{\sum_{x_{pa(v)};\ \delta_v(x_{pa(v)}) = l} E_{\pi_1}\!\left[ P^1_{v|pa(v)}(x_v \mid x_{pa(v)})\, P^1_{pa(v)}(x_{pa(v)}) \right]}{E_{\pi_1} P^1(\delta_v(\mathbf{X}_{pa(v)}) = l)} \\
&= \frac{\sum_{x_{pa(v)};\ \delta_v(x_{pa(v)}) = l} E_{\pi_1} P^1_{v|pa(v)}(x_v \mid x_{pa(v)}) \cdot E_{\pi_1} P^1_{pa(v)}(x_{pa(v)})}{E_{\pi_1} P^1(\delta_v(\mathbf{X}_{pa(v)}) = l)} \\
&= E_{\pi_1} P^1_{v|pa(v)}(x_v \mid y) \quad \text{for any } y \in \mathcal{X}_{pa(v)} \text{ such that } \delta_v(y) = l,
\end{aligned}
$$

where the second equality holds by the variation-independence condition and the last equality by the exchangeability condition and the fact that

$$\sum_{x_{pa(v)};\ \delta_v(x_{pa(v)}) = l} E_{\pi_1} P^1_{pa(v)}(x_{pa(v)}) = E_{\pi_1} P^1(\delta_v(\mathbf{X}_{pa(v)}) = l).$$

We can easily check, from (B.1), that $E_{\pi_1} KL(BN_1, BN_2)$ is minimized when $g^{\pi_1}_v(x_v, l) = \tilde g_v(x_v, l)$, $l \in \{0, 1, \cdots, |pa(v)|\}$. □

References

ALBATINEH, A.N., NIEWIADOMSKA-BUGAJ, M., and MIHALKO, D. (2006), "On Similarity Indices and Correction for Chance Agreement," Journal of Classification, 23, 301–313.

ANDERSEN, S.K., JENSEN, F.V., OLESEN, K.G., and JENSEN, F. (1989), HUGIN: A Shell for Building Bayesian Belief Universes for Expert Systems [computer program], Aalborg, Denmark: HUGIN Expert Ltd.

BRUSCO, M.J., and STEINLEY, D. (2008), "A Binary Integer Program to Maximize the Agreement Between Partitions," Journal of Classification, 25, 185–193.

COHEN, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

De FINETTI, B. (1972), Probability, Induction, and Statistics, New York: Wiley.

DEMPSTER, A.P., LAIRD, N.M., and RUBIN, D.B. (1977), "Maximum Likelihood from Incomplete Data Via the EM Algorithm (With Discussion)," Journal of the Royal Statistical Society B, 39, 1–38.

ERGO [computer program] (1991), Baltimore, MD: Noetic Systems Inc.

FLEISS, J.L. (1981), Statistical Methods for Rates and Proportions (2nd ed.), New York: Wiley.

HOLLAND, P.W., and ROSENBAUM, P.R. (1986), "Conditional Association and Unidimensionality in Monotone Latent Variable Models," The Annals of Statistics, 14(4), 1523–1543.

JENSEN, F.V. (1996), An Introduction to Bayesian Networks, New York: Springer-Verlag.

JUNKER, B.W., and ELLIS, J.L. (1997), "A Characterization of Monotone Unidimensional Latent Variable Models," The Annals of Statistics, 25(3), 1327–1343.

KIM, S.H. (2002), "Calibrated Initials for an EM Applied to Recursive Models of Categorical Variables," Computational Statistics and Data Analysis, 40(1), 97–110.

KIM, S.H. (2005), "Stochastic Ordering and Robustness in Classification from a Bayesian Network," Decision Support Systems, 39, 253–266.

LAURITZEN, S.L., and SPIEGELHALTER, D.J. (1988), "Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems," Journal of the Royal Statistical Society B, 50(2), 157–224.

MESSATFA, H. (1992), "An Algorithm to Maximize the Agreement Between Partitions," Journal of Classification, 9, 5–15.

MISLEVY, R.J. (1994), "Evidence and Inference in Educational Assessment," Psychometrika, 59(4), 439–483.

MSBNx [computer program] (2001), http://research.microsoft.com/en-us/um/redmond/groups/adapt/msbnx/.

PEARL, J. (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.

PIRES, A.M., and BRANCO, J.A. (1997), "Comparison of Multinomial Classification Rules," Journal of Classification, 14, 137–145.

SPITZER, R.L., COHEN, J., FLEISS, J.L., and ENDICOTT, J. (1967), "Quantification of Agreement in Psychiatric Diagnosis," Archives of General Psychiatry, 17, 83–87.
