Learning Bayesian nets that perform well


Learning Bayesian Nets that Perform Well

Russell Greiner, Siemens Corporate Research, 755 College Road East, Princeton, NJ, [email protected]
Adam J. Grove, NEC Research Institute, 4 Independence Way, Princeton, NJ, [email protected]
Dale Schuurmans*, Inst. for Research in Cognitive Science, University of Pennsylvania, Philadelphia, PA, [email protected]

Abstract

A Bayesian net (BN) is more than a succinct way to encode a probabilistic distribution; it also corresponds to a function used to answer queries. A BN can therefore be evaluated by the accuracy of the answers it returns. Many algorithms for learning BNs, however, attempt to optimize another criterion (usually likelihood, possibly augmented with a regularizing term), which is independent of the distribution of queries that are posed. This paper takes the "performance criteria" seriously, and considers the challenge of computing the BN whose performance, read "accuracy over the distribution of queries", is optimal. We show that many aspects of this learning task are more difficult than the corresponding subtasks in the standard model.

Appears in Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI-97), Providence, RI, August 1997.

1 INTRODUCTION

Many tasks require answering questions; this model applies, for example, to both expert systems that identify the underlying fault from a given set of symptoms, and control systems that propose actions on the basis of sensor readings. When there is not enough information to answer a question with certainty, the answering system might instead return a probability value as its answer. Many systems that answer such probabilistic queries represent the world using a "Bayesian net" (BN), which succinctly encodes a distribution over a set of variables. Often the underlying distribution, which should be used to map questions to appropriate responses, is not known a priori. In such cases, if we have access to training examples, we can try to learn the model.

There are currently many algorithms for learning BNs [Hec95, Bun96]. Each such learning algorithm

[*] Also: NEC Research Institute, Princeton, NJ

tries to determine which BN is optimal, usually based on some measure such as log-likelihood (possibly augmented with a "regularizing" term, leading to measures like MDL [LB94] and the Bayesian Information Criterion (BIC) [Sch78]). However, these typical measures are independent of the queries that will be posed. To understand the significance of this, note that we may only care about certain queries (e.g., the probability of certain specific diseases given a set of observed symptoms); and a BN with the best (say) log-likelihood given the sample may not be the one which produces the appropriate answers for the queries we care about. This paper therefore argues that BN-learning algorithms should consider the distribution of queries, as well as the underlying distribution of events, and should therefore seek the BN with the best performance over the query distribution, rather than the one that appears closest to the underlying event distribution.

To make this point more concrete, suppose we knew that all queries will be of the form p(H | J, B) for some assignments to these variables (e.g., Hepatitis, given the possible symptoms Jaundice = false and Blood test = true). Given a set of examples, our learner has to decide which BN (perhaps from some specified restricted set) is best. Now imagine we had two candidate BNs from this set: B1, which performs optimally on the queries p(H | J, B), but does horribly on other queries (e.g., incorrectly claims that J and B are conditionally independent, has the completely wrong values for the conditional probability of H given the treatment T ("Take aspirin"), and so on); versus B2, which is slightly off on the p(H | J, B) queries, but perfect on all other queries. Here, most measures would prefer B2 over B1, as they would penalize B1 for its errors on the queries that will never occur!
Of course, if we really do only care about p(H | J, B), this B2-over-B1 preference is wrong.

This assumes we have the correct distributions, of both the real-world events (e.g., quantities like p(H=1 | J=0, B=1) = 0.42) and the queries that will be posed (e.g., 48% of the queries will be of the form "What is p(H=h | J=j, B=b)?"; 10% will be

"What is p(H=h | S1=v1, S4=v4, S7=v7)?"; etc.). Another, more subtle, problem with the maximal-likelihood-based measures arises when these distributions are not given explicitly, but must instead be estimated from examples. Here, we would, of course, like to use the given examples to obtain good estimates of the conditional probabilities p(H | J, B). In the general maximal-likelihood framework, however, the examples would be used to fit all of the parameters within the entire BN, so we could conceivably "waste" some examples or computational effort learning the values of irrelevant parameters. In general, it seems better to focus the learner's resources on the relevant queries (but see Section 4).

Our general challenge is to acquire a BN whose performance is optimal, with respect to the distribution of queries and the underlying distribution of events. Section 2 first lays out the framework by providing the relevant definitions. Section 3 then addresses several issues related to learning a BN whose accuracy (by this measure) is optimal: presenting the computational/sample complexities of first evaluating the quality of a given BN and then of finding the best BN of a given structure. It then provides methods for hill-climbing to a locally optimal BN. We will see that these tasks are computationally difficult for general classes of queries; Section 3 also presents a particular class of queries for which these tasks are easy. Section 4 then reflects on the general issue of how best to use knowledge of the query distribution to improve the efficiency of learning a good BN under our model. Here we show situations where this information may lead to ways of learning a BN (of a given structure) that are more sample-efficient than the standard approach.
We close this section by discussing how our results compare with others.

Related Results: The framework closest to ours is Friedman and Goldszmidt [FG96], as they also consider finding the BN that is best for some distribution of queries, and also explain why the BN with (say) maximal log-likelihood may not be the one with optimal performance on a specific task. In particular, they note that evaluating a Bayesian net B, given a set of training data D = {⟨c_i, a_i1, ..., a_in⟩}_{i=1..N}, under the log-likelihood measure, amounts to using the formula

  LL(B | D) = Σ_{i=1..N} log B(c_i | a_i1, ..., a_in) + Σ_{i=1..N} log B(a_i1, ..., a_in)

where B(·) is the probability that B assigns to the event "·". If all of the queries, however, ask for the value of c given values of ⟨a_1, ..., a_n⟩, then only the first summation matters. This means that systems that use LL(B | D) to rank BNs could do poorly if the contributions of the second summation dominate those of the first. The [FG96] paper, however, considers only building BNs for classification, i.e., where every query is of the specific form p(C=c | A1=a1, ..., An=an)

where C is the only "consequent" variable, and {A_i} is the set of all other variables; their formulation also implicitly assumes that all possible query-instances (of complete tuples) will occur, and all are equally likely. By contrast, we do not constrain the set of queries to be of this single form, nor do we insist that all queries occur equally often, nor that all variables be involved in each query. Note in particular that we allow the query's antecedents to not include the Markov blanket of the consequent; we will see that this restriction considerably simplifies the underlying computation.

Each of the queries we consider is of the form "p(X=x | Y=y) = ?", where X, Y are subsets of the variables, and x, y are respective (ground) assignments to these variables. As such, they resemble the standard class of "statistical queries", discussed by Kearns and others [Kea93] in the context of noise-tolerant learners.[1] In that model, however, the learner is posing such queries to gather information about the underlying distribution, and the learner's score depends on its accuracy with respect to some other specific set of queries (here the same p(C=c | A1=a1, ..., An=an) expression mentioned above). In our model, by contrast, the learner is observing which such queries are posed by the "environment", as it will be evaluated based on its accuracy with respect to these queries.

Other researchers, including [FY96, Höf93], also compute the sample complexity for learning good BNs. They, however, deal with likelihood-based measures, which (as we shall see) have some fundamental differences from our query-answering based model; hence, our results are incomparable.

2 FRAMEWORK

As a quick synopsis: a Bayesian net is a directed acyclic graph ⟨V, E⟩, whose nodes represent variables, and whose arcs represent dependencies. Each node also includes a conditional-probability table that specifies how the node's values depend (stochastically) on the values of its parents.
(Readers unfamiliar with these ideas are referred to [Pea88].)

In general, we assume there is a stationary underlying distribution p over the N variables V = {V1, ..., VN}. (I.e., p(V1=v1, ..., VN=vN) ≥ 0 and Σ_{v1,...,vN} p(V1=v1, ..., VN=vN) = 1.) For example, perhaps V1 is the "disease" random variable, whose value ranges over {healthy, cancer, flu, ...}; V2 is "gender" ∈ {male, female}; V3 is "body temperature" ∈ [95..105]; etc. We will refer to this as the "underlying distribution" or the "distribution over events".

A statistical query is a term of the form "p(X=x | Y=y) = ?", where X, Y ⊆ V are (possibly empty) subsets of V, and x (resp., y) is a legal assignment to the

[1] Of course, other groups have long been interested in this idea; cf. the work in finding statistical answers from database queries.

elements of X (resp., Y). We let SQ be the set of all possible legal statistical queries,[2] and assume there is a (stationary) distribution over SQ, written sq(X=x ; Y=y), where sq(X=x ; Y=y) is the probability that the query "What is the value of p(X=x | Y=y)?" will be asked. We of course allow Y = {}; here we are requesting the prior value of X, independent of any conditioning. We also write sq(x ; y) to refer to the probability of the query "p(X=x | Y=y) = ?", where the variable sets X and Y can be inferred from the context.

While all of our results are based on such "ground" statistical queries, we could also define sq(X ; Y) to refer to the probability that some query of the general form "p(X=x | Y=y) = ?" will be asked, for some assignments x, y; we could assume that all assignments to these unspecified variables are equally likely as queries. Finally, to simplify our notation, we will often use a single variable, say q, to represent the entire [X=x; Y=y] situation, and so will write sq(q) to refer to sq(X=x ; Y=y).

As a simple example, the claim sq(C ; A1, ..., An) = 1 states that our BN will only be used to find classifications C given the values of all of the variables {A_i}. (Notice this is not asserting that p(C=c | A1=a1, ..., An=an) = 1 for any set of assignments {c, a_i}.) If all variables are binary, this corresponds to the claim that sq(C=c ; A1=a1, ..., An=an) = 1/2^{n+1} for each assignment.
Alternatively, we can use sq(C ; A1, A2, A3) = 0.3, sq(C ; A1, A2) = 0.2, sq(D ; C=1, A1=0, A3) = 0.25, and sq(D ; C=1, A1=1, A3) = 0.25, to state that 30% of the queries involve seeking the conditional probability of C given the observed values of the 3 attributes {A1, A2, A3}; 20% involve seeking the probability of C given only the 2 attributes {A1, A2}; 25% seek the probability of (the different "consequent") D given that C=1, A1=0 and some observed value of A3; and the remaining 25% seek the probability of D given that C=1, A1=1 and some observed value of A3.

If each of the N variables has domain {0, 1}, then the SQ distribution has O(5^N) parameters, because each variable v ∈ V can play one of the following 5 roles in a query:

  [v ∈ X, v=1];  [v ∈ X, v=0];  [v ∈ Y, v=1];  [v ∈ Y, v=0];  [v ∉ X, v ∉ Y]

(We avoid degeneracies by assuming Y ∩ X = {}.)

Notice that we assume that sq(· ; ·) can be, in general, completely unrelated to p(· | ·), because the probability

[2] A query sq(X=x ; Y=y) is "legal" if p(Y=y) > 0. Note also that we use CAPITAL letters to represent single variables, lowercase letters for the values that the variables might assume, and the boldface font when dealing with sets of variables or values.
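The example query distribution above is just a weighted list of query "shapes". For concreteness, here is a minimal sketch of one way it could be written down (the dictionary encoding and all names are ours, not part of the paper's formalism):

```python
# Hypothetical encoding (ours) of the example query distribution sq(.):
# each key is a query shape -- a tuple of consequent variables X and a tuple
# of antecedents Y, where an antecedent is either a bare variable name
# ("some observed value") or a (variable, value) pair (a fixed assignment).
sq = {
    (("C",), ("A1", "A2", "A3")):          0.30,  # p(C | A1, A2, A3)
    (("C",), ("A1", "A2")):                0.20,  # p(C | A1, A2)
    (("D",), (("C", 1), ("A1", 0), "A3")): 0.25,  # p(D | C=1, A1=0, A3)
    (("D",), (("C", 1), ("A1", 1), "A3")): 0.25,  # p(D | C=1, A1=1, A3)
}

# sq(.) is a probability distribution over query shapes.
assert abs(sum(sq.values()) - 1.0) < 1e-12
```

Ground statistical queries would then be drawn by choosing a shape with these probabilities and filling in values for the unassigned antecedent variables.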

ity of being asked about sq(X=x ; Y=y) need not be correlated (or at least, not in any simple way) with the value of the conditional probability p(X=x | Y=y). The fact that the underlying p(·) is stationary simply means that the query sq(· ; ·) has a determinate answer, given by the true conditional probability p(X=x | Y=y) ∈ [0, 1]. In general, we call each tuple ⟨X=x, Y=y, p(X=x | Y=y)⟩ a "labeled statistical query".

Now fix a network B (over V), and let B(x | y) = B(X=x | Y=y) be the real value (probability) that B returns for this assignment. Given distribution sq(· ; ·) over SQ, the "score" of B is

  err_{sq,p}(B) = Σ_{x,y} sq(x ; y) [B(x | y) − p(x | y)]²    (1)

where the sum is over all assignments x, y to all subsets X, Y of variables. (We will often write this simply as err(B) when the distributions sq and p are clear from the context.) Note this depends on both the underlying distribution p(·) over V and the sq(·) distribution over queries SQ.

Given a set of labeled statistical queries Q = {⟨x_i, y_i, p_i⟩}_i, we let

  err_Q(B) = (1/|Q|) Σ_{⟨x,y,p⟩∈Q} [B(x | y) − p]²

be the "empirical score" of the Bayesian net.

For comparison, we will later use KL(B) = Σ_d p(d) log[p(d)/B(d)] to refer to the Kullback-Leibler divergence between the correct distribution p(·) and the distribution represented by the Bayesian net B(·). Given a set D of event tuples, we can approximate this score using KL_D(B) = (1/|D|) Σ_{d∈D} log[(1/|D|)/B(d)]. Note (1) that small KL divergence corresponds to large (log-)likelihood, and (2) that neither KL(B) nor KL_D(B) depends on sq(·).

Finally, let SQ_B ⊆ SQ be the class of queries whose "consequent" is a single literal X = {V}, and whose "antecedents" Y are a superset of V's Markov blanket with respect to the BN B; we will call these "Markov-blanket queries".

3 LEARNING ACCURATE BAYESIAN NETS

Our overall goal is to learn the Bayesian net with the optimal performance, given examples of both the underlying distribution and of the queries that will be posed (i.e.,
instances of {V1=v1, ..., VN=vN} tuples and instances of SQ, possibly labeled).

Observation 1: Any Bayesian net B* that encodes the underlying distribution p(·) will in fact produce the optimal performance; i.e., err(B*) will be optimal.
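For distributions small enough to enumerate, both the score of Equation 1 and Observation 1 can be checked directly. Here is a minimal sketch on a toy two-variable domain (all names and the joint-table representation are ours):

```python
VARS = ("A", "B")  # toy domain: two binary variables

def cond_prob(joint, x, y):
    """p(X=x | Y=y) from a full joint table `joint`, which maps complete
    assignments (tuples aligned with VARS) to probabilities."""
    match = lambda a, cond: all(a[VARS.index(v)] == val for v, val in cond.items())
    num = sum(pr for a, pr in joint.items() if match(a, {**x, **y}))
    den = sum(pr for a, pr in joint.items() if match(a, y))
    return num / den

def err(sq, p_joint, b_joint):
    """Eq. (1): err(B) = sum over queries (x, y) of
    sq(x; y) * (B(x|y) - p(x|y))^2."""
    return sum(w * (cond_prob(b_joint, dict(x), dict(y))
                    - cond_prob(p_joint, dict(x), dict(y))) ** 2
               for (x, y), w in sq.items())

# True distribution p: A and B are perfectly correlated.
p = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
# A candidate net B that (wrongly) treats A and B as independent.
b = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Every query asks for p(B=1 | A=1): true answer 1.0, while B answers 0.5.
queries = {((("B", 1),), (("A", 1),)): 1.0}

print(err(queries, p, p), err(queries, p, b))  # 0.0 0.25
```

As Observation 1 states, a net encoding p itself scores 0.0; the independence-assuming net pays (0.5 − 1.0)² = 0.25 on the only query asked.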

(However, the converse is not true: there could be nets whose performance is perfect on the queries that interest us, i.e., err(B*) = 0, but which are otherwise very different from the underlying distribution.)

From this observation we see that, if we have a learning algorithm that produces better and better approximations to p(·) as it sees more training examples, then in the limit the sq(·) distribution becomes irrelevant. Given a small set of examples, however, the sq(·) distribution can play an important role in determining which BN is optimal. This section considers both the computational and sample complexity of this underlying task. Subsection 3.1 first considers the simple task of evaluating a given network, as this information is often essential to learning a good BN. Subsection 3.2 then analyses the task of filling in the optimal CP-tables for a given graphical structure, and Subsection 3.3 discusses a hill-climbing algorithm for filling these tables, to produce a BN whose accuracy is locally optimal.

3.1 COMPUTING err(B)

It is easy to compute the estimate KL_D(B) of KL(B), based on examples of complete tuples D drawn from the p(·) distribution. In contrast, it is hard to compute the estimate err_Q(B) of err(B) from general statistical queries; in fact, it is not even easy to approximate this estimate.

Theorem 2 ([Rot96, DL93]): It is #P-hard[3] to compute err_Q(B) over a set of general queries Q ⊆ SQ. It is NP-hard to even estimate this quantity to within an additive factor of 0.5.

The reason is that evaluating the score for an arbitrary Bayesian network requires evaluating the posterior probabilities of events in Q, which is known to be difficult in general.
In fact, this is hard even if we know the distribution p(·) and consider only a single (known) form for the query.

Note, however, that this computation is much easier in the SQ_B case, because there is a trivial way to evaluate a Bayesian net on any Markov-blanket query [Pea88], and hence to compute the score.

There is an obvious parallel between estimating err_{Q'}(B) when dealing with SQ_B queries Q', and estimating KL_{D'}(B) from complete tuples D' [Hec95]: both tasks are quite straightforward, basically because their respective Bayesian net computations are simple. Similarly, it can be challenging to compute err_Q(B)

[3] Roughly speaking, #P is the class of problems corresponding to counting the number of satisfying assignments of a satisfiability problem, and thus #P-hard problems are at least as difficult as problems in NP.

in the general SQ case, or to estimate KL_D(B) from incomplete tuples D [CH92, RBKK95], as here the Bayesian net computations are inherently intractable. We will see these parallels again below.

Another challenge is computing the sample complexity of gathering the information required to compute the score for a network. It is easy to collect a sufficient number of examples if we are considering learning from labeled statistical queries. Here, a simple application of Hoeffding's Inequality [Hoe63] shows:[4]

Theorem 3: Let

  err_{S_LSQ}(B) = (1/M_LSQ) Σ_{⟨q,p⟩∈S_LSQ} (B(q) − p)²

be the empirical score of the Bayesian net B, based on a set S_LSQ of

  M_LSQ = M_LSQ(ε, δ) = (1/(2ε²)) ln(2/δ)

labeled statistical queries, drawn randomly from the sq(·) distribution and labeled by p(·). Then, with probability at least 1−δ, err_{S_LSQ}(B) will be within ε of err(B); i.e., P[ |err_{S_LSQ}(B) − err(B)| < ε ] ≥ 1−δ, where this distribution is over all sets of M_LSQ(ε, δ) randomly drawn statistical queries.

A more challenging question is: What if we only get unlabeled queries, together with examples of the underlying distribution? Fortunately, we again need only a polynomial number of (unlabeled) query examples. Unfortunately, we need more information before we can bound the number of event examples required. To see this, imagine sq(·) puts all of the weight on a single query, i.e., sq(X=1 ; Y=1) = 1. Hence, a BN's accuracy depends completely on its performance on this query, which in turn depends critically on the true conditional probability p(X=1 | Y=1). The only event examples relevant to estimating this quantity are those with Y=1; of course, these examples only occur with probability p(Y=1). Unfortunately, this probability can be arbitrarily small. Further, even if p(Y=1) ≈ 0, the true value of p(X=1 | Y=1) can still be large (e.g., if X is equal to Y, then p(X=1 | Y=1) = 1, even if p(Y=1) = 1/2^n).
Hence, we cannot simply ignore such queries (as sq(X=x ; Y=y) can be high), nor can we assume the resulting value will be near 0 (as p(X=x | Y=y) can be high).

We can still estimate the score of a BN, in the following on-line fashion:

Theorem 4: First, let S_SQ = {sq(x_i ; y_i)}_i be a set of

  M_SQ(ε, δ) = (2/ε²) ln(4/δ)

[4] Proofs for all new theorems, lemmas and corollaries appear in [GGS97].

unlabeled statistical queries drawn randomly from the sq(· ; ·) distribution. Next, let S_D be the set of (complete) examples sequentially drawn from the underlying distribution p(·), until it includes at least

  M'_D(ε, δ) = (8/ε²) ln(2 M_SQ / δ)

instances that match each y_i value; notice S_D may require many more than M'_D examples. (The "legal query" requirement p(y_i) > 0 ensures that this collection process will terminate, with probability 1.) Finally, let p^(S_D)(x_i | y_i) be the empirically observed estimate of p(x_i | y_i), based on this S_D set. Then, with probability at least 1−δ,

  err_{S_SQ,S_D}(B) = (1/|S_SQ|) Σ_{⟨x,y⟩∈S_SQ} [B(x | y) − p^(S_D)(x | y)]²

will be within ε of err(B); i.e., P[ |err_{S_SQ,S_D}(B) − err(B)| < ε ] ≥ 1−δ.

We can, moreover, get an a priori bound on the total number of event examples if we can bound the probability of the query's conditioning events. That is:

Corollary 5: If we know that all queries encountered, sq(x ; y), satisfy p(y) ≥ γ for some γ > 0, then we need only gather

  M_D(ε, δ, γ) = max{ (2/γ) [M'_D + ln(4 M_SQ / δ)], (8/ε²) ln(4 M_SQ / δ) }

complete event examples, along with

  M_SQ(ε, δ) = (2/ε²) ln(4/δ)

example queries, to obtain an ε-close estimate, with probability at least 1−δ.

Of course, as γ can be arbitrarily small (e.g., o(1/2^n) or worse), this M_D bound can be arbitrarily large in terms of the size of the Bayesian net. Note also that the Friedman and Yakhini [FY96] bound similarly depends on the "skewness" of the distribution, which they define as the smallest non-zero probability of an event, over all atomic events.[5]

Two final comments: (1) Recall that these bounds describe only how many examples are required, not how much work is required given this information. Unfortunately, using these examples to compute the score of a BN requires solving a #P-hard problem; see Theorem 2.
(2) The sample complexity results hold for estimating the accuracy of any system for representing arbitrary distributions, not just BNs.

[5] Höffgen [Höf93] was able to avoid this dependency, in certain "log-loss" contexts, by "tilting" the empirical distribution to avoid 0-probability atomic events. That trick does not apply to our query-based error measure.
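The sample-size bounds of Theorems 3 and 4 are simple closed forms, and easy to evaluate numerically; a small sketch (function names are ours):

```python
from math import ceil, log

def m_lsq(eps, delta):
    """Theorem 3: number of labeled statistical queries sufficient for the
    empirical score to be within eps of err(B) with probability >= 1 - delta:
    (1 / (2 eps^2)) ln(2 / delta)."""
    return ceil(log(2 / delta) / (2 * eps ** 2))

def m_sq(eps, delta):
    """Theorem 4: number of unlabeled statistical queries,
    (2 / eps^2) ln(4 / delta)."""
    return ceil((2 / eps ** 2) * log(4 / delta))

print(m_lsq(0.05, 0.01), m_sq(0.05, 0.01))  # 1060 4794
```

Note the 1/ε² dependence: halving the tolerance ε quadruples the required number of queries, while tightening the confidence δ costs only logarithmically.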

3.2 COMPUTING OPTIMAL CP-tables FOR A GIVEN NETWORK STRUCTURE

The structure of a Bayesian net, in essence, specifies which variables are directly related to which others. As people often know this "causal" information (at least approximately), many BN-learners actually begin with a given structure, and are expected to use training examples to "complete" the BN by filling in the "strength" of these connections; i.e., to learn the CP-table entries. To further motivate this task of "fitting" good CP-tables to a given BN structure, note that it is often the key subroutine of the more general BN-learning systems, which must also search through the space of structures. This subsection addresses both the computational and sample complexity of finding this best (or near-best) CP-table. Subsection 3.3 next suggests a more practical, heuristic approach.

Stated more precisely, the structure of a specific Bayesian net is a directed acyclic graph ⟨V, E⟩ with nodes V and edges E ⊆ V × V. There are, of course, (uncountably) many BNs with this structure, corresponding to all ways of filling in the CP-tables. Let BN(V, E) denote all such BNs.

We now address the task of finding a BN B ∈ BN(V, E) whose score is, with high probability, (near) minimal among this class; i.e., find B such that

  err(B) ≤ ε + min_{B'∈BN(V,E)} err(B')

with probability at least 1−δ, for small ε, δ > 0. As in Subsection 3.1, our learner has access to either labeled statistical queries drawn from the query distribution sq(·) over SQ, or unlabeled queries from sq(·) together with event examples drawn from p(·).

Unfortunately this task, like most other interesting questions in the area, appears computationally difficult in the worst case.
In fact, we prove below the stronger result that finding the (truly) optimal Bayesian net is not just NP-hard, but is actually non-approximatable:

Theorem 6: Assuming P ≠ NP, no polynomial-time algorithm (using only labeled queries) can compute the CP-tables for a given Bayesian net structure whose error score is within a sufficiently small additive constant of optimal. That is, given any structure ⟨V, E⟩ and a set of labeled statistical queries Q, let B_{⟨V,E⟩,Q} ∈ BN(V, E) have the minimal error over Q; i.e., ∀B' ∈ BN(V, E), err_Q(B_{⟨V,E⟩,Q}) ≤ err_Q(B'). Then (assuming P ≠ NP) there is some constant γ > 0 such that no polynomial-time algorithm can always find a solution within γ of optimal; i.e., no poly-time algorithm can always return a B''_{⟨V,E⟩,Q} such that err_Q(B''_{⟨V,E⟩,Q}) − err_Q(B_{⟨V,E⟩,Q}) ≤ γ.

In contrast, notice that the analogous task is trivial in

the log-likelihood framework: given complete training examples (and some benign assumptions), the CP-table that produces the optimal maximal-likelihood BN corresponds simply to the observed frequency estimates [Hec95].

However, the news is not all bad in our case. Although the problem may be computationally hard, the sample complexity can be polynomial. That is (under certain conditions; see below), if we draw a polynomial number of labeled queries, and (somehow!) find the BN B that gives minimal error for those queries, then with high probability B will be within ε of optimal over the full distribution sq(·).

We conjecture that the sample complexity result is true in general. However, our result below uses the following annoying, but extremely benign, technical restriction. Let T = {y | sq(x ; y) > 0} be the set of all conditioning events that might appear in queries (often T will simply be the set of all events). For any c > 1, define

  BN_{T ≥ 1/2^{c^N}}(V, E) = {B ∈ BN(V, E) | ∀y ∈ T, B(y) > 1/2^{c^N}}

to be the subset of BNs that assign, to each conditioning event, a probability that is bounded below by the doubly-exponentially small number 1/2^{c^N}. (Recall that N = |V|, the number of variables.) We now restrict our attention to these Bayesian nets.[6]

Theorem 7: Consider any Bayesian net structure ⟨V, E⟩, requiring the specification of K CP-table entries CPT = {[q_i|r_i]}_{i=1..K}. Let B* ∈ BN_{T ≥ 1/2^{c^N}}(V, E) be the BN that has minimum empirical score with respect to a sample of

  M'_LSQ(ε, δ) = (14/ε²) [ log(2/δ) + K log(2K/ε) + NK log(2 + cε log ε) ]

labeled statistical queries from sq(·). Then, with probability at least 1−δ, B* will be no more than ε worse than B_opt, where B_opt is the BN with optimal score among BN_{T ≥ 1/2^{c^N}}(V, E) with respect to the full distribution sq(·).

This theorem is nontrivial to prove, and in particular is not an immediate corollary to Theorem 3.
That earlier result shows how to estimate the score for a

[6] Conceivably (although we conjecture otherwise) there could be some sets of queries and some graphs ⟨V, E⟩ such that the best performance is obtained with extremely small CP-table entries; e.g., of order o(1/2^{2^{2n}}). (But note that numbers this small can require doubly-exponential precision just to write down, so such BNs would perhaps be impractical anyway!) Our result assumes that, even if such BNs do allow improved performance, we are not interested in them.

single fixed BN, allowing δ probability of error. But since B* is chosen after the fact (i.e., to be optimal on the training set), we cannot have the same confidence that we have estimated its score correctly. Instead, we must use sufficiently many examples so that the simultaneously estimated scores for all (uncountably many) B' ∈ BN_{T ≥ 1/2^{c^N}}(V, E) are all within ε of the true values, with collective probability of at most δ that there is any error. Only then can we be confident about B*'s accuracy. (See proof in [GGS97].)

As in Section 3.1, we can also consider the slightly more complex task of learning the CP-table entries from unlabeled statistical queries sq(X=x ; Y=y), augmented with examples of the underlying distribution p(·). However, as above, this is a straightforward extension of the "learning from labeled statistical queries" case: one first draws a slightly larger sample of unlabeled statistical queries, and then uses a sufficient sample of domain tuples to accurately estimate the labels for each of these queries (hence simulating the effect of drawing fewer, but still sufficiently many, labeled statistical queries). Here we encounter the same caveats: each of the unlabeled statistical queries sq(X=x ; Y=y) must involve conditioning events Y=y that occur with some nontrivial probability p(Y=y) > 0 (for otherwise one could not put a nontrivial upper bound on the number of tuples needed to learn a good setting of the CP-table entries). A detailed statement and proof of this result is a straightforward extension of Theorem 7, so we omit the details here. (See [GGS97].)

The point is that, from a sample complexity perspective, it is feasible to learn near-optimal settings for the CP-table entries in a fixed Bayesian network structure under our model. The only difficult part is that actually computing these optimal entries from (a polynomial number of) training samples is hard in general; cf. Theorem 6.
In fact, we will see, in Section 4, that it is not correct to simply fill each CP-table entry with the frequency estimates.

3.3 HILL CLIMBING

It should not be surprising that finding the optimal CP-tables was computationally hard, as this problem has a lot in common with the challenge of learning the KL(·)-best network given partially specified tuples, a task for which people often use iterative steepest-ascent climbing methods [RBKK95]. We now briefly consider the analogous approach in our setting.

Given a single labeled statistical query ⟨x, y, p(x | y)⟩, consider how to change the value of the CP-table entry [q|r], whose current value is e_{q|r}. We use the following lemma:

Lemma 8: Let B be a Bayesian net whose CP-table includes the value e_{Q=q|R=r} = e_{q|r} ∈ [0, 1] as the value for the conditional probability of Q=q given R=r. Let sq(X ; Y) be a statistical query, to which

B assigns the probability B(X | Y). Then the derivative of B(X | Y), with respect to the value e_{q|r}, is

  dB(X | Y)/d e_{q|r} = (1/e_{q|r}) B(X | Y) [B(q, r | X, Y) − B(q, r | Y)]    (2)

As B produced the score B(X | Y) here, the error for this single query is err_{⟨X,Y⟩}(B) = (B(X | Y) − p)². To compute the gradient of this error value, as a function of this single CP-table entry (using Equation 2):

  d err_{⟨X,Y⟩}(B)/d e_{q|r} = 2(B(X | Y) − p) dB(X | Y)/d e_{q|r}

Thus, letting C = 2(B(X | Y) − p), we get

  C dB(X | Y)/d e_{q|r} = (C/e_{q|r}) [B(q, r, X | Y) − B(X | Y) B(q, r | Y)]
                        = (C/e_{q|r}) B(X | Y) [B(q, r | X, Y) − B(q, r | Y)]    (3)

We can then use this derivative to update the e_{q|r} value, by hill-climbing in the direction of the gradient (i.e., gradient ascent). Of course, Equation 3 provides only that component of the gradient derived from a single query; the overall gradient for e_{q|r} will involve summing these values over all queries (or perhaps all queries in some sample). Furthermore, we must constrain the gradient ascent to move only such that Σ_{q'} e_{q'|r} remains 1 (i.e., the sum of the probabilities of all possible values for Q, given a particular valuation for Q's parents, must be 1). However, the techniques involved are straightforward and well-known, so we omit further analysis here.

Notice immediately from Equation 3 that we will not update e_{q|r} (at least, not because of the query sq(X ; Y)) if the difference B(X | Y) − p is 0 (i.e., if B(X | Y) is correct), or if B(q, r | X, Y) − B(q, r | Y) is 0 (i.e., if Y "d-separates" X and q, r); both of which make intuitive sense.

Unfortunately, we see that evaluating the gradient requires computing conditional probabilities in a BN. This is analogous to the known result in the traditional model [RBKK95]. It thus follows that it can be #P-hard to evaluate this gradient in general (see Theorem 2).
However, in special cases, i.e., BNs for which inference is tractable, efficient computation is possible.

One demonstration of this is the class of "Markov-blanket queries" SQ_B (recall the definition at the end of Section 2). Carrying out the gradient computation is easy in this case: here, when updating the [q|r] entry, we can ignore queries sq(X ; Y) if [q|r] is outside of its Markov blanket. We therefore need only consider the queries sq(X ; Y) where {Q} ∪ R ⊆ X ∪ Y and, moreover, where Q=q is consistent with X's assignment; for these queries, the gradient is

  d err_{<X,Y>}(B) / d e_{q|r}  =  [ 2 (B(X | Y) - p) / e_{q|r} ] B(X | Y) (1 - B(X | Y))        (4)

which follows from Equation 3 using B(q, r | X, Y) = 1, as q, r is consistent with X's claim (recall we ignore sq(X; Y) otherwise), and observing that B(q, r | Y) reduces to B(X | Y), as the part of {Q} and R already in Y is irrelevant. Notice Equation 4 is simple to compute, as it involves no non-trivial Bayesian net computation; see the simple algorithms in [Pea88].

4 HOW CAN THE QUERY DISTRIBUTION HELP?

Our intuition throughout this paper is that having access to the distribution of queries should allow us to learn better and more efficiently than if we only get to see domain tuples alone. Is this really true?

Note that the simplest and most standard approach to learning CP-table entries is simply to fill in each CP-table entry with the observed frequency estimates [OFE] obtained from p(.). Note that this ignores any information about the query distribution. Unfortunately, OFE is not necessarily a good idea in our model, even if we have an arbitrary number of examples. This follows immediately from Theorem 6: if the standard OFE algorithm were always successful, then we would have a trivial polynomial-time algorithm for computing a near-optimal CP-table for a fixed Bayesian net structure, which cannot exist (unless P = NP). Yes, Observation 1 does claim that the optimal BN is a faithful model of the event distribution, meaning in particular that the value of each CP-table entry [q|r] can be filled with the true probability p(q | r).
However, this claim is not true in general in the current context, where we are seeking the best CP-table entries for a given network structure, as this network structure might not correspond to the true conditional independence structure of the underlying distribution p(.).

In the case where the BN structure does not correspond to the true conditional independence structure of the underlying p(.), ignoring the query distribution and using straight OFE can lead to arbitrarily bad results:

Example 4.1  Suppose the BN structure is simply A -> X -> C, and the only labeled queries are <C, A, 1.0> and <C, ~A, 0.0>. (I.e., A = C with probability 1, and we are only interested in querying C given A or ~A.) Suppose further that the intervening X is completely independent of A and C; i.e., p(X | A) = p(X | ~A) = p(C | X) = p(C | ~X) = 0.5. (Note that this BN structure is seriously wrong.)
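As a quick sanity check of this setup, here is a short sketch (our own illustration; the function names are ours) that computes B(C = 1 | A = a) for any CP-table filling of the chain A -> X -> C, and scores it against the two labeled queries:

```python
# Chain A -> X -> C:  B(C=1 | A=a) = sum_x P(X=x | A=a) P(C=1 | X=x).

def chain_posterior(p_x_given_a, p_c_given_x, a):
    """B(C=1 | A=a) for the chain, given CP-table entries as dicts."""
    px1 = p_x_given_a[a]                                  # P(X=1 | A=a)
    return px1 * p_c_given_x[1] + (1 - px1) * p_c_given_x[0]

def err(p_x_given_a, p_c_given_x):
    """Average squared error over the queries <C, A, 1.0> and <C, ~A, 0.0>."""
    e1 = (chain_posterior(p_x_given_a, p_c_given_x, 1) - 1.0) ** 2
    e0 = (chain_posterior(p_x_given_a, p_c_given_x, 0) - 0.0) ** 2
    return (e1 + e0) / 2

# B_p: the faithful filling, all entries 0.5.
err_bp = err({1: 0.5, 0: 0.5}, {1: 0.5, 0: 0.5})    # -> 0.25
# B_sq: make X track A and C track X deterministically.
err_bsq = err({1: 1.0, 0: 0.0}, {1: 1.0, 0: 0.0})   # -> 0.0
```

The two scores reproduce the values claimed for err(B_p) and err(B_sq) in the discussion that follows.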

In this situation, the BN that most faithfully follows the event distribution, B_p, would have CP-table entries e_{X|A} = e_{X|~A} = e_{C|X} = e_{C|~X} = 0.5, with a performance score of err(B_p) = 0.25. (Recall that e_{X|A} is the CP-table entry that specifies the probability of X, given that A holds; etc.) Now consider B_sq, whose entries are e_{X|A} = e_{C|X} = 1.0 and e_{X|~A} = e_{C|~X} = 0.0; i.e., make X = A and C = X. While B_sq clearly has the X-dependencies completely wrong, its score is perfect, i.e., err(B_sq) = 0.0 (see footnote 7).

Thus, filling CP-table entries with observed frequency estimates, or using any other technique that converges to B_p, leads to a bad solution in this case, no matter how many training examples are used. On the other hand, consider a learning procedure that (knowing the query distribution!) ignores the X variable completely and directly estimates the conditional probabilities p(A | C) and p(A | ~C) before filling in the CP-table entries (i.e., which is isomorphic to B_sq). This would eventually learn a perfect classifier. Of course, such a procedure might have to be based on the (impractical) learning techniques developed in Theorem 7, or perhaps (more realistically) use the heuristic hill-climbing strategies presented in Section 3.3.

What about the case when the proposed network structure is correct? Here we know that the standard OFE approach eventually does converge to an optimal CP-table setting for any query distribution (Observation 1). So, unlike the case of an incorrect structure, there is no reason in the large-sample-size limit to consider the query distribution. But what about the more realistic situation, where the sample is finite? The question then is:

Given that the known structure is correct, can we exploit knowing the true query distribution?

There is one simple sense in which the answer can certainly be yes.
It is clearly safe to restrict our attention to those nodes of the BN that are not d-separated from every query variable by the conditioning variables that appear in the queries. That is, if a BN B contains the edge U -> V, and the query

Footnote 7: The same issue is relevant to understanding the restriction in Theorem 7 to BN_T^{1/2cN}(V, E). One might consider removing this restriction by assuming that all conditioning events y in T have significant probability according to p(.); i.e., that they are not too unlikely (a la Corollary 5). But Theorem 7 does not make this assumption, for an important reason. Note that if we are not directly interested in queries about p(y), then the BN we use to answer queries is not constrained to agree with p(y). In particular, if it helps to get better answers on the queries that do occur, the optimal BN B_opt could (conceivably) "set" B_opt(y) to be extremely small; knowing that p(y) is perhaps large is just irrelevant. Our theorem, which instead assumes that the former quantity is not too small, simply would not be helped by (what might seem to be more natural) restrictions on p(y).

distribution sq(X = x; Y = y) is such that, for every query "p(X = x | Y = y) = ?", both U and V are d-separated from X by Y, then we know that the CP-table entry e_{v|u} cannot affect B's answer to the query, B(X | Y). Thus, it seems clear that we do not need to bother estimating e_{v|u} here. Now suppose we have a learning algorithm that uses a computed sample-size bound (which grows with the number of parameters to be estimated) in order to provide certain performance guarantees. Here, our knowledge of the query distribution will reduce the effective size of the BN, which will allow us to stop learning after fewer samples. Thus, using the query distribution can give an advantage here, although only in a rather weak sense: the basic learning technique might still amount to filling in the CP-table entries with frequency estimates obtained from the underlying distribution p(.); the only win is that we will know it is safe to stop earlier because only a small fragment of the network is relevant.

Can one do better than simply filling in CP-table entries with frequency estimates, given that the BN structure is correct? As we now show, this question does not seem to have a simple answer.

Motivated by Example 4.1, one might ignore the BN-structure in general, and just directly estimate the conditional probabilities for the queries of interest. Note that this is guaranteed to converge to an optimal solution, eventually, even if the BN structure is incorrect. However, it can be needlessly inefficient in some cases, especially if the postulated BN structure is known to be correct. This is because the BN structure can provide valuable knowledge about the distribution.

Example 4.2  Consider the standard "Naive Bayes" model with n + 1 binary attributes C, A1, ..., An such that the Ai are conditionally independent given C; see Figure 1.

[Figure 1: Bayesian network structure for Example 4.2 ("Naive Bayes"): node C with an arrow to each of A1, A2, A3, ..., An.]
Suppose the single query of interest is "p(C = 0 | A1 = 0, A2 = 0, ..., An = 0) = ?". If we attempt to learn this probability directly, we must wait for the (possibly very rare) event that A1 = A2 = ... = An = 0; it is easy to construct situations where this will require an exponential (expected) number of examples. However, if we use the BN-structure, we can compute the required probability as soon as we have learned the 2n + 1 probabilities p(C = 0) and p(Ai = 0 | C = 0), p(Ai = 0 | C = 1) for all i. If p(C = 0) is near 1/2, these probabilities will all be learned accurately after relatively few samples.

This example might suggest that we should always try to learn the BN as accurately as possible, and ignore the query distribution (e.g., just use OFE). However, there are other examples in which this approach would hurt us:

Example 4.3  Consider the "reverse" BN-structure from Example 4.2, where the arrows are now directed Ai -> C instead of C -> Ai (Figure 2), and assume we are only interested in queries of the form "p(C | {}) = ?".

[Figure 2: Bayesian network structure for Example 4.3 ("Reverse Naive Bayes"): each of A1, A2, A3, ..., An with an arrow to node C.]

Here the strategy that estimates p(C = c) directly (hence, ignoring the given BN-structure) dominates the standard approach of estimating the CP-table entries. To see this, note that for any reasonable training sample size N << 2^n, the frequency estimates for most of the 2^n CP-table entries e_{c|a1,...,an} will be undefined. Even using Laplace adjustments to compensate for this, the accuracy of the resulting BN estimate B(C = c | {}) will be poor unless p(C = c) happens to be near 0.5. So, for example, if the true distribution p(.) is such that C = A1 AND parity(A2, ..., An), and p(Ai = 1) = 0.5, then p(C = 1) will be equal to 0.25. But here the BN estimator (using Laplace adjustments) will give a value of B(C = 1 | {}) near 0.5 for any training sample size N << 2^n, whereas the direct strategy will quickly converge to an accurate estimate p(C = 1) near 0.25.
(Note that the standard maximum-likelihood BN estimator will not even give a defined value for B(C = 1 | {}) in this case, since most of the possible a1, ..., an patterns will remain unobserved.)

In general, the question

  What should we actually do if the BN structure is known (or assumed) to be correct, and we are training on a (possibly small) sample of complete instances?

remains open, and is an interesting direction for future work; it asks, in essence: should we trust the given BN-structure, or the given query distribution? The previous two examples suggest that the answer is not a trivial one.
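To make Example 4.2's contrast concrete, here is a short sketch (our own, with illustrative parameter values) showing how the Naive Bayes structure answers the hard query from the 2n + 1 learned parameters, without ever observing the rare joint event A1 = ... = An = 0:

```python
def nb_posterior(p_c0, p_ai0_given_c0, p_ai0_given_c1):
    """p(C=0 | A1=0,...,An=0) for a Naive Bayes net, computed from the
    2n + 1 parameters p(C=0) and p(Ai=0 | C=c) via Bayes' rule."""
    num0, num1 = p_c0, 1.0 - p_c0
    for q0, q1 in zip(p_ai0_given_c0, p_ai0_given_c1):
        num0 *= q0   # accumulate prod_i p(Ai=0 | C=0)
        num1 *= q1   # accumulate prod_i p(Ai=0 | C=1)
    return num0 / (num0 + num1)

# Illustrative values (not from the paper): n = 20, p(C=0) = 0.5,
# p(Ai=0 | C=0) = 0.3 and p(Ai=0 | C=1) = 0.7 for every i.
n = 20
post = nb_posterior(0.5, [0.3] * n, [0.7] * n)
```

With these values the joint event A1 = ... = An = 0 has probability under 0.001, so direct estimation would need thousands of samples just to observe it once, while each of the 2n + 1 parameters is informed by every sample.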

5 CONCLUSIONS

Remaining Challenges:  There are of course several other obvious open questions.

First, the analyses above assume that we had the BN-structure, and simply had to fill in the values of the CP-tables. In general, of course, we may have to use the examples to learn that structure as well. The obvious approach is to hill-climb in the discrete, but combinatorial, space of BN structures, perhaps using a system like PALO [Gre96], after augmenting it to climb from one structure S_i to a "neighboring" structure S_{i+1} if S_{i+1}, filled with some CP-table entries, appears better, over a distribution of queries, than S_i with (near) optimal CP-table values. Notice we can often save computation by observing that, for any query q, B1 and B2 will give the same error scores B1(q) = B2(q) if the only differences between B1 and B2 are outside of q's Markov blanket.

The second challenge is how best to accommodate both types of examples: queries (possibly labeled), and domain tuples. As discussed above, sq(.) examples are irrelevant given complete knowledge of p(.) (Observation 1). Similarly, given complete knowledge of the query distribution, we only need p(.) information to provide the labels for the queries.

Of course, these extreme conditions are seldom met; in general, we have only partial information about either distribution. Further, Example 4.1 illustrates that these two corpora of information may lead to different BNs. We therefore need some measured way of combining both types of information, to produce a BN that is appropriate both for the queries that have been seen and for other queries that have not, even if this means degrading the performance on the observed queries.
(That is, the learner should not "overfit" the learned BN to just the example queries it has seen; it should be able to "extend" the BN based on the event distribution.)

Contributions:  As noted repeatedly in Machine Learning and elsewhere, the goal of a learning algorithm should be to produce a "performance element" that will work well on its eventual performance task [SMCB77, KR94]. This paper considers the task of learning an effective Bayesian net within this framework, and argues that the goal of a BN-learner should be to produce a BN whose error, over the distribution of queries, is minimal.

Our results show that many parts of this task are, unfortunately, often harder than the corresponding tasks when producing a BN that is optimal in more familiar contexts (e.g., maximizing likelihood over the sampled event data); see in particular our hardness results for evaluating a BN by our criterion (Theorem 2), for filling in a BN-structure's CP-table (Theorem 6), and for the steps used by the obvious hill-climbing algorithm trying to instantiate these tables. (Note, however, that

our approach is robustly guaranteed to converge to a BN with optimal performance, while those alternative techniques are not.) Fortunately, we have found that the sample requirements are not problematic for our tasks (see Theorems 3, 4, 7 and Corollary 5), given various obvious combinations of example types; we also identify a significant subclass of queries (SQB) in which some of these tasks are computationally easy.

We have also compared and contrasted our proposed approach to filling in the CP-table entries with the standard "observed frequency estimate" method, and found that there are many subtle issues in deciding which of these "algorithms" works best, especially in the small-sample situation. These results are summarized in Table 1.

  Learning       Structure is   Computational               Correct convergence   Small sample
  "Algorithm"                   Efficiency                  (in the limit)
  --------------------------------------------------------------------------------------------
  OFE*           Correct        "easy"                      Yes [Obs 1]           >? QD [Ex 4.2]**
                 Incorrect                                  No [Ex 4.1]
  QD+            Correct        NP-hard to approx. [Th 6]   Yes                   >? OFE [Ex 4.3]**
                 Incorrect                                  Yes++

  *  OFE = Observed Frequency Estimates (or any other algorithm that tries to match the event distribution).
  +  QD uses the CP-table that is "best" for the given query distribution, using samples from that distribution to label queries.
  ++ QD will produce the BN that has minimum error, for this structure.
  ** Our examples illustrate cases in which one "algorithm" (OFE, QD) is more sample-efficient than the other.

  Table 1: Issues when Learning from Distributional Samples

We plan further analysis (both theoretical and empirical) towards determining when this more standard measure, now seen to be computationally simpler, is in fact an appropriate approximation to our performance-based criteria.

References

[Bun96] Wray Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996.

[CH92] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.

[DL93] P. Dagum and M. Luby.
Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141-153, April 1993.

[FG96] Nir Friedman and Moises Goldszmidt. Building classifiers using Bayesian networks. In Proc. AAAI-96, 1996.

[FY96] Nir Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In Proc. Uncertainty in AI, 1996.

[GGS97] Russell Greiner, Adam Grove, and Dale Schuurmans. Learning Bayesian nets that perform well. Technical report, Siemens Corporate Research, 1997.

[Gre96] Russell Greiner. PALO: A probabilistic hill-climbing algorithm. Artificial Intelligence, 83(1-2), July 1996.

[Hec95] David E. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, March 1963.

[Hof93] Klaus-U. Hoffgen. Learning and robust learning of product distributions. In Proc. COLT-93, pages 77-83, 1993. ACM Press.

[Kea93] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Proc. STOC-93, pages 392-401, 1993.

[KR94] Roni Khardon and Dan Roth. Learning to reason. In Proc. AAAI-94, pages 682-687, 1994.

[LB94] Wai Lam and Fahiem Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(4):269-293, 1994.

[Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[RBKK95] Stuart Russell, John Binder, Daphne Koller, and Keiji Kanazawa. Local learning in probabilistic networks with hidden variables. In Proc. IJCAI-95, Montreal, Canada, 1995.

[Rot96] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2), April 1996.

[Sch78] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.

[SMCB77] Reid G. Smith, Thomas M. Mitchell, R. Chestek, and Bruce G. Buchanan. A model for learning systems. In Proc. IJCAI-77, pages 338-343, MIT, 1977. Morgan Kaufmann.