Handbook of Neural Engineering. Edited by Metin Akay. Copyright © 2007 The Institute of Electrical and Electronics Engineers, Inc.


CHAPTER 3

ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS: APPLICATION TO MICROCALCIFICATION DETECTION IN BREAST CANCER DIAGNOSIS

Juan Ignacio Arribas, Jesús Cid-Sueiro, and Carlos Alberola-López

3.1 INTRODUCTION

Neural networks (NNs) are customarily used as classifiers aimed at minimizing classification error rates. However, it is known that NN architectures that compute soft decisions can be used to estimate posterior class probabilities, which is useful when implementing general decision rules other than the maximum a posteriori (MAP) decision criterion. In addition, probabilities provide a confidence measure of the classifier decisions, a fact that is essential in applications in which a high risk is involved.

This chapter is devoted to the general problem of estimating posterior class probabilities using NNs. Two components of the estimation problem are discussed: model selection, on one side, and parameter learning, on the other. The analysis assumes an NN model called the generalized softmax perceptron (GSP), although most of the discussion can be easily extended to other schemes, such as the hierarchical mixture of experts (HME) [1], which has inspired part of our work, or even the well-known multilayer perceptron.

Posterior probability estimates are applied in this chapter to a medical decision support system; the testbed used is the detection of microcalcifications (MCCs) in mammograms, which is a key step in the early diagnosis of breast cancer.

The chapter is organized as follows: Section 3.2 discusses the estimation of posterior class probabilities with NNs, with emphasis on a medical application; Section 3.3 discusses learning and model selection algorithms for GSP networks; Section 3.4 proposes a system for MCC detection based on the GSP; Section 3.5 shows some simulation results on detection performance using a mammogram database; and Section 3.6 provides some conclusions and future trends.


3.2 NEURAL NETWORKS AND ESTIMATION OF A POSTERIORI PROBABILITIES IN MEDICAL APPLICATIONS

3.2.1 Some Background Notes

A posteriori (or posterior) probabilities, or, more generally, posterior density functions, are well-known and widely applied tools in estimation and detection theory [2, 3] and, more broadly, in pattern recognition. The concept underlying posteriors is to update one's prior knowledge about the state of nature (with the term nature adapted to the specific problem one is dealing with) according to the available observation of reality.

Specifically, recall the problem of a binary decision, that is, the need to choose either hypothesis $H_0$ or $H_1$ given both our prior knowledge about the probabilities of each hypothesis being the right one [say $P(H_i)$, $i = 0, 1$] and how the observation $x$ behaves probabilistically under each of the two hypotheses [say $f(x \mid H_i)$, $i = 0, 1$]. In this case, it is well known that the optimum decision procedure is to compare the likelihood ratio, that is, $f(x \mid H_1)/f(x \mid H_0)$, with a threshold. The value of the threshold is, from a Bayesian perspective, a function of the two prior probabilities and a set of costs (set according to the designer's needs) associated with the two decisions. Specifically,

\[
\frac{f(x \mid H_1)}{f(x \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{P(H_0)(C_{10} - C_{00})}{P(H_1)(C_{01} - C_{11})} \tag{3.1}
\]

with $C_{ij}$ the cost of choosing $H_i$ when $H_j$ is correct.

A straightforward manipulation of this equation leads to

\[
\frac{P(H_1 \mid x)}{P(H_0 \mid x)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{C_{10} - C_{00}}{C_{01} - C_{11}} \tag{3.2}
\]

with $P(H_i \mid x)$ the a posteriori probability that hypothesis $H_i$ is true, $i = 0, 1$. For the case $C_{ij} = 1 - \delta_{ij}$, with $\delta_{ij}$ the Kronecker delta function, the equation above leads to the general MAP rule, which can be rewritten, now extended to the M-ary detection problem, as

\[
H^* = \arg\max_{H_i} P(H_i \mid x), \quad i = 1, \ldots, M \tag{3.3}
\]

It is therefore clear that posterior probabilities are sufficient statistics for MAP decision problems.
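The cost-weighted rule of Eq. (3.2) and the MAP rule of Eq. (3.3) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `bayes_decide` and `map_decide` and the default costs are hypothetical choices made for the example, not part of the system described in this chapter.

```python
def bayes_decide(post1, c00=0.0, c01=1.0, c10=1.0, c11=0.0):
    """Return 1 (choose H1) or 0 (choose H0) from the posterior P(H1|x).

    c_ij is the cost of choosing H_i when H_j is correct, as in Eq. (3.1).
    Assumes c10 > c00 and c01 > c11 (errors cost more than correct decisions).
    """
    post0 = 1.0 - post1
    # Eq. (3.2) cross-multiplied, which avoids dividing by a zero posterior:
    # P(H1|x) (C01 - C11) > P(H0|x) (C10 - C00)
    return 1 if post1 * (c01 - c11) > post0 * (c10 - c00) else 0


def map_decide(posteriors):
    """M-ary MAP rule of Eq. (3.3): index of the largest posterior."""
    return max(range(len(posteriors)), key=lambda i: posteriors[i])
```

With the default costs $C_{ij} = 1 - \delta_{ij}$, a posterior of 0.3 leads to choosing $H_0$; raising the cost of a missed detection (for instance `c01=5.0`) flips the decision to $H_1$ for the same posterior, which is how the costs move the operating point.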

3.2.2 Posterior Probabilities as Aid to Medical Diagnosis

Physicians are used to working with the concepts of sensitivity and specificity of a diagnostic procedure [4, 5]. Considering $H_0$ the absence of a pathology and $H_1$ the presence of a pathology, the sensitivity¹ of the test is defined as $P(H_1 \mid H_1)$ and the specificity of the test is $P(H_0 \mid H_0)$. These quantities indicate the probabilities of the test being correct. However, a test may yield false positives (FPs), with probability $P(H_1 \mid H_0)$, and false negatives (FNs), with probability $P(H_0 \mid H_1)$.

¹ In this case, probability $P(H_i \mid H_j)$ should be read as the probability of deciding $H_i$ when $H_j$ is correct.


Clearly, the ideal test is that for which $P(H_i \mid H_j) = \delta_{ij}$. However, in real situations, this is not possible. Sensitivity and specificity are not independent but are related by means of a function known as the receiver operating characteristic (ROC). The physician should therefore choose at which point of this curve to work, that is, how to trade FPs (whose probability is 1 − specificity) against FNs (whose probability is 1 − sensitivity) according to the problem at hand.

Two tests, generally speaking, will perform differently. The way to tell whether one of them outperforms the other is to compare their ROCs. If one ROC is always above the other, that is, if for every fixed specificity the sensitivity of one test is above that of the other, then the former is better. Taking this as the optimality criterion, the key question is the following: Is there a way to design an optimum test? The answer is also well known [3, 6] and is given by the Neyman–Pearson (NP) lemma; a testing procedure built upon this principle has been called the ideal observer elsewhere [7].

The NP lemma states that the optimum test is given by comparing the likelihood ratio

mentioned before with a threshold. Note that this is operationally similar to the procedure

described in Eq. (3.1). However, what changes now is the value of the threshold. This is

set by assuring that the specificity will take on a particular value, set at the physician’s will.

The reader may then wonder whether a testing procedure built upon a ratio of posterior probabilities, as the one expressed in Eq. (3.2), is optimum in the sense stated in this section. The answer is yes: The test described in Eq. (3.1) only differs in the point of the optimum ROC at which it operates; while a test built upon the NP lemma sets where to work in the ROC, the test built upon posteriors does not fix where to work. Its motivation is different: It changes prior knowledge into posterior knowledge by means of the observations and decides according to the costs given to each decision and each state of nature. This assures that the overall risk (in a Bayesian sense) is minimum [2].

The question is therefore: Why should a physician use a procedure built upon posteriors? Here are the answers:

• The test indicated in (3.2) also implements an ideal observer [7]. So, no loss of optimality is incurred.

• As we will point out in the chapter, it is simple to build NN architectures that estimate posteriors out of training data. For nonparametric models, the estimation of both posteriors and likelihood functions is essentially equivalent. However, for parametric models, the procedure to estimate posteriors reflects the objectives of the estimation, which is not the case for the estimation of likelihood functions. Therefore, it seems more natural to use the former approach.

• Working with posteriors, the designer is not only provided with a hard decision about the presence or absence of a pathology but is also provided with a measure of confidence about the decision itself. This is an important piece of additional information that is also encoded in a natural range for human reasoning (i.e., within the interval [0, 1]). This reminds us of well-known fuzzy logic procedures often used in medical environments [8].

• As previously stated, posteriors indicate how our prior knowledge changes according to the observations. The prior knowledge, in particular $P(H_1)$, is in fact the prevalence [5] of an illness or pathology. Assume an NN is trained with data from a region with some degree of prevalence but is then used with patients from some other region where the prevalence of the pathology is known to be different. In this case, accepting that symptoms (i.e., observations) are conditionally equally distributed (CED) regardless of the region where the patient is from, it is simple to adapt the posteriors from one region to the other without retraining the NN.


Specifically, denote by $H_i^j$ the hypothesis $i$ ($i = 0, 1$) in region $j$ ($j = 1, 2$). Then

\[
P(H_i^1 \mid x) = \frac{f(x \mid H_i^1)\,P(H_i^1)}{f^1(x)}, \qquad
P(H_i^2 \mid x) = \frac{f(x \mid H_i^2)\,P(H_i^2)}{f^2(x)} \tag{3.4}
\]

Accepting that symptoms are CED, then $f(x \mid H_i^j) = f(x \mid H_i)$. Therefore

\[
\begin{aligned}
P(H_0^2 \mid x) &= \frac{f^1(x)}{f^2(x)}\,\frac{P(H_0^1 \mid x)}{P(H_0^1)}\,P(H_0^2) = \lambda\,\frac{P(H_0^1 \mid x)}{P(H_0^1)}\,P(H_0^2) \\
P(H_1^2 \mid x) &= \lambda\,\frac{P(H_1^1 \mid x)}{P(H_1^1)}\,P(H_1^2)
\end{aligned} \tag{3.5}
\]

where $\lambda = f^1(x)/f^2(x)$. Since

\[
P(H_0^2 \mid x) + P(H_1^2 \mid x) = 1 \tag{3.6}
\]

then

\[
\lambda \left[ \frac{P(H_0^1 \mid x)}{P(H_0^1)}\,P(H_0^2) + \frac{P(H_1^1 \mid x)}{P(H_1^1)}\,P(H_1^2) \right] = 1 \tag{3.7}
\]

which leads to

\[
P(H_0^2 \mid x) = \frac{[P(H_0^1 \mid x)/P(H_0^1)]\,P(H_0^2)}{[P(H_0^1 \mid x)/P(H_0^1)]\,P(H_0^2) + [P(H_1^1 \mid x)/P(H_1^1)]\,P(H_1^2)} \tag{3.8}
\]

which gives the procedure to update posteriors using as inputs the posteriors of the originally trained NN and the information about the prevalences in the two regions, information that is in the public domain.
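The prevalence correction of Eq. (3.8) can be illustrated with a short Python sketch for the binary case. The function name `adapt_posterior` and its argument names are assumptions introduced for this example; the function returns $P(H_1^2 \mid x)$, that is, one minus the quantity in Eq. (3.8).

```python
def adapt_posterior(post1_r1, prev_r1, prev_r2):
    """Return P(H1|x) for region 2 given P(H1|x) estimated in region 1.

    prev_r1, prev_r2: prevalences P(H1) in regions 1 and 2, both in (0, 1).
    Relies on the CED assumption: f(x|H_i) is the same in both regions.
    """
    post0_r1 = 1.0 - post1_r1
    # Each region-1 posterior is divided by its region-1 prior and
    # multiplied by the corresponding region-2 prior, as in Eq. (3.5).
    term0 = post0_r1 / (1.0 - prev_r1) * (1.0 - prev_r2)
    term1 = post1_r1 / prev_r1 * prev_r2
    # Normalizing removes the unknown factor lambda, as in Eqs. (3.7)-(3.8).
    return term1 / (term0 + term1)


# A posterior of 0.5 obtained where the prevalence is 50% becomes 0.1
# (up to rounding) where the prevalence is only 10%.
p_region2 = adapt_posterior(0.5, prev_r1=0.5, prev_r2=0.1)
```

When both regions share the same prevalence, the function leaves the posterior unchanged, which is a quick sanity check on the formula.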

The reader should notice that we have never stated that decisions based on posteriors are

preferable to those based on likelihood functions. Actually, the term natural has been used

twice in the reasons to back up the use of posteriors for decision making. This naturalness

should be balanced by the physician to make the final decision. In any case, the procedure

based on posteriors can be very easily forced to work, if needed, as the procedure based on

likelihoods: Simply tune the threshold on the right-hand side of Eq. (3.2) during the training

phase so that some value of specificity is guaranteed.
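One possible way to tune that threshold is sketched below, assuming the decision statistic (for instance the posterior ratio of Eq. (3.2)) has been evaluated on training cases known to be healthy ($H_0$). The helper names, the sample values, and the empirical-quantile approach are assumptions of this example; the chapter does not prescribe a specific tuning procedure.

```python
import math


def threshold_for_specificity(scores_h0, target):
    """Pick a threshold on the decision statistic from H0 (healthy) samples.

    Deciding H1 only when score > threshold leaves at least a fraction
    `target` of the H0 scores at or below the threshold.
    """
    ordered = sorted(scores_h0)
    k = math.ceil(target * len(ordered))  # H0 samples that must be accepted
    if k == 0:
        return float("-inf")  # no specificity requirement at all
    return ordered[k - 1]


def specificity(scores_h0, threshold):
    """Fraction of H0 samples correctly kept as H0."""
    return sum(s <= threshold for s in scores_h0) / len(scores_h0)


# Hypothetical posterior ratios observed on eight healthy training cases.
ratios_h0 = [0.05, 0.40, 0.10, 0.90, 0.20, 0.30, 0.15, 0.60]
t = threshold_for_specificity(ratios_h0, 0.75)
```

The guarantee holds only on the samples used for tuning; on new data the achieved specificity is an estimate.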

3.2.3 Probability Estimation with NN

Consider a sample set $S = \{(x^k, d^k),\ k = 1, \ldots, K\}$, where $x^k \in \mathbb{R}^N$ is an observation vector and $d^k \in U_L = \{u_0, \ldots, u_{L-1}\}$ is an element in the set of possible target classes. Class $i$ label $u_i \in \mathbb{R}^L$ has all components null but the one at the $i$th row, which is unity (i.e., classes are mutually exclusive).

In order to estimate posterior probabilities in an $L$-class problem, consider the structure shown in Figure 3.1, where the soft classifier is an NN computing a nonlinear mapping $g_w : \mathcal{X} \to \mathcal{P}$, with parameters $w$, $\mathcal{X}$ being the input feature space and $\mathcal{P} = \{y \in [0, 1]^L \mid 0 \le y_i \le 1,\ \sum_{i=1}^{L} y_i = 1\}$ a probability space, such that the soft decision satisfies the probability constraints

\[
0 \le y_j \le 1, \qquad \sum_{j=1}^{L} y_j = 1 \tag{3.9}
\]

Assuming that soft decisions are posterior probability estimates, the hard decision $\hat{d}$ should be computed according to the decision criterion previously established. For instance, under the MAP criterion, the hard decision becomes a winner-takes-all (WTA) function.

Training pursues the calculation of the appropriate weights to estimate probabilities. It

is carried out by minimizing some average, based on samples in S, of a cost function C(y, d).

The first attempts to estimate a posteriori probabilities [9–12] were based on the application of the square error (SE) or Euclidean distance, that is,

\[
C_{SE}(y, d) = \sum_{i=1}^{L} (d_i - y_i)^2 = \|y - d\|^2 \tag{3.10}
\]

It is not difficult to show that, over all nonlinear mappings $y = g(x)$, the one minimizing $E\{C_{SE}(y, d)\}$ is given by $g(x) = E\{d \mid x\}$. Since $E\{d_i \mid x\} = P(H_i \mid x)$, it is clear that, by minimizing any empirical estimate of $E\{C_{SE}\}$ based on samples in $S$, estimates of posterior probabilities are obtained.

Also, the cross entropy (CE), given by

\[
C_{CE}(y, d) = -\sum_{i=1}^{L} d_i \log y_i \tag{3.11}
\]

is known to provide class posterior probability estimates [13–15]. In fact, there is an infinite number of cost functions satisfying this property: Extending previous results from other authors, Miller et al. [9] derived the necessary and sufficient conditions for a cost function to be minimized by the posterior probabilities, providing a closed-form expression for these functions in binary problems. The analysis has been extended to the multiclass case in [16–18], where it is shown that any cost function that provides posterior probability estimates, generically called strict sense Bayesian (SSB), can be written in the form

\[
C(y, d) = h(y) + (d - y)^T \nabla_y h(y) \tag{3.12}
\]

where $h(y)$ is any strictly concave function in $\mathcal{P}$ which can be interpreted as an entropy measure. As a matter of fact, this expression is also a sufficient condition: Any cost function expressed in this way provides probability estimates. In particular, if $h$ is the Shannon entropy, $h(y) = -\sum_i y_i \log y_i$, then Eq. (3.12) becomes the CE, so it can be easily proved by the reader that both CE and SE cost functions are SSB.
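The closing remark can be checked numerically. The sketch below evaluates Eq. (3.12) with the Shannon entropy and with $h(y) = 1 - \|y\|^2$ and compares the results with Eqs. (3.11) and (3.10). The second choice of $h$ is an assumption made for this illustration (it is one function that reproduces the SE for one-hot labels), and all function names are hypothetical.

```python
import math


def ssb_cost(y, d, h, grad_h):
    """Eq. (3.12): C(y, d) = h(y) + (d - y)^T grad h(y)."""
    g = grad_h(y)
    return h(y) + sum((di - yi) * gi for di, yi, gi in zip(d, y, g))


def shannon(y):
    return -sum(yi * math.log(yi) for yi in y)


def grad_shannon(y):
    return [-math.log(yi) - 1.0 for yi in y]


def quad(y):
    # Assumed choice for this example: h(y) = 1 - ||y||^2
    return 1.0 - sum(yi * yi for yi in y)


def grad_quad(y):
    return [-2.0 * yi for yi in y]


def cross_entropy(y, d):
    return -sum(di * math.log(yi) for di, yi in zip(d, y))


def square_error(y, d):
    return sum((di - yi) ** 2 for di, yi in zip(d, y))


y = [0.2, 0.5, 0.3]   # soft decision: strictly positive, sums to 1
d = [0.0, 1.0, 0.0]   # one-hot class label
ce_from_ssb = ssb_cost(y, d, shannon, grad_shannon)
se_from_ssb = ssb_cost(y, d, quad, grad_quad)
```

Both identities rely on $y$ and $d$ summing to 1 and, for the SE, on $d$ being one-hot, so the check is only meaningful for inputs that satisfy Eq. (3.9).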

[Figure 3.1 Soft-decision multiple output and WTA network. Diagram: input x feeds the soft classifier, whose output y feeds the WTA network, which outputs the hard decision.]


In order to choose an SSB cost function to estimate posterior probabilities, one may

wonder what is the best option for a given problem. Although the answer is not clear,

because there is not yet any published comparative analysis among general SSB cost functions,

there is some (theoretical and empirical) evidence that learning algorithms based on

CE tend to show better performance than those based on SE [10, 19, 20]. For this reason,

the CE has been used in our simulations shown later.

3.3 POSTERIOR PROBABILITY ESTIMATION BASED ON GSP NETWORKS

This section discusses the problem of estimating posterior probability maps in multiple-

hypotheses classification with the particular type of architecture that has been used in

the experiments: the GSP. First, we present a functional description of the network.

Second, we discuss training algorithms for parameter estimation and, lastly, we discuss

the model selection problem in GSP networks.

3.3.1 Neural Architecture

Consider the neural architecture that, for input feature vector $x$, produces output $y_i$, given by

\[
y_i = \sum_{j=1}^{M_i} y_{ij} \tag{3.13}
\]

where $y_{ij}$ are the outputs of a softmax nonlinear activation function given by

\[
y_{ij} = \frac{\exp(o_{ij})}{\sum_{k=1}^{L} \sum_{m=1}^{M_k} \exp(o_{km})}, \quad i = 1, 2, \ldots, L, \quad j = 1, 2, \ldots, M_i \tag{3.14}
\]

and $o_{ij} = w_{ij}^T x + b_{ij}$, where $w_{ij}$ and $b_{ij}$ are the weight vectors and biases, respectively, $L$ is the number of classes, and $M_i$ is the number of softmax outputs aggregated to compute $y_i$.²

The network structure is illustrated in Figure 3.2. In [16, 17, 21], this network is

named the generalized softmax perceptron (GSP). It is easy to understand that Eqs.

(3.13) and (3.14) ensure that outputs yi satisfy probability constraints given in (3.9).
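A forward pass through Eqs. (3.13) and (3.14) can be sketched as follows. The nested-list layout (`weights[i][j]` for $w_{ij}$, `biases[i][j]` for $b_{ij}$) and the function name `gsp_forward` are assumptions of this example; subtracting the largest activation before exponentiating is a standard numerical precaution that is not mentioned in the text and does not change the result.

```python
import math


def gsp_forward(x, weights, biases):
    """Forward pass of a GSP.

    weights[i][j] is the weight vector w_ij of subclass j in class i and
    biases[i][j] the bias b_ij. Returns (y, y_sub), where y_sub[i][j]
    follows Eq. (3.14) and y[i] follows Eq. (3.13).
    """
    # Linear activations o_ij = w_ij^T x + b_ij
    o = [[sum(w_n * x_n for w_n, x_n in zip(w, x)) + b
          for w, b in zip(w_row, b_row)]
         for w_row, b_row in zip(weights, biases)]
    # Shifting all activations by the same constant leaves the softmax
    # unchanged and avoids overflow in exp().
    o_max = max(max(row) for row in o)
    e = [[math.exp(v - o_max) for v in row] for row in o]
    total = sum(sum(row) for row in e)
    y_sub = [[v / total for v in row] for row in e]
    y = [sum(row) for row in y_sub]
    return y, y_sub


# Two classes: class 0 has one subclass, class 1 has two.
W = [[[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]
B = [[0.0], [0.0, 0.0]]
y, y_sub = gsp_forward([1.0, 2.0], W, B)
```

Because every subclass output is a term of one common softmax, the class outputs are nonnegative and sum to 1 by construction, whatever the number of subclasses per class.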

Since we are interested in the application of MCC detection, we have only considered

binary networks ($L = 2$) in the simulations. In spite of this, we discuss here the

general multiclass case.

The GSP is a universal posterior probability network, in the sense that, for values of

Mi sufficiently large, it can approximate any probability map with arbitrary precision. This

fact can be easily proved by noting that the GSP can compute exact posterior probabilities

when data come from a mixture model based on Gaussian kernels with the same variance

matrix. Since a Gaussian mixture can approximate any density function, the GSP can

approximate any posterior probability map.

In order to estimate class probabilities from samples in training set $S$, an SSB cost function and a search method must be selected. In particular, the stochastic gradient learning rules to minimize the CE, Eq. (3.11), are given by

\[
w_{ij}^{k+1} = w_{ij}^{k} + \rho^k\,\frac{y_{ij}^k}{y_i^k}\,(d_i^k - y_i^k)\,x^k \tag{3.15}
\]

\[
b_{ij}^{k+1} = b_{ij}^{k} + \rho^k\,\frac{y_{ij}^k}{y_i^k}\,(d_i^k - y_i^k), \quad 1 \le i \le L, \quad 1 \le j \le M_i \tag{3.16}
\]

where $w_{ij}^k$ and $b_{ij}^k$ are the weight vectors and biases, respectively, $y_{ij}^k$ are the softmax outputs for input $x^k$, $d_i^k$ are the labels for sample $x^k$, and $\rho^k$ is the learning step at iteration $k$.

² Note that, when the number of inputs is equal to 2, the softmax is equivalent to a pair of sigmoidal activation functions.
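A single stochastic update of this kind can be sketched in Python as below. The update is written as gradient descent on the CE, so every weight vector moves along $(y_{ij}/y_i)(d_i - y_i)\,x$. The data layout and the names `gsp_forward`, `sgd_step`, and `cross_entropy` are assumptions of this example, and the numbers are arbitrary.

```python
import math


def gsp_forward(x, W, B):
    """Class outputs y (Eq. 3.13) and subclass outputs y_sub (Eq. 3.14)."""
    o = [[sum(wn * xn for wn, xn in zip(w, x)) + b for w, b in zip(wr, br)]
         for wr, br in zip(W, B)]
    top = max(max(r) for r in o)
    e = [[math.exp(v - top) for v in r] for r in o]
    s = sum(sum(r) for r in e)
    y_sub = [[v / s for v in r] for r in e]
    return [sum(r) for r in y_sub], y_sub


def sgd_step(x, d, W, B, rho):
    """Update all w_ij and b_ij in place for one sample (x, d)."""
    y, y_sub = gsp_forward(x, W, B)
    for i in range(len(W)):
        for j in range(len(W[i])):
            # Common factor (y_ij / y_i)(d_i - y_i) of the two rules
            g = y_sub[i][j] / y[i] * (d[i] - y[i])
            W[i][j] = [wn + rho * g * xn for wn, xn in zip(W[i][j], x)]
            B[i][j] += rho * g


def cross_entropy(y, d):
    return -sum(di * math.log(yi) for di, yi in zip(d, y))


W = [[[0.1, 0.2]], [[-0.3, 0.4], [0.2, -0.1]]]
B = [[0.0], [0.1, -0.1]]
x, d = [1.0, -0.5], [1.0, 0.0]
ce_before = cross_entropy(gsp_forward(x, W, B)[0], d)
sgd_step(x, d, W, B, rho=0.1)
ce_after = cross_entropy(gsp_forward(x, W, B)[0], d)
```

For a sufficiently small step, the CE of the presented sample decreases after the update, which is what the comparison of `ce_before` and `ce_after` is meant to show.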

In the following section we derive these rules from a different perspective, based on

maximum-likelihood (ML) estimation, which provides an interpretation for the softmax

outputs and some insight into the behavior of the GSP network and which suggests

some ways to design other estimation algorithms.

3.3.2 Probability Model Based on GSP

Since the GSP outputs satisfy the probability constraints, we can say that any GSP implements a multinomial probability model of the form

\[
P(d \mid W, x) = \prod_{i=1}^{L} (y_i)^{d_i} \tag{3.17}
\]

where matrix $W$ encompasses all GSP parameters (weight vectors and biases) and $y_i$ is given by Eqs. (3.13) and (3.14). According to this, we arrive at $y_i = P(u_i \mid W, x)$. This probabilistic interpretation of the network outputs is useful for training: The model parameters that best fit the data in $S$ can be estimated by ML as

\[
\hat{W} = \arg\max_{W}\ l(S, W) \tag{3.18}
\]

where

\[
l(S, W) = \sum_{k=1}^{K} \log P(d^k \mid W, x^k) \tag{3.19}
\]

Following an analysis similar to that of Jordan and Jacobs [1] for HME, this maximization can be done iteratively by means of the expectation maximization (EM) algorithm. To do so, let us assume that each class is partitioned into several subclasses. Let $M_i$ be the number of subclasses inside class $i$. The subclass for a given sample $x$ will be represented by a subclass label vector $z$ with components $z_{ij} \in \{0, 1\}$, $i = 1, \ldots, L$, $j = 1, \ldots, M_i$, such that one and only one of them is equal to 1 and

\[
d_i = \sum_{j=1}^{M_i} z_{ij}, \quad 1 \le i \le L \tag{3.20}
\]

[Figure 3.2 The GSP neural architecture. Labels in the diagram: input x; weight vectors w00, w01, w10, w11, w12, w20; softmax; outputs y0, y1, y2.]

Also, let us assume that the joint probability model of $d$ and $z$ is given by

\[
P(d, z \mid W, x) = I_{d,z} \prod_{i=1}^{L} \prod_{j=1}^{M_i} (y_{ij})^{z_{ij}} \tag{3.21}
\]

where $I_{d,z}$ is an indicator function equal to 1 if $d$ and $z$ satisfy Eq. (3.20) and equal to 0 otherwise.

Note that the joint model in Eq. (3.21) is consistent with that in Eq. (3.17), in the sense that the latter results from the former by marginalization,

\[
\sum_{z} P(d, z \mid W, x) = P(d \mid W, x) \tag{3.22}
\]

Since there is no information about subclasses in $S$, learning can be supervised at the class level but must be unsupervised at the subclass level. The EM algorithm based on hidden variables $z^k$ to maximize the log-likelihood $l(S, W)$ in (3.19) provides us with a way to proceed. Consider the complete data set $S_c = \{(x^k, d^k, z^k),\ k = 1, \ldots, K\}$, which extends $S$ by including subclass labels. According to Eq. (3.21), the complete data log-likelihood is

\[
l_c(S_c, W) = \sum_{k=1}^{K} \log P(d^k, z^k \mid W, x^k) = \sum_{k=1}^{K} \sum_{i=1}^{L} \sum_{j=1}^{M_i} z_{ij}^k \log y_{ij}^k \tag{3.23}
\]

The EM algorithm for ML estimation of the GSP posterior model proceeds, at iteration $t$, in two steps:

1. E-step: Compute $Q(W, W^t) = E\{l_c(S_c, W) \mid S, W^t\}$.

2. M-step: Compute $W^{t+1} = \arg\max_W Q(W, W^t)$.

In order to apply the E-step to $l_c$, note that, given $S$ and assuming that the true parameters are $W^t$, the only unknown components in $l_c$ are the hidden variables. Therefore,

\[
Q(W, W^t) = \sum_{k=1}^{K} \sum_{i=1}^{L} \sum_{j=1}^{M_i} E\{z_{ij}^k \mid S, W^t\} \log y_{ij}^k \tag{3.24}
\]

Let us use the compact notation $y_{ij}^{kt}$ to refer to the softmax output (relative to class $i$ and subclass $j$) for input $x^k$ at iteration $t$ (and the same for $z_{ij}^{kt}$ and $y_i^{kt}$). Then

\[
E\{z_{ij}^k \mid S, W^t\} = E\{z_{ij}^k \mid x^k, d^k, W^t\} = d_i^k\,\frac{y_{ij}^{kt}}{y_i^{kt}} \tag{3.25}
\]

Substituting (3.25) in (3.24), we get

\[
Q(W, W^t) = \sum_{k=1}^{K} \sum_{i=1}^{L} d_i^k \sum_{j=1}^{M_i} \frac{y_{ij}^{kt}}{y_i^{kt}} \log y_{ij}^k \tag{3.26}
\]
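The expectation in Eq. (3.25) is simple to compute once the subclass outputs are available. The sketch below does so for one sample; the function name `e_step_responsibilities` and the nested-list layout are assumptions of this example.

```python
def e_step_responsibilities(y_sub, d):
    """Eq. (3.25) for one sample: E{z_ij | x, d, W^t} = d_i * y_ij / y_i.

    y_sub[i][j] is the softmax output of subclass j in class i and d is the
    one-hot class label of the sample.
    """
    resp = []
    for y_row, d_i in zip(y_sub, d):
        y_i = sum(y_row)  # class output, Eq. (3.13)
        resp.append([d_i * y_ij / y_i for y_ij in y_row])
    return resp


# Class 0 (two subclasses) is the true class; class 1 has one subclass.
r = e_step_responsibilities([[0.1, 0.3], [0.6]], [1, 0])
```

Subclasses of the wrong class receive zero weight, and the weights within the true class sum to 1, so the class label is spread over its subclasses in proportion to their current outputs.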


Therefore, the M-step reduces to

\[
W^{t+1} = \arg\max_{W} \sum_{k=1}^{K} \sum_{i=1}^{L} d_i^k \sum_{j=1}^{M_i} \frac{y_{ij}^{kt}}{y_i^{kt}} \log y_{ij}^k \tag{3.27}
\]

Note that, during the M-step, only $y_{ij}^k$ depends on $W$. Maximization can be done in different ways. For the HME, the iteratively reweighted least squares algorithm [1] or the Newton–Raphson algorithm [22] has been explored. A simpler solution (that is also suggested in [1]) consists of replacing the M-step by a single iteration of a gradient search rule,

\[
W^{t+1} = W^t + \rho^t \sum_{k=1}^{K} \sum_{i=1}^{L} d_i^k \sum_{j=1}^{M_i} \frac{1}{y_i^{kt}}\,\nabla_W y_{ij}^k \Big|_{W = W^t} \tag{3.28}
\]

which leads to

\[
\begin{aligned}
w_{mn}^{t+1} &= w_{mn}^t + \rho^t \sum_{k=1}^{K} \sum_{i=1}^{L} d_i^k \sum_{j=1}^{M_i} \frac{y_{ij}^{kt}}{y_i^{kt}}\,(\delta_{m-i}\,\delta_{n-j} - y_{mn}^{kt})\,x^k \\
&= w_{mn}^t + \rho^t \sum_{k=1}^{K} \frac{y_{mn}^{kt}}{y_m^{kt}}\,(d_m^k - y_m^{kt})\,x^k
\end{aligned} \tag{3.29}
\]

\[
b_{mn}^{t+1} = b_{mn}^t + \rho^t \sum_{k=1}^{K} \frac{y_{mn}^{kt}}{y_m^{kt}}\,(d_m^k - y_m^{kt}) \tag{3.30}
\]

where $\rho^t$ is the learning step. A further simplification can be done by replacing the gradient search rule by an incremental version. In doing so, rules (3.15) and (3.16) result.
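As a sanity check on the gradient search rule, the bias direction $\sum_k (y_{mn}/y_m)(d_m - y_m)$ can be compared with a finite-difference derivative of the log-likelihood of Eq. (3.19). The sketch below does this for one bias; the data, the parameter values, and all function names are assumptions made for the illustration.

```python
import math


def gsp_forward(x, W, B):
    """Class outputs y (Eq. 3.13) and subclass outputs y_sub (Eq. 3.14)."""
    o = [[sum(wn * xn for wn, xn in zip(w, x)) + b for w, b in zip(wr, br)]
         for wr, br in zip(W, B)]
    top = max(max(r) for r in o)
    e = [[math.exp(v - top) for v in r] for r in o]
    s = sum(sum(r) for r in e)
    y_sub = [[v / s for v in r] for r in e]
    return [sum(r) for r in y_sub], y_sub


def log_likelihood(samples, W, B):
    """l(S, W) of Eq. (3.19) for one-hot labels d."""
    return sum(math.log(y_i)
               for x, d in samples
               for y_i, d_i in zip(gsp_forward(x, W, B)[0], d) if d_i == 1)


def bias_direction(samples, W, B, m, n):
    """Sum over samples of (y_mn / y_m)(d_m - y_m)."""
    g = 0.0
    for x, d in samples:
        y, y_sub = gsp_forward(x, W, B)
        g += y_sub[m][n] / y[m] * (d[m] - y[m])
    return g


def numeric_bias_derivative(samples, W, B, m, n, eps=1e-6):
    """Central finite difference of l(S, W) with respect to b_mn."""
    original = B[m][n]
    B[m][n] = original + eps
    up = log_likelihood(samples, W, B)
    B[m][n] = original - eps
    down = log_likelihood(samples, W, B)
    B[m][n] = original  # restore the parameter
    return (up - down) / (2.0 * eps)


W = [[[0.1, 0.2]], [[-0.3, 0.4], [0.2, -0.1]]]
B = [[0.0], [0.1, -0.1]]
S = [([1.0, -0.5], [1, 0]), ([0.3, 0.8], [0, 1]), ([-1.0, 0.2], [0, 1])]
analytic = bias_direction(S, W, B, 1, 0)
numeric = numeric_bias_derivative(S, W, B, 1, 0)
```

The two values agree to within the finite-difference error, which indicates that moving the parameters along this direction with a small positive step increases the log-likelihood.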

3.3.3 A Posteriori Probability Model Selection Algorithm

While the number of classes is fixed and assumed known, the number of subclasses inside

each class is unknown and must be estimated from samples during training: It is a well-

known fact that a high number of subclasses may lead to data overfitting, a situation in

which the cost averaged over the training set is small but the cost averaged over a test

set with new samples is high.

The problem of determining the optimal network size, also known as the model

selection problem, is well known and, in general, difficult to solve; see [16, 17, 23].

The selected architecture must find a balance between the approximation power of

large networks and the usually higher generalization capabilities of small networks.

A review of model selection algorithms is beyond the scope of this chapter. The

algorithm we propose to determine the GSP configuration, which has been called the

a posteriori probability model selection (PPMS) algorithm [16, 17], belongs to the family

of growing and pruning algorithms [24]: Starting from a pre-defined architecture,

subclasses are added to or removed from the network during learning according to

needs. The PPMS algorithm determines the number of subclasses by seeking a balance

between generalization capability and learning toward minimal output errors.

Although PPMS could be easily extended to other networks, like the HME [1], the

formulation presented here assumes a GSP architecture. The PPMS algorithm combines

pruning, splitting, and merging operations in a similar way to other algorithms proposed

in the literature, mainly for estimating Gaussian mixtures (see [25] or [26] for instance). In


particular, PPMS can be related to the model selection algorithm proposed in [25], which

automatically grows and prunes HMEs [1, 27], obtaining

better generalization performance than traditional static and balanced hierarchies. We

observe several similarities between their procedure and the algorithm proposed here

because they also compute posterior probabilities in such a way that a path is pruned if

the instantaneous node probability of activation falls below a certain threshold.

The fundamental idea behind PPMS is the following: According to the GSP structure and its underlying probability model, the posterior probability of each class is a sum of the subclass probabilities. The importance of a subclass in the sum can be measured by means of its prior probability,

\[
P_{ij}^t = P\{z_{ij} = 1 \mid W^t\} = \int P\{z_{ij} = 1 \mid W^t, x\}\,p(x)\,dx = \int y_{ij}\,p(x)\,dx \tag{3.31}
\]

which can be approximated using samples in $S$ as³

\[
\hat{P}_{ij}^t = \frac{1}{K} \sum_{k=1}^{K} y_{ij}^{kt} \tag{3.32}
\]

A small value of $\hat{P}_{ij}^t$ is an indicator that only a few samples are being captured by subclass $j$ in class $i$, so its elimination should not affect the whole network performance in a significant way, at least in average terms. On the contrary, a high subclass prior probability may indicate that the subclass captures too many input samples. The PPMS algorithm explores this hypothesis by dividing the subclass into two halves in order to represent this data distribution more accurately. Finally, it has also been observed that, under some circumstances, certain weights in different subclasses of the same class tend to follow a very similar time evolution to other weights within the same class. This fact suggests a new action: merging similar subclasses into a unique subclass.

The PPMS algorithm implements these ideas via three actions, called prune, split, and merge, as follows:

1. Prune. Remove a subclass by eliminating its weight vector if its a priori probability estimate is below a certain pruning threshold $\mu_{prune}$. That is,

\[
\hat{P}_{ij}^t < \mu_{prune} \tag{3.33}
\]

2. Split. Add a new subclass by splitting in two a subclass whose prior probability estimate is greater than the split threshold $\mu_{split}$. That is,

\[
\hat{P}_{ij}^t > \mu_{split} \tag{3.34}
\]

Splitting subclass $(i, j)$ is done by removing weight vector $w_{ij}^t$ and constructing a pair of new weight vectors $w_{ij}^{t+1}$ and $w_{ij'}^{t+1}$ such that, at least initially, the posterior probability map is approximately the same:

\[
\begin{aligned}
w_{ij}^{t+1} &= w_{ij}^t + \Delta \\
w_{ij'}^{t+1} &= w_{ij}^t - \Delta \\
b_{ij}^{t+1} &= b_{ij}^t - \log 2 \\
b_{ij'}^{t+1} &= b_{ij}^t - \log 2
\end{aligned} \tag{3.35}
\]

It is not difficult to show that, for $\Delta = 0$, probability maps $P(d \mid x, W^t)$ and $P(d \mid x, W^{t+1})$ are identical. In order that the new weight vectors can evolve differently over time, using a small nonzero value of $\Delta$ is advisable. The $\log 2$ constant has a halving effect.

3. Merge. Mix or fuse two subclasses into a single one. That is, the respective weight vectors of both subclasses are fused into a single one if they are close enough. Subclasses $j$ and $j'$ in class $i$ are merged if $D(w_{ij}, w_{ij'}) < \mu_{merge}$, $\forall i$ and $\forall j, j'$, where $D$ represents a distance measure and $\mu_{merge}$ is the merging threshold. In our simulations, $D$ was taken to be the Euclidean distance. After merging subclasses $j$ and $j'$, a new weight vector $w_{ij}^{t+1}$ and bias $b_{ij}^{t+1}$ are constructed according to

\[
\begin{aligned}
w_{ij}^{t+1} &= \tfrac{1}{2}\,(w_{ij}^t + w_{ij'}^t) \\
b_{ij}^{t+1} &= \log[\exp(b_{ij}^t) + \exp(b_{ij'}^t)]
\end{aligned} \tag{3.36}
\]

³ Since $W^t$ is computed based on samples in $S$, prior probability estimates based on the same sample set are biased. Therefore, priors should be estimated by averaging from a validation set different from $S$. In spite of this, in order to reduce the data demand, our simulations are based on only one sample set.
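The three actions can be sketched as operations on the parameter lists of a GSP. The code below is a minimal illustration, assuming the weights are stored as `W[i][j]` and the biases as `B[i][j]`; the function names are hypothetical, the threshold tests (3.33), (3.34) and the distance test for merging are left to the caller, and the check at the end only covers the $\Delta = 0$ case discussed in the text.

```python
import math


def class_outputs(x, W, B):
    """Class outputs y_i of Eqs. (3.13)-(3.14)."""
    o = [[sum(wn * xn for wn, xn in zip(w, x)) + b for w, b in zip(wr, br)]
         for wr, br in zip(W, B)]
    top = max(max(r) for r in o)
    e = [[math.exp(v - top) for v in r] for r in o]
    s = sum(sum(r) for r in e)
    return [sum(r) / s for r in e]


def prune_subclass(W, B, i, j):
    """Remove subclass j of class i (to be called when Eq. 3.33 holds)."""
    del W[i][j]
    del B[i][j]


def split_subclass(W, B, i, j, delta):
    """Eq. (3.35): replace subclass (i, j) by two copies shifted by +/- delta."""
    w, b = W[i][j], B[i][j]
    W[i][j] = [wn + dn for wn, dn in zip(w, delta)]
    W[i].append([wn - dn for wn, dn in zip(w, delta)])  # new subclass j'
    B[i][j] = b - math.log(2.0)
    B[i].append(b - math.log(2.0))


def merge_subclasses(W, B, i, j, j2):
    """Eq. (3.36): fuse subclasses j and j2 of class i (assumes j < j2)."""
    W[i][j] = [0.5 * (a + c) for a, c in zip(W[i][j], W[i][j2])]
    B[i][j] = math.log(math.exp(B[i][j]) + math.exp(B[i][j2]))
    del W[i][j2]
    del B[i][j2]


W = [[[0.1, 0.2]], [[-0.3, 0.4]]]
B = [[0.0], [0.1]]
x = [1.0, -0.5]
y_before = class_outputs(x, W, B)
split_subclass(W, B, 1, 0, delta=[0.0, 0.0])
y_split = class_outputs(x, W, B)
n_after_split = len(W[1])
merge_subclasses(W, B, 1, 0, 1)
y_merged = class_outputs(x, W, B)
```

With a zero perturbation the split leaves the class outputs unchanged, and merging the two identical halves recovers the original bias, which matches the halving role of the $\log 2$ term.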

3.3.4 Implementation of PPMS

An iteration of PPMS can be implemented after each M-step during learning or after

several iterations of the EM algorithm.

In our simulations, we have explored the implementation of PPMS after each M-step, where the M-step is reduced to a single iteration of rules (3.15) and (3.16). Also, to reduce computational demands, prior probabilities are not estimated using Eq. (3.32) but are updated iteratively as $\hat{P}_{ij}^{t+1} = (1 - \alpha^t)\,\hat{P}_{ij}^{t} + \alpha^t\,y_{ij}^t$, where $0 \le \alpha^t \le 1$. Note that, if prior probabilities are initially nonzero and sum to 1, the updating rule preserves the satisfaction of these probability constraints.

After each application of PPMS, the prior estimates must also be updated. This is done as follows:

1. Pruning. When a subclass is removed, the a priori probability estimates of the remaining subclasses do not sum to 1. This inconsistency can be resolved by redistributing the prior estimate of the removed subclass proportionally among the rest of the subclasses. In particular, given that condition (3.33) is true, the new prior probability estimates $\hat{P}_{ij'}^{t+1}$ are computed from $\hat{P}_{ij'}^{t}$, after pruning subclass $j$ of class $i$, as follows:

\[
\hat{P}_{ij'}^{t+1} = \hat{P}_{ij'}^{t}\,\frac{\sum_{n=1}^{M_i} \hat{P}_{in}^{t}}{\sum_{n=1,\,n \ne j}^{M_i} \hat{P}_{in}^{t}}, \quad 1 \le i \le L, \quad 1 \le j' \le M_i, \quad j' \ne j \tag{3.37}
\]

2. Splitting. The a priori probability of the split subclass is also divided into two equal parts that will go to each of the descendants. That is, if subclass $j$ in class $i$ is split into subclasses $j$ and $j'$, prior estimates are assigned to the new subclasses as

\[
\hat{P}_{ij}^{t+1} = \tfrac{1}{2}\,\hat{P}_{ij}^{t}, \qquad \hat{P}_{ij'}^{t+1} = \tfrac{1}{2}\,\hat{P}_{ij}^{t} \tag{3.38}
\]

3. Merging. Probability vectors are modified in accordance with probabilistic laws; see (3.9). Formally speaking, if subclasses $j$ and $j'$ in class $i$ are merged into subclass $j$,

\[
\hat{P}_{ij}^{t+1} = \hat{P}_{ij}^{t} + \hat{P}_{ij'}^{t} \tag{3.39}
\]
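The bookkeeping of the prior estimates can be sketched as follows, together with the running update used in place of Eq. (3.32). The list layout `P[i][j]` and the function names are assumptions of this example, and `prune_prior` presumes that the class keeps at least one subclass after pruning.

```python
def ema_prior(P, y_sub, alpha):
    """Running update: (1 - alpha) * P_ij + alpha * y_ij for every subclass."""
    return [[(1.0 - alpha) * p + alpha * y for p, y in zip(p_row, y_row)]
            for p_row, y_row in zip(P, y_sub)]


def prune_prior(P, i, j):
    """Eq. (3.37): drop subclass j of class i and rescale its siblings."""
    class_total = sum(P[i])
    remaining = class_total - P[i][j]  # must be positive
    P[i] = [p * class_total / remaining
            for n, p in enumerate(P[i]) if n != j]


def split_prior(P, i, j):
    """Eq. (3.38): both descendants receive half of the parent's prior."""
    half = 0.5 * P[i][j]
    P[i][j] = half
    P[i].append(half)


def merge_prior(P, i, j, j2):
    """Eq. (3.39): the merged subclass keeps the sum of both priors."""
    P[i][j] += P[i][j2]
    del P[i][j2]


P = [[0.1, 0.3], [0.6]]
prune_prior(P, 0, 0)       # class 0 keeps its total mass of 0.4
split_prior(P, 1, 0)       # class 1 becomes two subclasses of 0.3 each
after_split = [row[:] for row in P]
merge_prior(P, 1, 0, 1)    # and is merged back into a single subclass
P_next = ema_prior(P, [[0.7], [0.3]], alpha=0.1)
```

Each operation keeps the class totals, and therefore the overall sum of 1, unchanged; the running update does the same as long as the subclass outputs it receives sum to 1.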

3.4 MICROCALCIFICATION DETECTION SYSTEM

3.4.1 Background Notes in Microcalcification Automatic Detection

Breast cancer is a major public health problem. It is known that about 40–75 new cases per

100,000 women are diagnosed each year in Spain [28]. Similar quantities have been

observed in the United States; see [4].

Mammography is the current clinical procedure followed for the early detection of cancer, but ultrasound imaging of the breast is gaining importance nowadays. Radiographic findings in breast cancer can be divided into two categories: MCCs and masses [4]. Regarding the former, debris and necrotic cells tend to calcify, so they can be detected by radiography. It has been shown that between 30 and 50% of breast carcinomas detected radiographically showed clustered MCCs, while 80% revealed MCCs after microscopic examination. Regarding the latter, most cancers that present as masses are invasive cancers (i.e., tumors that have extended beyond the duct lumen) and are usually harder to detect due to the similarity of mass lesions with the surrounding normal parenchymal tissue.

Efforts to automatically detect either of the two major symptoms of breast cancer started in 1967 [29] and have been numerous since then. Attention has been paid to image

enhancement, feature definition and extraction, and classifier design. The reader may

want to consult [4] for an interesting introduction to these topics and follow the numerous

references included there. What we would like to point out is that, to the best of our knowl-

edge, the idea of using posteriors to build a classifier and to provide the physician with a

confidence in the decision based on posterior probabilities has not been explored so far.

This is our goal in what follows.

3.4.2 System Description

Figure 3.3 shows a general scheme of the system we propose to detect MCCs. The system is window oriented: for each window, the a posteriori probability that an MCC is present is computed. To that end, we make use of the GSP and the PPMS algorithm, introduced in Sections 3.3.1 and 3.3.3, respectively. The NN is fed with features calculated from the pixel intensities within each window. A brief description of these features is given in Section 3.4.3.

The tessellation of the image causes MCC detections to obey a window-oriented pattern; in addition, since decisions are made independently in each window, no spatial coherence is enforced, so, due to FNs, clustered MCCs may show up as a number of separate positive decisions. In order to obtain realistic MCC shapes, we have defined an MCC segmentation procedure. This is a regularization procedure consisting of two stages. In the first stage, a minimum distance is defined between detected MCCs, so that detected MCCs within this distance are considered to belong to the same cluster; an approximation to the convex hull of this cluster is then found by drawing the minimum rectangle (with sides parallel to the image borders) that encloses all the detected MCCs within the cluster. The final stage is a pixel-level decision: for each pixel within the rectangle just drawn (and not only within the detected MCCs), a decision is made as to whether or not it belongs to the cluster. This is done by testing whether the pixel intensity falls inside the range of the intensities of the detected MCCs within the cluster.
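The regularization procedure can be sketched as follows. This is only an illustration under several assumptions that the text does not fix: detected windows are identified by their top-left corner, two detections are linked when their corners differ by at most `max_gap` pixels along each axis (the value of `max_gap` is hypothetical), and the intensity range of a cluster is taken over all pixels of its detected windows.

```python
def segment_clusters(image, detections, win=8, max_gap=16):
    """Binary mask produced by the two-stage regularization.

    image: list of rows of pixel intensities.
    detections: (row, col) top-left corners of windows flagged as MCC.
    max_gap: hypothetical linking distance, in pixels, between detections.
    """
    height, width = len(image), len(image[0])
    # Stage 1a: group nearby detections (single linkage).
    clusters, unvisited = [], set(detections)
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            r0, c0 = frontier.pop()
            near = [d for d in unvisited
                    if max(abs(d[0] - r0), abs(d[1] - c0)) <= max_gap]
            for d in near:
                unvisited.remove(d)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(cluster)

    mask = [[0] * width for _ in range(height)]
    for cluster in clusters:
        # Stage 1b: minimum axis-parallel rectangle enclosing the detections.
        top = min(r for r, _ in cluster)
        left = min(c for _, c in cluster)
        bottom = min(max(r for r, _ in cluster) + win, height)
        right = min(max(c for _, c in cluster) + win, width)
        # Intensity range observed inside the detected windows of the cluster.
        values = [image[r][c]
                  for r0, c0 in cluster
                  for r in range(r0, min(r0 + win, height))
                  for c in range(c0, min(c0 + win, width))]
        low, high = min(values), max(values)
        # Stage 2: pixel-level decision over the whole rectangle.
        for r in range(top, bottom):
            for c in range(left, right):
                if low <= image[r][c] <= high:
                    mask[r][c] = 1
    return mask


# Toy 4 x 8 image: two detected 2 x 2 windows and one bright pixel between them.
demo_image = [[0] * 8 for _ in range(4)]
for r in (0, 1):
    for c in (0, 1):
        demo_image[r][c] = 10
    for c in (4, 5):
        demo_image[r][c] = 12
demo_image[0][2] = 11
demo_mask = segment_clusters(demo_image, [(0, 0), (0, 4)], win=2, max_gap=4)
```

In the toy example the pixel between the two detected windows is recovered because its intensity lies within the range of the cluster, while the dark background pixels inside the rectangle are left out.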

3.4.3 Definition of Input Features

In [30], the authors define, on both the spatial and the frequency domains, a set of four

features as inputs to an NN for MCC detection. The two spatial features are the pixel inten-

sity variance and the pixel energy variance. Both are calculated as sample variances within

each processing window, and the pixel energy is defined as the square of the pixel inten-

sity. The two frequency features are the block activity and the spectral entropy. The former

is the summation of the absolute values of the coefficients of the discrete cosine transform

(DCT) of the pixels within the block (i.e., the window), with the direct-current coefficient

excluded. The latter is the classical definition of the entropy, calculated in this case on the

normalized DCT coefficients, where the normalization constant is the block activity.
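A minimal sketch of the four features for one window is given next. Several details are assumptions of this sketch, since the text does not fix them: the unbiased (n − 1) sample variance, an orthonormal type-II DCT, a base-2 logarithm for the entropy, and the exclusion of the direct-current coefficient from the entropy as well as from the block activity. The function names are hypothetical.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)
    basis = [[math.sqrt((1 if k == 0 else 2) / n)
              * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i in range(n)] for k in range(n)]
    # Apply the 1-D transform along each row, then along each column.
    rows = [[sum(basis[k][i] * row[i] for i in range(n)) for k in range(n)]
            for row in block]
    return [[sum(basis[k][i] * rows[i][l] for i in range(n)) for l in range(n)]
            for k in range(n)]


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def window_features(block):
    """Return (intensity variance, energy variance, block activity, entropy)."""
    pixels = [p for row in block for p in row]
    intensity_var = sample_variance(pixels)
    energy_var = sample_variance([p * p for p in pixels])
    coeffs = [abs(c) for row in dct2(block) for c in row]
    ac = coeffs[1:]            # coeffs[0] is the direct-current coefficient
    activity = sum(ac)
    entropy = 0.0
    if activity > 1e-9:        # flat windows carry no spectral information
        probs = [c / activity for c in ac if c > 0]
        entropy = -sum(p * math.log2(p) for p in probs)
    return intensity_var, energy_var, activity, entropy


flat_features = window_features([[5] * 8 for _ in range(8)])
spot_features = window_features([[1, 0], [0, 0]])
```

A flat window yields zero variances and a negligible block activity, whereas a window containing an intensity step or spot spreads energy over the non-DC coefficients and raises all four features.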

This set of features constitutes the four-dimensional input vector that we will use for our neural GSP architecture. Needless to say, these features search for MCC boundaries; MCC interiors are likely to be missed whenever an inspection window falls entirely within an MCC. However, the regularization procedure described in Section 3.4.2 will eliminate these losses.

3.5 RESULTS

The original mammogram database images were in the Lumiscan75 scanner format. Images had an optical spatial resolution of 150 dpi or, equivalently, about 3500 pixels/cm², which is a low resolution for today’s systems and, in the judgment of some authors, close to the lower limit of the resolution needed to detect most MCCs. We used the full 12-bit depth internally during computations. The database consisted of images of the whole breast area and others of more detailed areas specified by expert radiologists.
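For reference, the equivalence between the two resolution figures follows from 1 in = 2.54 cm:

$$\frac{150\ \text{pixels/in}}{2.54\ \text{cm/in}} \approx 59.06\ \text{pixels/cm}, \qquad (59.06\ \text{pixels/cm})^2 \approx 3488\ \text{pixels/cm}^2 \approx 3500\ \text{pixels/cm}^2 .$$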

Figure 3.3 The MCC system block diagram: an overview.


The window size was set to 8 × 8 pixels, the value that gave the best MCC detection and segmentation results after some trial-and-error testing. We defined

both a training set and a test set. All the mammogram images shown here belong to the

test set. We made a double-windowed scan of the original mammograms with half-

window interleave in the detection process in order to increase the detection capacity of

the system.
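One possible reading of the double-windowed scan is sketched below: a first pass over the regular grid of non-overlapping windows and a second pass over the same grid shifted by half a window. The shift along both image axes and the function names are assumptions of this sketch.

```python
def window_origins(height, width, win=8, offset=0):
    """Top-left corners of the non-overlapping win x win windows of one pass."""
    return [(r, c)
            for r in range(offset, height - win + 1, win)
            for c in range(offset, width - win + 1, win)]


def double_scan(height, width, win=8):
    """Two passes: the regular grid and a grid shifted by half a window."""
    first = window_origins(height, width, win)
    second = window_origins(height, width, win, offset=win // 2)
    return first, second


# Hypothetical 32 x 32 pixel region scanned with 8 x 8 windows.
first_pass, second_pass = double_scan(32, 32)
```

With this layout, an MCC that straddles the border between two windows of the first pass falls near the center of a window of the second pass.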

During simulations, the number of subclasses per class and the weight matrix W of the GSP NN were randomly initialized; the learning step $\rho_t$ was decreased to progressively freeze learning as the algorithm evolved. The PPMS parameters were determined empirically by trial and error (no critical dependence on these values was observed, though), and the initial values were $\mu^{\mathrm{prune}}_0 = 0.025$, $\mu^{\mathrm{split}}_0 = 0.250$, and $\mu^{\mathrm{merge}}_0 = 0.1$. After applying PPMS we obtained a GSP network with one subclass for the class MCC-present and seven subclasses for the class MCC-absent.
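To make the resulting structure concrete, the sketch below shows how class posteriors could be read off a network with one subclass for MCC-present and seven for MCC-absent. It assumes that the GSP applies a softmax over all subclass outputs and adds the subclass outputs within each class; the weight layout, the absence of bias terms, and the function name are hypothetical.

```python
import math


def gsp_posteriors(weights, x):
    """Class posteriors of a GSP-like network.

    weights[i][j] is the weight vector of subclass j of class i (assumed layout).
    x is the feature vector of one inspection window.
    """
    scores = [[sum(w * v for w, v in zip(w_ij, x)) for w_ij in w_i]
              for w_i in weights]
    peak = max(s for row in scores for s in row)   # subtracted for stability
    expo = [[math.exp(s - peak) for s in row] for row in scores]
    norm = sum(e for row in expo for e in row)
    # Each class posterior is the sum of the softmax outputs of its subclasses.
    return [sum(row) / norm for row in expo]


# One subclass for MCC-present, seven for MCC-absent, four input features.
zero_weights = [[[0.0] * 4], [[0.0] * 4 for _ in range(7)]]
posteriors = gsp_posteriors(zero_weights, [1.0, 2.0, 3.0, 4.0])
```

With all weights set to zero every subclass receives the same softmax output, so the class posteriors are simply proportional to the number of subclasses per class (1/8 and 7/8 here); training moves them away from this uninformative starting point.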

In Figure 3.4 a number of regions of interest (ROIs) have been drawn by an expert;

the result of the algorithm is shown on the right, in which detected MCCs have been

marked in white on the original mammogram background. Figure 3.5 shows details of

the ROI in the bottom-right corner of Figure 3.4; Figure 3.5a shows the original mammo-

gram, Figure 3.5b shows the detected MCCs in white, and Figure 3.5c shows the posterior

probabilities of a single pass indicated in each inspection window. As is clear from the

image, if an inspection window fits within an MCC, the posterior probability is fairly

small. However, this effect is removed by subsequent regularization.

Figure 3.4 (a) Original mammogram with four ROIs specified by the expert radiologist; (b) segmented MCCs on the original image (fat necrosis case).


Figure 3.6 shows the results of the algorithm on a second example. In this case a

mammogram with a carcinoma is shown (original image in Fig. 3.6a); the results of the

first pass of the algorithm are shown in Figures 3.6b,c. Figure 3.6b shows the detected

MCCs superimposed on the original image and Figure 3.6c shows the posterior probabil-

ities on each inspection window. Figures 3.6d,e show the result of the second pass of the

algorithm with the same ordering as in the first pass. In this example, the sparsity of the MCCs means that no inspection window lies entirely within an MCC, so in those inspection windows where MCCs are present, large variance values are observed, leading to high posterior probabilities. Table 3.1 summarizes the performance of the proposed MCC detection system by indicating both the FP and FN rates in our test set,⁴ both including and excluding the borders of the mammogram images.⁵ As a rough comparison, we have taken the results reported in [30] as a reference. The comparison, however, is not totally conclusive since the test images used in this chapter and in [30] do not coincide. We have used this paper as a benchmark since the features used in our

Figure 3.5 Details of ROI on bottom-left corner of Figure 3.4: (a) original portion of mammogram; (b) segmented output (MCC in white on the original mammogram); (c) posterior probabilities of a single pass of the algorithm on each inspection window (fat necrosis case).

⁴ When computing FP and FN in Table 3.1, we have considered the total number of windows, which is not a common way to compute those rates.

⁵ We have not used any procedure to first locate the breast within the mammogram. So by “excluding borders” we mean manually delineating the breast and running the algorithm within. Therefore, excluding borders means excluding sources of errors such as the presence of letters within the field of inspection, the breast boundary, etc.

Figure 3.6 Second example: (a) original image. Results of first pass: (b) detected MCCs on original image and (c) posterior probabilities on each inspection window. Results of second pass: (d) detected MCCs on original image and (e) posterior probabilities on each inspection window; (f) final segmentation (carcinoma case).

work have been taken from it. Specifically, denoting by TN the true negatives and by TP the true positives, we have, with borders excluded, TN = 19,384 − 29 = 19,355 windows and TP = 19,384 − 217 = 19,167 windows or, equivalently, TN = 99.85% and TP = 98.88%. Consequently, for our test set we have obtained TP = 98.88% and FP = 0.15%. The best results reported in [30] are TP = 90.1% and FP = 0.71%. The result, although not conclusive, suggests that the system proposed here may constitute an approach worth taking.
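The percentages quoted above and in Table 3.1 can be reproduced from the window counts with a few lines of Python. As stated in footnote 4, all rates are taken over the total number of windows; the function name and the dictionary keys are illustrative only.

```python
def error_rates(total_windows, fp, fn):
    """Rates in percent over the total number of windows (see footnote 4)."""
    def pct(count):
        return round(100.0 * count / total_windows, 2)

    return {
        "total": pct(fp + fn),
        "FP": pct(fp),
        "FN": pct(fn),
        "TN": pct(total_windows - fp),   # windows without a false positive
        "TP": pct(total_windows - fn),   # windows without a false negative
    }


# Window counts taken from Table 3.1.
full = error_rates(21648, fp=104, fn=309)
no_borders = error_rates(19384, fp=29, fn=217)
```

Note that, with this convention, TN and TP are simply the complements of the FP and FN rates over all windows, which is why they cannot be compared directly with sensitivity and specificity figures computed per class.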

3.6 FUTURE TRENDS

We have presented a neural methodology to make decisions by means of posterior prob-

abilities learned from data. In addition, a complexity selection algorithm has been pro-

posed, and the whole system has been applied to detect and segment MCCs in

mammograms. We show results on images from a hospital database, and we obtain posterior probabilities for the presence of an MCC with the use of the PPMS algorithm. Our

understanding is that probability estimates are a valuable piece of information; in particu-

lar, they give an idea of the certainty in the decisions made by the computer, which can be

weighed by a physician to make the final diagnosis. As of today we are not aware of similar approaches applied to this problem. In addition, the generation of ROC curves is being carried out in order to compare with other methods, although the reader should take into account that the images from the database considered here are private, so other standard public-domain mammogram databases should be considered. Needless to say, image

enhancement may dramatically improve the performance of any decision-making pro-

cedure. This constitutes part of our current effort, since various enhancement algorithms

are being tested, together with a validation of the system described here against standard

public mammogram databases and the application of the presented algorithm and NN

methodology in other medical problems of interest [31].

TABLE 3.1 Total Error and Error Rates in % (FP and FN) for the Full Mammograms and the Mammograms with Borders Excluded in the Test Set, Over a Population of 21,648 and 19,384 Windowed 8 × 8 Pixel Images, Respectively

Case               Total error windows   % Total   FP    % FP   FN    % FN
Test, full         413                   1.91      104   0.48   309   1.43
Test, no borders   246                   1.27      29    0.15   217   1.12

ACKNOWLEDGMENTS

The authors wish to first express their sincere gratitude

to Martiniano Mateos-Marcos for the help received

while preparing part of the figures included in the

manuscript and in programming some simulations.

They also thank the Radiology Department personnel at the Asturias General Hospital in Oviedo for the help received while evaluating the performance of the system and the radiologists at Puerta de Hierro Hospital in Madrid for providing the mammogram database used in training the NN. Juan I. Arribas wants to thank Dr. Luis del Pozo for reviewing the manuscript and Yunfeng Wu and Dr. Gonzalo García for their help. This work was supported by grant numbers FP6-507609 SIMILAR Network of Excellence from the European Union and TIC01-3808-C02-02, TIC02-03713, and TEC04-06647-C03-01 from Comisión Interministerial de Ciencia y Tecnología, Spain.


REFERENCES

1. M. I. JORDAN AND R. A. JACOBS. Hierarchical mixtures

of experts and the EM algorithm. Neural Computation,

6(2):181–214, 1994.

2. H. L. VAN TREES. Detection, Estimation and

Modulation Theory. Wiley, New York, 1968.

3. S. M. KAY. Fundamentals of Statistical Signal Proces-

sing. Detection Theory. Prentice-Hall, Englewood

Cliffs, NJ, 1998.

4. M. L. GIGER, Z. HUO, M. A. KUPINSKI, AND C. J.

VYBORNY. Computer-aided diagnosis in mammography.

In M. SONKA AND J. M. FITZPATRICK, Eds., Handbook

of Medical Imaging, Vol. II. The International Society

for Optical Engineering, Bellingham, WA, 2000.

5. B. ROSNER. Fundamentals of Biostatistics. Duxbury

Thomson Learning, Pacific Grove, CA, 2000.

6. S. M. KAY. Fundamentals of Statistical Signal Proces-

sing. Estimation Theory. Prentice-Hall, Englewood

Cliffs, NJ, 1993.

7. M. A. KUPINSKI, D. C. EDWARDS, M. L. GIGER, AND

C. E. METZ. Ideal observer approximation using

Bayesian classification neural networks. IEEE Trans.

Med. Imag., 20(9):886–899, September 2001.

8. P. S. SZCZEPANIAK, P. J. LISBOA, AND J. KACPRZYK.

Fuzzy Systems in Medicine. Physica-Verlag,

Heidelberg, 2000.

9. J. MILLER, R. GODMAN, AND P. SMYTH. On loss func-

tions which minimize to conditional expected values

and posterior probabilities. IEEE Trans. Neural

Networks, 4(39):1907–1908, July 1993.

10. B. S. WITTNER AND J. S. DENKER. Strategies for teaching

layered neural networks classification tasks. In W. V. OZ

AND M. YANNAKAKIS, Eds., Neural Information

Processing Systems, vol. 1. CA, 1988, pp. 850–859.

11. D. W. RUCK, S. K. ROGERS, M. KABRISKY, M. E. OXLEY,

AND B. W. SUTER. The multilayer perceptron as an

approximation to a Bayes optimal discriminant function.

IEEE Trans. Neural Networks, 1(4):296–298, 1990.

12. E. A. WAN. Neural Network classification: A Bayesian

interpretation. IEEE Trans. Neural Networks,

1(4):303–305, December 1990.

13. R. G. GALLAGER. Information Theory and Reliable

Communication. Wiley, New York, 1968.

14. E. B. BAUM AND F. WILCZEK. Supervised learning of

probability distributions by neural networks. In D. Z.

ANDERSON, Ed., Neural Information Processing

Systems. American Institute of Physics, New York,

1988, pp. 52–61.

15. A. EL-JAROUDI AND J. MAKOUL. A new error criterion

for posterior probability estimation with neural nets.

International Joint Conference on Neural Nets, San

Diego, 1990.

16. J. I. ARRIBAS AND J. CID-SUEIRO. A model selection

algorithm for a posteriori probability estimation with

neural networks. IEEE Trans. Neural Networks,

16(4):799–809, July 2005.

17. J. I. ARRIBAS. Neural networks for a posteriori prob-

ability estimation: structures and algorithms. Ph.D.

dissertation, Electrical Engineering Department,

Universidad de Valladolid, Spain, 2001.

18. J. CID-SUEIRO, J. I. ARRIBAS, S. URBAN-MUNOZ, AND

A. R. FIGUEIRAS-VIDAL. Cost functions to estimate a

posteriori probability in multiclass problems. IEEE

Trans. Neural Networks, 10(3):645–656, May 1999.

19. S. I. AMARI. Backpropagation and stochastic gradient

descent method. Neurocomputing, 5:185–196, 1993.

20. B. A. TELFER AND H. H. SZU. Energy functions

for minimizing misclassification error with minimum-

complexity networks. Neural Networks, 7(5):809–818,

1994.

21. J. I. ARRIBAS, J. CID-SUEIRO, T. ADALI, AND A. R.

FIGUEIRAS-VIDAL. Neural architectures for parametric

estimation of a posteriori probabilities by constrained

conditional density functions. In Y. H. HU, J. LARSEN,

E. WILSON, AND S. DOUGLAS, Eds., Neural Networks

for Signal Processing. IEEE Signal Processing

Society, Madison, WI, pp. 263–272.

22. K. CHEN, L. XU, AND H. CHI. Improved learning algor-

ithm for mixture of experts in multiclass classification.

Neural Networks, 12:1229–1252, June 1999.

23. V. N. VAPNIK. An overview of statistical learning

theory. IEEE Trans. Neural Networks, 10(5):988–999,

September 1999.

24. R. REED. Pruning algorithms—a survey. IEEE Trans.

Neural Networks, 4(5):740–747, September 1993.

25. J. FRITSCH, M. FINKE, AND A. WAIBEL. Adaptively

growing hierarchical mixture of experts. In Advances

in Neural Information Processing Systems, vol. 6.

Morgan Kaufmann Publishers, CA, 1994.

26. J. L. ALBA, L. DOCIO, D. DOCAMPO, AND O. W.

MARQUEZ. Growing Gaussian mixtures network for

classification applications. Signal Proc., 76(1):43–60,

1999.

27. L. XU, M. I. JORDAN, AND G. E. HINTON. An alternative

model for mixtures of experts. In T. LEEN, G. TESAURO,

AND D. TOURETZKY, Eds., Advances in Neural

Information Processing Systems, vol. 7. MIT Press,

Cambridge, MA, 1995, pp. 633–640.

28. Direccion General de Salud Publica. Population Screen-

ing of Breast Cancer in Spain (in Spanish). Ministerio

de Sanidad y Consumo, Madrid, Spain, 1998.

29. F. WINSBERG, M. ELKIN, J. MACY, V. BORDAZ,

AND W. WEYMOUTH. Detection of radiographic

abnormalities in mammograms by means of optical scan-

ning and computer analysis. Radiology, 89:211–215,

1967.

30. B. ZHENG, W. QIAN, AND L. CLARKE. Digital mammography:

Mixed feature neural network with spectral

entropy decision for detection of microcalcifications.

IEEE Trans. Med. Imag., 15(5):589–597, October 1996.

31. A. TRISTAN AND J. I. ARRIBAS. A radius and ulna

skeletal age assessment system. In V. CALHOUN AND

T. ADALI, Eds., Machine Learning for Signal Processing.

IEEE Signal Processing Society, Mystic, CT,

2005, pp. 221–226.
