Post on 27-Apr-2023
C H A P T E R3ESTIMATION OF POSTERIOR
PROBABILITIES WITH NEURAL
NETWORKS: APPLICATION TO
MICROCALCIFICATION
DETECTION IN BREAST
CANCER DIAGNOSISJuan Ignacio Arribas, Jesus Cid-Sueiro, and
Carlos Alberola-Lopez
3.1 INTRODUCTION
Neural networks (NNs) are customarily used as classifiers aimed at minimizing
classification error rates. However, it is known that the NN architectures that compute
soft decisions can be used to estimate posterior class probabilities; sometimes, it could
be useful to implement general decision rules other than the maximum a posteriori
(MAP) decision criterion. In addition, probabilities provide a confidence measure of the
classifier decisions, a fact that is essential in applications in which a high risk is involved.
This chapter is devoted to the general problem of estimating posterior class probabil-
ities using NNs. Two components of the estimation problem are discussed: Model selec-
tion, on one side, and parameter learning, on the other. The analysis assumes an NN model
called the generalized softmax perceptron (GSP), although most of the discussion can be
easily extended to other schemes, such as the hierarchical mixture of experts (HME) [1],
which has inspired part of our work, or even the well-known multilayer perceptron.
The use of posterior probability estimates is applied in this chapter to a medical
decision support system; the testbed used is the detection of microcalcifications
(MCCs) in mammograms, which is a key step in breast cancer early diagnosis.
The chapter is organized as follows: Section 3.2 discusses the estimation of posterior
class probabilities with NNs, with emphasis in a medical application; Section 3.3 discusses
learning and model selection algorithms for the GSP networks; Section 3.4 proposes a
system for MCC detection based on the GSP; Section 3.5 shows some simulation
results on detection performance using a mammogram database; and Section 3.6 provides
some conclusions and future trends.
Handbook of Neural Engineering. Edited by Metin AkayCopyright # 2007 The Institute of Electrical and Electronics Engineers, Inc.
41
3.2 NEURAL NETWORKS AND ESTIMATION OFA POSTERIORI PROBABILITIES INMEDICAL APPLICATIONS
3.2.1 Some Background Notes
A posteriori (or posterior) probabilities, or generally speaking, posterior density functions,
constitute a well-known and widely applied discipline in estimation and detection theory
[2, 3] and, more widely speaking, in pattern recognition. The concept underlying pos-
teriors is to update one’s prior knowledge about the state of nature (with the term
nature adapted to the specific problem one is dealing with) according to the observation
of reality that one may have.
Specifically, recall the problem of a binary decision, that is, the need of choosing
either hypothesis H0 or H1 having as information both our prior knowledge about the
probabilities of each hypothesis being the right one [say P(Hi), i ¼ 0, 1] and how the
observation x probabilistically behaves according to each of the two hypotheses [say
f (xjHi), i ¼ 0, 1]. In this case, it is well known that the optimum decision procedure is
to compare the likelihood ratio, that is, f (xjH1)/f (xjH0) with a threshold. The value of
the threshold is, from a Bayesian perspective, a function of the two prior probabilities
and a set of costs (set at designer needs) associated to the two decisions. Specifically
f (xjH1)
f (xjH0)_H1
H0
P(H0)(C10 � C00)
P(H1)(C01 � C11)(3:1)
with Cij the cost of choosing Hi when Hj is correct.
A straightforward manipulation of this equation leads to
P(H1jx)
P(H0jx)_H1
H0
C10 � C00
C01 � C11
(3:2)
with P(Hijx) the a posteriori probability that hypothesis Hi is true, i ¼ 0, 1. For the case
Cij ¼ 1 2 dij, with dij the Kronecker delta function, the equation above leads to the
general MAP rule, which can be rewritten, now extended to the M-ary detection
problem, as
H� ¼ arg maxHi
P(Hijx) i ¼ 1, . . . , M (3:3)
It is therefore clear that posterior probabilities are sufficient statistics for MAP
decision problems.
3.2.2 Posterior Probabilities as Aid to Medical Diagnosis
Physicians are used to working with concepts of sensitivity and specificity of a diagnosis
procedure [4, 5]. Considering H0 the absence of a pathology and H1 the presence of a path-
ology, the sensitivity1 of the test is defined as P(H1jH1) and the specificity of the test is
P(H0jH0). These quantities indicate the probabilities of a test doing fine. However, a
test may draw false positives (FPs), with probability P(H1jH0), and false negatives
(FNs), with probability P(H0jH1).
1In this case, probability P(HijHj) should be read as the probability of deciding Hi when Hj is correct.
42 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS
Clearly, the ideal test is that for which P(HijHj) ¼ dij. However, for real situations,
this is not possible. Sensitivity and specificity are not independent but are related by means
of a function known as receiver operating characteristics (ROCs). The physician should
therefore choose at which point of this curve to work, that is, how to trade FPs (also
known as 1-specificity) and FNs (also known as the sensitivity) according to the
problem at hand.
Two tests, generally speaking, will perform differently. The way to tell whether one
of them outperforms the other is to compare their ROCs. If one ROC is always above the
other, that is, if setting the specificity then the sensitivity of one test is always above that of
the other, then the former is better. Taking this as the optimality criterion, the key question
is the following: Is there a way to design an optimum test? The answer is also well known
[3, 6] and is called the Neyman–Pearson (NP) lemma, and a testing procedure that is built
upon this principle has been called the ideal observer elsewhere [7].
The NP lemma states that the optimum test is given by comparing the likelihood ratio
mentioned before with a threshold. Note that this is operationally similar to the procedure
described in Eq. (3.1). However, what changes now is the value of the threshold. This is
set by assuring that the specificity will take on a particular value, set at the physician’s will.
The reader may then wonder whether a testing procedure built upon a ratio of pos-
terior probabilities, as the one expressed in Eq. (3.2), is optimum in the sense stated in this
section. The answer is yes: The test described in Eq. (3.1) only differs in the point in the
optimum ROC on which it is operating; while a test built upon the NP lemma sets where to
work in the ROC, the test built upon posteriors does not fix where to work. Its motivation is
different: It changes prior knowledge into posterior knowledge by means of the obser-
vations and decides upon the costs given to each decision and each state of nature. This
assures that the overall risk (in a Bayesian sense) is minimum [2].
The question is therefore: Why should a physician use a procedure built upon pos-
teriors? Here are the answers:
. The test indicated in (3.2) also implements and ideal observer [7]. So, no loss of
optimality is committed.
. As we will point out in the chapter, it is simple to build NN architectures that esti-
mate posteriors out of training data. For nonparametric models, the estimation of
both posteriors and likelihood functions is essentially equivalent. However, for para-
metric models, the procedure to estimate posteriors reflects the objectives of the esti-
mation, which is not the case for the estimation of likelihood functions. Therefore, it
seems more natural to use this former approach.
. Working with posteriors, the designer is not only provided with a hard decision
about the presence or absence of a pathology but is also provided with a measure
of confidence about the decision itself. This is an important piece of additional infor-
mation that is also encoded in a natural range for human reasoning (i.e., within the
interval [0, 1]). This reminds us of well-known fuzzy logic procedures often used in
medical environments [8].
. As previously stated, posteriors indicate how our prior knowledge changes accord-
ing to the observations. The prior knowledge, in particular P(H1), is in fact the
prevalence [5] of an illness or pathology. Assume an NN is trained with data
from a region with some degree of prevalence but is then used with patients from
some other region where the prevalence of the pathology is known to be different.
In this case, accepting that symptoms (i.e., observations) are conditionally equally
distributed (CED) regardless of the region where the patient is from, it is simple
to adapt the posteriors from one region to the other without retraining the NN.
3.2 NEURAL NETWORKS AND ESTIMATION OF A POSTERIORI PROBABILITIES 43
Specifically, denote by Hij the hypothesis i(i ¼ 0, 1) in region j( j ¼ 1, 2). Then
P(H1i jx) ¼
f (xjH1i )P(H1
i )
f 1(x)
P(H2i jx) ¼
f (xjH2i )P(H2
i )
f 2(x)
(3:4)
Accepting that symptoms are CED, then f (xjHji ) ¼ f (xjHi). Therefore
P(H20 jx) ¼
f 1(x)
f 2(x)
P(H10 jx)
P(H10)
P(H20) ¼ l
P(H10 jx)
P(H10)
P(H20)
P(H21 jx) ¼ l
P(H11 jx)
P(H11)
P(H21)
(3:5)
where l ¼ f 1(x)=f 2(x). Since
P(H20 jx)þ P(H2
1 jx) ¼ 1 (3:6)
then
lP(H1
0 jx)
P(H10)
P(H20)þ
P(H11 jx)
P(H11)
P(H21)
� �¼ 1 (3:7)
which leads to
P(H20 jx) ¼
½P(H10 jx)=P(H1
0)�P(H20)
½P(H10 jx)=P(H1
0 )�P(H20 )þ ½P(H1
1 jx)=P(H11)�P(H2
1)(3:8)
which gives the procedure to update posteriors using as inputs the posteriors of the
originally trained NN and the information about the prevalences in the two regions,
information that is in the public domain.
The reader should notice that we have never stated that decisions based on posteriors are
preferable to those based on likelihood functions. Actually, the term natural has been used
twice in the reasons to back up the use of posteriors for decision making. This naturalness
should be balanced by the physician to make the final decision. In any case, the procedure
based on posteriors can be very easily forced to work, if needed, as the procedure based on
likelihoods: Simply tune the threshold on the right-hand side of Eq. (3.2) during the train-
ing phase so that some value of specificity is guaranteed.
3.2.3 Probability Estimation with NN
Consider a sample set S ¼ {(xk, dk), k ¼ 1, . . . , K}, where xk [ RN is an observation
vector and dk [ UL ¼ {u0, . . . , uL�1} is an element in the set of possible target classes.
Class i label ui [ RL has all components null but the one at the ith, row, which is unity
(i.e., classes are mutually exclusive).
In order to estimate posterior probabilities in an L-class problem, consider the struc-
ture shown in Figure 3.1, where the soft classifier is an NN computing a nonlinear mapping
gw : x! P, with parameters w, x being the input feature space and P ¼ {y [ ½0,1�Lj0 �
yi � 1,PL
i¼1 yi ¼ 1} a probability space, such that the soft decision satisfies the
44 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS
probability constraints
0 � yj � 1XL
j¼1
yj ¼ 1 (3:9)
Assuming that soft decisions are posterior probability estimates, the hard decisionbdshould be computed according to the decision criterion previously established. For
instance, under the MAP criterion, the hard decision becomes a winner-takes-all
(WTA) function.
Training pursues the calculation of the appropriate weights to estimate probabilities. It
is carried out by minimizing some average, based on samples in S, of a cost function C(y, d).
The first attempts to estimate a posteriori probabilities [9–12] were based on the
application of the square error (SE) or Euclidean distance, that is,
CSE( y, d) ¼XL
i¼1
(di � yi)2 ¼ ky� dk2 (3:10)
It is not difficult to show that, over all nonlinear mappings y ¼ g(x), the one mini-
mizing EfCSE(y, d)g is given by g(x) ¼ Efdjxg. Since Efdijxg ¼ P(Hijx), it is clear that
minimizing any empirical estimate of EfCSEg based on samples in S, estimates of posterior
probabilities are obtained.
Also, the cross entropy (CE), given by
CCE(y, d) ¼ �XL
i¼1
di log yi (3:11)
is known to provide class posterior probability estimates [13–15]. In fact, there is an infi-
nite number of cost functions satisfying this property: Extending previous results from
other authors, Miller et al. [9] derived the necessary and sufficient conditions for a cost
function to minimize to a probability, providing a closed-form expression for these
functions in binary problems. The analysis has been extended to the multiclass case in
[16–18], where it is shown that any cost function that provides posterior probability
estimates—generically called strict sense Bayesian (SSB)—can be written in the form
C(y, d) ¼ h(y)þ (d� y)Tryh(y) (3:12)
where h(y) is any strictly convex function in P which can be interpreted as an entropy
measure. As a matter of fact, this expression is also a sufficient condition: Any cost func-
tion expressed in this way provides probability estimates. In particular, if h is the Shannon
entropy, h(y) ¼ �P
i yi log yi, then Eq. (3.12) becomes the CE, so it can be easily proved
by the reader that both CE and SE cost functions are SSB.
Softclassifier
y
WTAnetwork dx
Figure 3.1 Soft-decision multiple output and WTA network.
3.2 NEURAL NETWORKS AND ESTIMATION OF A POSTERIORI PROBABILITIES 45
In order to choose an SSB cost function to estimate posterior probabilities, one may
wonder what is the best option for a given problem. Although the answer is not clear,
because there is not yet any published comparative analysis among general SSB cost func-
tions, there is some (theoretical and empirical) evidence that learning algorithms based on
CE tend to show better performance than those based on SE [10, 19, 20]. For this reason,
the CE has been used in our simulations shown later.
3.3 POSTERIOR PROBABILITY ESTIMATION BASEDON GSP NETWORKS
This section discusses the problem of estimating posterior probability maps in multiple-
hypotheses classification with the particular type of architecture that has been used in
the experiments: the GSP. First, we present a functional description of the network.
Second, we discuss training algorithms for parameter estimation and, lastly we discuss
the model selection problem in GSP networks.
3.3.1 Neural Architecture
Consider the neural architecture that, for input feature vector x, produces output yi,
given by
yi ¼XMi
j¼1
yij (3:13)
where yij are the outputs of a softmax nonlinear activation function given by
yij ¼exp (oij)PL
k¼1
PMk
m¼1 exp (okm)i ¼ 1, 2, . . . , L j ¼ 1, 2, . . . , Mi (3:14)
and oij ¼ wTijxþ bij, where wij and bij are the weight vectors and biases, respectively, L
is the number of classes, and Mi is the number of softmax outputs aggregated to
compute yi.2
The network structure is illustrated in Figure 3.2. In [16, 17, 21], this network is
named the generalized softmax perceptron (GSP). It is easy to understand that Eqs.
(3.13) and (3.14) ensure that outputs yi satisfy probability constraints given in (3.9).
Since we are interested in the application of MCC detection, we have only con-
sidered binary networks (L ¼ 2) in the simulations. In spite of this, we discuss here the
general multiclass case.
The GSP is a universal posterior probability network, in the sense that, for values of
Mi sufficiently large, it can approximate any probability map with arbitrary precision. This
fact can be easily proved by noting that the GSP can compute exact posterior probabilities
when data come from a mixture model based on Gaussian kernels with the same variance
matrix. Since a Gaussian mixture can approximate any density function, the GSP can
approximate any posterior probability map.
In order to estimate class probabilities from samples in training set S, an SSB cost
function and a search method must be selected. In particular, the stochastic gradient
2Note that, when the number of inputs is equal to 2, the softmax is equivalent to a pair of sigmoidal activation
functions.
46 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS
learning rules to minimize the CE, Eq. (3.11), are given by
wkþ1ij ¼ wk
ij � rkyk
ij
yki
(dki � yk
i )xk (3:15)
bkþ1ij ¼ bk
ij � rkyk
ij
yki
(dki � yk
i ) 1 � i � L 1 � j � Mi (3:16)
where wkij and bk
ij are the weight vectors and biases, respectively, ykij are the softmax outputs
for input x k, dik are the labels for sample xk, and rk is the learning step at iteration k.
In the following section we derive these rules from a different perspective, based on
maximum-likelihood (ML) estimation, which provides an interpretation for the softmax
outputs and some insight into the behavior of the GSP network and which suggests
some ways to design other estimation algorithms.
3.3.2 Probability Model Based on GSP
Since the GSP outputs satisfy the probability constraints, we can say that any GSP
implements a multinomial probability model of the form
P(djW, x) ¼YL
i¼1
(yi)di (3:17)
where matrix W encompasses all GSP parameters (weight vector and biases) and yi is
given by Eqs. (3.13) and (3.14). According to this, we arrive at yi ¼ P(uijW, x). This prob-
abilistic interpretation of the network outputs is useful for training: The model parameters
that best fit the data in S can be estimated by ML as
bW ¼ arg maxW
l(S, W) (3:18)
where
l(S, W) ¼XK
k¼1
log P(dkjW, xk) (3:19)
Following an analysis similar to that of Jordan and Jacobs [1] for HME, this maximi-
zation can be done iteratively by means of the expectation maximization (EM) algorithm.
To do so, let us assume that each class is partitioned into several subclasses. Let Mi
be the number of subclasses inside class i. The subclass for a given sample x will be
max
Soft-y0
y2
y1
w00
w01
w10
w11
w12
w20
x
Figure 3.2 The GSP neural architecture.
3.3 POSTERIOR PROBABILITY ESTIMATION BASED ON GSP NETWORKS 47
represented by a subclass label vector z with components zij [ {0, 1}, i ¼ 1, . . . , L,
j ¼ 1, . . . , Mi, such that one and only one of them is equal to 1 and
di ¼XMi
j¼1
zij 1 � i � L (3:20)
Also, let us assume that the joint probability model of d and z is given by
P(d, zjW, x) ¼ Id, z
YL
i¼1
YMi
j¼1
(yij)zij (3:21)
where Id,z is an indicator function equal to 1 if d and z satisfy Eq. (3.20) and equal to 0
otherwise.
Note that the joint model equation (3.21) is consistent with that in Eq. (3.17), in the
sense that the latter results from the former by marginalization,Xz
P(d, zjW, x) ¼ P(djW, x) (3:22)
Since there is no information about subclasses in S, learning can be supervised at the
class level but must be unsupervised at the subclass level. The EM algorithm based on
hidden variables z k to maximize log-likelihood l(S, W) in (3.19) provides us with a
way to proceed. Consider the complete data set Sc ¼ {(xk, dk, zk), k ¼ 1, . . . , K},
which extends S by including subclass labels. According to Eq. (3.21), the complete
data likelihood is
lc(Sc, W) ¼XK
k¼1
log P(dk, zkjW, xk) ¼XK
k¼1
XL
i¼1
XMi
j¼1
zkij log yk
ij (3:23)
The EM algorithm for ML estimation of the GSP posterior model proceeds, at
iteration t, in two steps:
1. E-step: Compute Q(W, W t) ¼ E{lc(Sc, W)jS, Wt}.
2. M-step: Compute W tþ1 ¼ arg maxW Q(W, W t).
In order to apply the E-step to lc, note that, given S and assuming that true par-
ameters are W t, the only unknown components in lc are the hidden variables. Therefore,
Q(W, Wt) ¼XK
k¼1
XL
i¼1
XMi
j¼1
E{zkijjS, Wt} log yk
ij (3:24)
Let us use compact notation yijkt to refer to the softmax output (relative to class i and
subclass j ) for input x k at iteration t (and the same for zijkt and yi
kt). Then
E{zkijjS, Wt} ¼ E{zk
ijjxk, dk, Wt} ¼ dk
i
yktij
ykti
(3:25)
Substituting (3.25) in (3.24), we get
Q(W, Wt) ¼XK
k¼1
XL
i¼1
dki
XMi
j¼1
yktij
ykti
log ykij (3:26)
48 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS
Therefore, the M-step reduces to
Wtþ1 ¼ arg maxW
XK
k¼1
XL
i¼1
dki
XMi
j¼1
yktij
ykti
log ykij (3:27)
Note that, during the M-step, only yijk depends on W. Maximization can be done in
different ways. For the HME, the iteratively reweighted least squares algorithm [1] or the
Newton–Raphson algorithm [22] has been explored. A more simple solution (that is also
suggested in [1]) consists of replacing the M-step by a single iteration of a gradient
search rule,
Wtþ1 ¼Wt � rtXK
k¼1
XL
i¼1
dki
XMi
j¼1
1
ykti
rWykij
����W¼Wt
(3:28)
which leads to
wtþ1mn ¼ wt
mn � rtXK
k¼1
XL
i¼1
dki
XMi
j¼1
yktij
ykti
(dm�idn�j � yktmn)xk
¼ wtmn � rt
XK
k¼1
yktmn
yktm
(dkm � ykt
m)xk (3:29)
btþ1mn ¼ bt
mn � rtXK
k¼1
yktmn
yktm
(dkm � ykt
m) (3:30)
where rt is the learning step. A further simplification can be done by replacing the
gradient search rule by an incremental version. In doing so, rules (3.15) and (3.16)
result.
3.3.3 A Posteriori Probability Model Selection Algorithm
While the number of classes is fixed and assumed known, the number of subclasses inside
each class is unknown and must be estimated from samples during training: It is a well-
known fact that a high number of subclasses may lead to data overfitting, a situation in
which the cost averaged over the training set is small but the cost averaged over a test
set with new samples is high.
The problem of determining the optimal network size, also known as the model
selection problem, is as well known as, in general, difficult to solve; see [16, 17, 23].
The selected architecture must find a balance between the approximation power of
large networks and the usually higher generalization capabilities of small networks.
A review of model selection algorithms is beyond the scope of this chapter. The
algorithm we propose to determine the GSP configuration, which has been called the
a posteriori probability model selection (PPMS) algorithm [16, 17], belongs to the family
of growing and pruning algorithms [24]: Starting from a pre-defined architecture,
subclasses are added to or removed from the network during learning according to
needs. The PPMS algorithm determines the number of subclasses by seeking a balance
between generalization capability and learning toward minimal output errors.
Although PPMS could be easily extended to other networks, like the HME [1], the
formulation presented here assumes a GSP architecture. The PPMS algorithm combines
pruning, splitting, and merging operations in a similar way to other algorithms proposed
in the literature, mainly for estimating Gaussian mixtures (see [25] or [26] for instance). In
3.3 POSTERIOR PROBABILITY ESTIMATION BASED ON GSP NETWORKS 49
particular, PPMS can be related to the model selection algorithm proposed in [25] where
an approach to automatically growing and pruning HME [1, 27] is proposed, obtaining
better generalization performance than traditional static and balanced hierarchies. We
observe several similarities between their procedure and the algorithm proposed here
because they also compute posterior probabilities in such a way that a path is pruned if
the instantaneous node probability of activation falls below a certain threshold.
The fundamental idea behind PPMS is the following: According to the GSP struc-
ture and its underlying probability model, the posterior probability of each class is a sum of
the subclass probabilities. The importance of a subclass in the sum can be measured by
means of its prior probability,
Ptij ¼ P{zij ¼ 1jWt} ¼
ðP{zij ¼ 1jWt, x}p(x) dx ¼
ðyijp(x) dx (3:31)
which can be approximated using samples in S as3
bPtij ¼
1
K
XK
k¼1
yktij (3:32)
A small value of bPtij is an indicator that only a few samples are being captured by
subclass j in class i, so its elimination should not affect the whole network performance
in a significant way, at least in average terms. On the contrary, a high subclass prior prob-
ability may indicate that the subclass captures too many input samples. The PPMS algor-
ithm explores this hypothesis by dividing the subclass into two halves in order to represent
this data distribution more accurately. Finally, it has also been observed that, under some
circumstances, certain weights in different subclasses of the same class tend to follow a
very similar time evolution to other weights within the same class. This fact suggests a
new action: merging similar subclasses into a unique subclass.
The PPMS algorithm implements these ideas via three actions, called prune, split,
and merge, as follows:
1. Prune Remove a subclass by eliminating its weight vector if its a priori probability
estimate is below a certain pruning threshold mprune. That is,
bPtij , mprune (3:33)
2. Split Add a newsubclass by splitting in two a subclass whose prior probability esti-
mate is greater than split threshold msplit. That is,
bPtij , msplit (3:34)
Splitting subclass (i, j ) is done by removing weight vector wijt and constructing a
pair of new weight vectors wijtþ1 and wij0
tþ1 such that, at least initially, the posterior
3Since Wt is computed based on samples in S, prior probability estimates based on the same sample set are biased.
Therefore, priors should be estimated by averaging from a validation set different from S. In spite of this, in order
to reduce the data demand, our simulations are based on only one sample set.
50 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS
probability map is approximately the same:
wtþ1ij ¼ wt
ij þ D
wtþ1ij0 ¼ wt
ij � D
btþ1ij ¼ btþ1
ij � log 2
btþ1ij0 ¼ btþ1
ij � log 2
(3:35)
It is not difficult to show that, for D ¼ 0, probability maps P(djx, Wt) and
P(djx, Wtþ1) are identical. In order that new weight vectors can evolve differently
during time, using a small nonzero value of D is advisable. The log 2 constant
has a halving effect.
3. Merge Mix or fuse two subclasses into a single one. That is, the respective weight
vectors of both subclasses are fused into a single one if they are close enough. Sub-
classes j and j0 in class i are merged if D(wij, wij0) , mmerge8i and 8j, j0, where D
represents a distance measure and mmerge is the merging threshold. In our simu-
lations, D was taken to be the Euclidean distance.
After merging subclasses j and j0, a new weight vector wijtþ1 and bias bij
tþ1 are
constructed according to
wtþ1ij ¼
12(wt
ij þ wtij0 )
btþ1ij ¼ log½exp (bt
ij)þ exp (btij0)�
(3:36)
3.3.4 Implementation of PPMS
An iteration of PPMS can be implemented after each M-step during learning or after
several iterations of the EM algorithm.
In our simulations, we have explored the implementation of PPMS after each M-step,
where the M-step is reduced to a single iteration of rules (3.15) and (3.16). Also, to reduce
computational demands, prior probabilities are not estimated using Eq. (3.32) but
are updated iteratively as bP tþ1ij ¼ (1� at)bP tþ1
ij þ atytij, where 0 � a t � 1. Note that, if
prior probabilities are initially nonzero and sum to 1, the updating rule preserves the sat-
isfaction of these probability constraints.
After each application of PPMS, the prior estimates must also be updated. This is
done as follows:
1. Pruning When a subclass is removed, the a priori probability estimates of the
remaining classes do not sum to 1. This inconsistency can be resolved by redistri-
buting the prior estimate of the removed subclass proportionally among the rest of
the subclasses. In particular, given that condition (3.33) is true, the new prior
probability estimates bPtþ1ij0 are computed from bPt
ij0 , after pruning subclass j of
class i as follows:
bPtþ1ij0 ¼
bPtij0
PMi
n¼1bPt
inPMi
n¼1,n=jbPt
in
1 � i � L 1 � j0 � Mi j0 = j
(3:37)
2. Splitting The a priori probability of the split subclass is also divided into two equal
parts that will go to each of the descendants. That is, if subclass j in class i is split into
3.3 POSTERIOR PROBABILITY ESTIMATION BASED ON GSP NETWORKS 51
subclasses j and j0, prior estimates are assigned to new subclasses as
bPtþ1ij ¼
12bPt
ijbPtþ1
ij0 ¼12bPt
ij0 (3:38)
3. Merging Probability vectors are modified in accordance with probabilistic laws;
see (3.9). Formally speaking, if subclasses j and j0 in class i are merged into
subclass j,
bPtþ1ij ¼
bPtij þ
bPtij0 (3:39)
3.4 MICROCALCIFICATION DETECTION SYSTEM
3.4.1 Background Notes in MicrocalcificationAutomatic Detection
Breast cancer is a major public health problem. It is known that about 40–75 new cases per
100,000 women are diagnosed each year in Spain, [28]. Similar quantities have been
observed in the United States; see [4].
Mammography is the current clinical procedure followed to the early detection of
cancer but ultrasound imaging of the breast is gaining importance nowadays. Radio-
graphic findings in breast cancer can be divided into two categories: MCCs and masses
[4]. About the former, debris and necrotic cells tend to calcify, so they can be detected
by radiography. It has been shown that between 30 and 50% of breast carcinomas detected
radiographically showed clustered MCCs, while 80% revealed MCCs after microscopic
examination. About the latter, most cancers that present as masses are invasive cancers
(i.e., tumors that have extended beyond the duct lumen) and are usually harder to detect
due to the similarity of mass lesions with the surrounding normal parenchymal tissue.
Efforts in automatically detecting either of the two major symptoms of breast cancer
started in 1967 [29] and are numerous since then. Attention has been paid to image
enhancement, feature definition and extraction, and classifier design. The reader may
want to consult [4] for an interesting introduction to these topics and follow the numerous
references included there. What we would like to point out is that, to the best of our knowl-
edge, the idea of using posteriors to build a classifier and to provide the physician with a
confidence in the decision based on posterior probabilities has not been explored so far.
This is our goal in what follows.
3.4.2 System Description
In Figure 3.3, we show a general scheme of the system we propose in order to detect
MCCs; the system will be window oriented, and, for each window, the a posteriori prob-
ability that a MCC is present will be computed; to that end, we make use of the GSP and
the PPMS algorithm, introduced respectively in Sections 3.3.1 and 3.3.3. The NN will be
fed with features calculated from the pixel intensities within each window. A brief descrip-
tion of these features is given in the forthcoming section.
The tessellation of the image causes MCC detections to obey a window-oriented
pattern; in addition, since decisions are made independently in each window, no spatial
coherence is forced so, due to FNs, clustered MCCs may show up as a number of separated
positive decisions. In order to obtain realistic results about the MCC shapes we have
defined a MCC segmentation procedure; this is a regularization procedure consisting of
two stages: First, a minimum distance is defined between detected MCCs so as to consider
52 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS
that detected MCCs within this distance belong to the same cluster. Then an approxi-
mation to the convex hull of this cluster is found by drawing the minimum rectangle
(with sides parallel to the image borders) that encloses all the detected MCCs within
the cluster. The final stage is a pixel-level decision: A decision for each pixel within
the rectangle just drawn (and not only within the detected MCCs) to belong or not to the
cluster is made. This is done by testing whether the pixel intensity falls inside the range of
the intensities of the detected MCCs within the cluster.
3.4.3 Definition of Input Features
In [30], the authors define, on both the spatial and the frequency domains, a set of four
features as inputs to an NN for MCC detection. The two spatial features are the pixel inten-
sity variance and the pixel energy variance. Both are calculated as sample variances within
each processing window, and the pixel energy is defined as the square of the pixel inten-
sity. The two frequency features are the block activity and the spectral entropy. The former
is the summation of the absolute values of the coefficients of the discrete cosine transform
(DCT) of the pixels within the block (i.e., the window), with the direct-current coefficient
excluded. The latter is the classical definition of the entropy, calculated in this case on the
normalized DCT coefficients, where the normalization constant is the block activity.
This set of features constitute the four-dimensional input vector that we will use for
our neural GSP architecture. Needless to say, these features compute search for MCC
boundaries; MCC interiors, provided that they fit entirely within an inspection window,
are likely to be missed. However, the regularization procedure will eliminate these losses.
3.5 RESULTS
The original mammogram database images were in the Lumiscan75 scanner format.
Images had an optical spatial resolution of 150 dpi or, equivalently, about 3500 pixels/cm2, which is a low resolution for today’s systems and close to the lower limit of the res-
olution needed to detect most MCCs to some authors’ judgment. We internally used all the
12 bits depth during computations. The database consisted of images of the whole breast
area and others of more detailed areas specified by expert radiologists.
Figure 3.3 The MCC system block diagram: an overview.
3.5 RESULTS 53
The window size was set to 8 � 8 square pixels, obtaining the best MCC detection
and segmentation results with that value after some trial and error testing. We defined
both a training set and a test set. All the mammogram images shown here belong to the
test set. We made a double-windowed scan of the original mammograms with half-
window interleave in the detection process in order to increase the detection capacity of
the system.
During simulations, the number of subclasses per class is randomly initialized as
well as the weight matrix W of the GSP NN; the learning step rt was decreased to pro-
gressively freeze learning as the algorithm evolves. The PPMS parameters were empiri-
cally determined based on trial and error (no critical dependencies were observed in
these values, though), and the initial values were mprune0 ¼ 0.025, msplit
0 ¼ 0.250, and
mmerge0 ¼ 0.1. After applying PPMS we obtained a GSP network complexity with one
subclass for the class MCC-present and seven subclasses for the class MCC-absent.
In Figure 3.4 a number of regions of interest (ROIs) have been drawn by an expert;
the result of the algorithm is shown on the right, in which detected MCCs have been
marked in white on the original mammogram background. Figure 3.5 shows details of
the ROI in the bottom-right corner of Figure 3.4; Figure 3.5a shows the original mammo-
gram, Figure 3.5b shows the detected MCCs in white, and Figure 3.5c shows the posterior
probabilities of a single pass indicated in each inspection window. As is clear from the
image, if an inspection window fits within a MCC, the posterior probability is fairly
small. However, this effect is removed by subsequent regularization.
Figure 3.4 (a) Original mammogram with four ROIs specified by the expert radiologist; (b) segmented
MCCs on the original image (necrosis grasa case).
54 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS
Figure 3.6 shows the results of the algorithm on a second example. In this case a
mammogram with a carcinoma is shown (original image in Fig. 3.6a); the results of the
first pass of the algorithm are shown in Figures 3.6b,c. Figure 3.6b shows the detected
MCCs superimposed on the original image and Figure 3.6c shows the posterior probabil-
ities on each inspection window. Figures 3.6d,e show the result of the second pass of the
algorithm with the same ordering as in the first pass. In this example, the sparsity of the
MCCs causes the MCCs to not fit entirely within each inspection window, so in those
inspection windows where MCCs are present, large variance values are observed,
leading to high posterior probabilities. Table 3.1 summarizes the performance of the pro-
posed MCC detection system by indicating both the FP and FN rates in our test set4 and
both including and excluding the borders of the mammogram images.5 Just as a rough
comparison we have taken the results reported in [30] as a reference. The comparison,
however, is not totally conclusive since the test images used in this chapter and in [30]
do not coincide. We have used this paper as benchmark since the features used in our
Figure 3.5 Details of ROI on bottom-left corner of Figure 3.4: (a) original portion of mammogram;
(b) segmented output (MCC in white on the original mammogram); (c) posterior probabilities of a single
pass of the algorithm on each inspection window (necrosis grasa case).
4When computing FP and FN in Table 3.1, we have considered the total number of windows, which is not always
a common way to compute those rates.5We have not used any procedure to first locate the breast within the mammogram. So by “excluding borders”
we mean manually delineating the breast and running the algorithm within. Therefore, excluding borders means
excluding sources of errors such as the presence of letters within the field of inspection, the breast boundary, etc.
3.5 RESULTS 55
Fig
ure
3.6
Sec
ond
exam
ple
:(a
)ori
gin
alim
age.
Res
ult
sof
firs
tpas
s:(b
)det
ecte
dM
CC
son
ori
gin
alim
age
and
(c)
post
erio
rpro
bab
ilit
ies
on
each
insp
ecti
on
win
dow
.R
esult
sof
seco
nd
pas
s:(d
)det
ecte
dM
CC
son
ori
gin
alim
age
and
(e)
post
erio
rpro
bab
ilit
ies
on
each
insp
ecti
on
win
dow
;(
f)
final
segm
enta
tion
(car
cinom
aca
se).
56
work have been taken from it. Specifically, denoting by TN the true negatives and by TP the
true positives, we have, with border excluded, TN ¼ 19,384 2 29 ¼ 19,355 windows and
TP ¼ 19,384 2 217 ¼ 19,167 windows or, equivalently, TN ¼ 99.85% and TP ¼
98.88%. Consequently for our test set we have obtained TP ¼ 98.88% and FP ¼ 0.1%.
The best results reported in [30] TP ¼ 90.1%, FP ¼ 0.71%. The result, although not
conclusive, suggests that the system proposed here may constitute an approach worth taking.
3.6 FUTURE TRENDS
We have presented a neural methodology to make decisions by means of posterior prob-
abilities learned from data. In addition, a complexity selection algorithm has been pro-
posed, and the whole system has been applied to detect and segment MCCs in
mammograms. We show results on images from a hospital database, and we obtain pos-
terior probabilities for the presence of a MCC with the use of the PPMS algorithm. Our
understanding is that probability estimates are a valuable piece of information; in particu-
lar, they give an idea of the certainty in the decisions made by the computer, which can be
weighed by a physician to make the final diagnosis. As of today we are not aware of
similar approaches applied to this problem. In addition, the generation of ROC curves
are being carried out in order to compare with others, although the reader should take
into account that the images from the database considered here are private, so other stan-
dard public-domain mammogram databases should be considered. Needless to say, image
enhancement may dramatically improve the performance of any decision-making pro-
cedure. This constitutes part of our current effort, since various enhancement algorithms
are being tested, together with a validation of the system described here against standard
public mammogram databases and the application of the presented algorithm and NN
methodology in other medical problems of interest [31].
TABLE 3.1.1 Total Error and Error Rates in % (FP and FN) for the Full Mammograms andMammograms with Borders Excluded Borders of Mammograms in the Test Set Over a Populationof 21,648 and 19,384, Respectively, Windowed 8 3 8 Pixel Images
Case Total error windows % Total FP %FP FN % FN
Test, full 413 1.91 104 0.48 309 1.43
Test, no borders 246 1.27 29 0.15 217 1.12
ACKNOWLEDGMENTS
The authors wish to first express their sincere gratitude
to Martiniano Mateos-Marcos for the help received
while preparing part of the figures included in the
manuscript and in programming some simulations.
They also thank the Radiology Department personnel
at the Asturias General Hospital in Oviedo for the
help received while evaluating the performance of
the system and to the radiologists at Puerta de Hierro
Hospital in Madrid for providing the mammogram
database used in training the NN. Juan I. Arribas
wants to thank Dr. Luis del Pozo for reviewing the
manuscript and Yunfeng Wu and Dr. Gonzalo
Garcıa for their help. This work was supported by
grant numbers FP6-507609 SIMILAR Network
of Excellence from the European Union and TIC01-
3808-C02-02, TIC02-03713, and TEC04-06647-
C03-01 from Comision Interministerial de Ciencia y
Tecnologıa, Spain.
ACKNOWLEDGMENTS 57
REFERENCES
1. M. I. JORDAN AND R. A. JACOBS. Hierarchical mixtures
of experts and the EM algorithm. Neural Computation,
6(2):181–214, 1994.
2. H. L. VAN TREES. Detection, Estimation and
Modulation Theory. Wiley, New York, 1968.
3. S. M. KAY. Fundamentals of Statistical Signal Proces-
sing. Detection Theory. Prentice-Hall, Englewood
Cliffs, NJ, 1998.
4. M. L. GIGER, Z. HUO, M. A. KUPINSKI, AND C. J.
VYBORNY. Computer-aided diagnois in mammography.
In M. SONKA AND J. M. FITZPATRICK, Eds., Handbook
of Medical Imaging, Vol. II. The International Society
for Optical Engineering, Bellingham, WA, 2000.
5. B. ROSNER. Fundamentals of Biostatistics. Duxbury
Thomson Learning, Pacific Grove, CA, 2000.
6. S. M. KAY. Fundamentals of Statistical Signal Proces-
sing. Estimation Theory. Prentice-Hall, Englewood
Cliffs, NJ, 1993.
7. M. A. KUPINSKI, D. C. EDWARDS, M. L. GIGER, AND
C. E. METZ. Ideal observer approximation using
Bayesian classification neural networks. IEEE Trans.
Med. Imag., 20(9):886–899, September 2001.
8. P. S. SZCZEPANIAK, P. J. LISBOA, AND J. KACPRZYK.
Fuzzy Systems in Medicine. Physica-Verlag,
Heidelberg, 2000.
9. J. MILLER, R. GODMAN, AND P. SMYTH. On loss func-
tions which minimize to conditional expected values
and posterior probabilities. IEEE Trans. Neural
Networks, 4(39):1907–1908, July 1993.
10. B. S. WITTNER AND J. S. DENKER. Strategies for teaching
layered neural networks classification tasks. In W. V. OZ
AND M. YANNAKAKIS, Eds., Neural Information
Processing Systems, vol. 1. CA, 1988, pp. 850–859.
11. D. W. RUCK, S. K. ROGERS, M. KABRISKY, M. E. OXLEY,
AND B. W. SUTER. The multilayer perceptron as an
approximation to a Bayes optimal discriminant function.
IEEE Trans. Neural Networks, 1(4):296–298, 1990.
12. E. A. WAN. Neural Network classification: A Bayesian
interpretation. IEEE Trans. Neural Networks,
1(4):303–305, December 1990.
13. R. G. GALLAGER. Information Theory and Reliable
Communication. Wiley, New York, 1968.
14. E. B. BAUM AND F. WILCZEK. Supervised learning of
probability distributions by neural networks. In D. Z.
ANDERSON, Ed., Neural Information Processing
Systems. American Institute of Physics, New York,
1988, pp. 52–61.
15. A. EL-JAROUDI AND J. MAKOUL. A new error criterion
for posterior probability estimation with neural nets.
International Joint Conference on Neural Nets, San
Diego, 1990.
16. J. I. ARRIBAS AND J. CID-SUEIRO. A model selection
algorithm for a posteriori probability estimation with
neural networks. IEEE Trans. Neural Networks,
16(4):799–809, July 2005.
17. J. I. ARRIBAS. Neural networks for a posteriori prob-
ability estimation: structures and algorithms. Ph.D.
dissertation, Electrical Engineering Department,
Universidad de Valladolid, Spain, 2001.
18. J. CID-Sueiro, J. I. ARRIBAS, S. URBAN-MUNOZ, AND
A. R. FIGUEIRAS-VIDAL. Cost functions to estimate a
posteriori probability in multiclass problems. IEEE
Trans. Neural Networks, 10(3):645–656, May 1999.
19. S. I. AMARI. Backpropagation and stochastic gradient
descent method. Neuro computing, 5:185–196, 1993.
20. B. A. TELFER AND H. H. SZU. Energy functions
for minimizing misclassification error with minimum-
complexity networks. Neural Networks, 7(5):809–818,
1994.
21. J. I. ARRIBAS, J. CID-Sueiro, T. ADALI, AND A. R.
FIGUEIRAS-VIDAL. Neural architectures for parametric
estimation of a posteriori probabilities by constrained
conditional density functions. In Y. H. HU, J. LARSEN,
E. WILSON, AND S. DOUGLAS, Eds., Neural Networks
for Signal Processing. IEEE Signal Processing
Society, Madison, WI, pp. 263–272.
22. K. CHEN, L. XU, AND H. CHI. Improved learning algor-
ithm for mixture of experts in multiclass classification.
Neural Networks, 12:1229–1252, June 1999.
23. V. N. VAPNIK. An overview of statistical learning
theory. IEEE Trans. Neural Networks, 10(5):988–999,
September 1999.
24. R. REED. Pruning algorithms—a survey. IEEE Trans.
Neural Networks, 4(5):740–747, September 1993.
25. J. FRITSCH, M. FINKE, AND A. WAIBEL. Adaptively
growing hierarchical mixture of experts. In Advances
in Neural Information Processing Systems, vol. 6.
Morgan Kaufmann Publishers, CA, 1994.
26. J. L. ALBA, L. DOCIO, D. DOCAMPO, AND O. W.
MARQUEZ. Growing Gaussian mixtures network for
classification applications. Signal Proc., 76(1):43–60,
1999.
27. L. XU, M. I. JORDAN, AND G. E. HINTON. An alternative
model for mixtures of experts. In T. LEEN, G. TESAURO,
AND D. TOURETZKY, Eds., Advances in Neural
Information Processing Systems, vol. 7. MIT Press,
Cambridge, MA, 1995, pp. 633–640.
28. Direccion General de Salud Publica. Population Screen-
ing of Breast Cancer in Spain (in Spanish). Ministerio
de Sanidad y Consumo, Madrid, Spain, 1998.
29. F. WINSBERG, M. ELKIN, J. MACY, V. BORDAZ,
AND W. WEYMOUTH. Detection of radiographic
abnormalities in mammograms by means of optical scan-
ning and computer analysis. Radiology, 89:211–215,
1967.
30. B. ZHENG,W. QIAN, AND L. CLARKE. Digital mammo-
graphy: Mixed feature neural network with spectral
entropy decision for detection of microcalcifications.
IEEE Trans. Med. Imag., 15(5):589–597, October 1996.
31. A. TRISTAN AND J. I. ARRIBAS. A radius and ulna
skeletal age assessment system. In V. CALHOUN and
T. ADALI, Eds, Machine Learning for Signal Proces-
sing, IEEE Signal Processing Society, Mystic, CN,
2005, pp. 221–226.
58 CHAPTER 3 ESTIMATION OF POSTERIOR PROBABILITIES WITH NEURAL NETWORKS