Pattern Recognition and Image Analysis, Vol. 11, No. 3, 2001, pp. 529–541.
Original Text Copyright © 2001 by Pattern Recognition and Image Analysis.

Boosting Multiple Experts by Joint Optimization of Decision Thresholds¹

J. Kittler, Y. Yusoff, W. Christmas, T. Windeatt, and D. Windridge

Center for Vision, Speech and Signal Processing, School of Electronics, Computing, and Mathematics, University of Surrey, Guildford GU2 7XH, United Kingdom
e-mail: [email protected]

Abstract—We consider a multiple classifier system which combines the hard decisions of experts by voting. We argue that the individual experts should not set their own decision thresholds. The respective thresholds should be selected jointly, as this will allow for compensation of the weaknesses of some experts by the relative strengths of the others. We perform the joint optimization of decision thresholds for a multiple expert system by a systematic sampling of the multidimensional decision threshold space. We show the effectiveness of this approach on the important practical application of video shot cut detection.

Received March 18, 2001

¹ This paper was submitted by the authors in English.

1. INTRODUCTION

Among the many combination rules suggested in the literature [1–25, 30], voting is very popular. It operates on class labels assigned to each pattern by the respective experts by hardening their soft decision outputs using the maximum value selector. The vote rule output is a function of the votes received for each class in terms of these single expert class labels.

Many versions of the vote combination rule exist, such as unanimous vote, threshold voting, weighted voting, and simple majority voting [14, 20]. In addition to these basic rules, the authors in [20] propose two voting methods claimed to outperform majority voting. The first method assigns a pattern to a class by unanimous vote; otherwise, the sample is rejected. In the second method, the winning class is the one with the highest vote, provided its vote count exceeds the second largest by a particular threshold. Lam and Suen [14] give a comprehensive analysis of the behavior of the majority vote (Vote) under the assumption of conditional independence of the experts. They show that Vote with an odd number of experts produces the highest recognition rate, while voting with an even number of experts produces a better result only when errors are more costly than rejections.

In this paper, we argue that individual experts should not be allowed to set their own decision thresholds. These should be selected jointly, as in this manner the respective weaknesses of some experts may be compensated for by the relative strengths of the others. We perform the joint optimization of decision thresholds for a multiple expert system by a systematic sampling of the multidimensional decision threshold space. We show the effectiveness of this approach on the important practical application of video shot cut detection. In implementing this, five different experts are deployed to express opinions as to whether the visual content of video material remains the same or has changed from one frame to another.

The paper is organized as follows: in the next section we introduce the necessary formalism and develop the basic theory of classifier combination by joint optimization of thresholds of the voting experts. The application of the proposed methodology to the problem of video shot cut detection is discussed in Section 3. In Section 4, we draw the paper to a conclusion.

2. THEORETICAL FRAMEWORK

Consider a pattern recognition problem where pattern Z is to be assigned to one of the m possible classes {ω_1, …, ω_m}. Let us assume that we make R vector observations x_i, i = 1, …, R, on the given pattern and that the ith measurement vector is the input to the ith expert modality. We shall assume that these observations are provided by different logical sensors. Logical sensors, of course, could also generate features that are correlated. However, for the sake of simplicity, we shall assume that the vectors of measurements extracted by different logical sensors are conditionally statistically independent. Although this assumption does not need to hold fully in practice, the effectiveness of the method depends on the expert outputs exhibiting diversity.

In the measurement space, each class ω_k is modeled by the probability density function p(x_i | ω_k), and its a priori probability of occurrence is denoted by P(ω_k). We shall consider the models to be mutually exclusive, meaning that only one model can be associated with each pattern.

Now, according to Bayesian theory, given measurements x_i, i = 1, …, R, the pattern Z should be assigned to class ω_j (i.e., its label θ should assume



value θ = ω_j), on the condition that the a posteriori probability of that interpretation is maximum; i.e.,

$$\text{assign } \theta \to \omega_j \text{ if } P(\theta = \omega_j \mid x_1, \ldots, x_R) = \max_{k=1,\ldots,m} P(\theta = \omega_k \mid x_1, \ldots, x_R). \quad (1)$$

It has been shown elsewhere [13] that under the assumption of independence the decision rule (1) can be expressed or approximated as

$$\text{assign } \theta \to \omega_j \text{ if } P(\theta = \omega_j) \prod_{i=1}^{R} \frac{P(\theta = \omega_j \mid x_i)}{P(\theta = \omega_j)} = \max_{k=1,\ldots,m} P(\theta = \omega_k) \prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid x_i)}{P(\theta = \omega_k)}, \quad (2)$$

which combines the individual classifier outputs in terms of a product. Under the additional assumption of the sensor measurement information content being low, the decision rule becomes

$$\text{assign } \theta \to \omega_j \text{ if } (1 - R)P(\theta = \omega_j) + \sum_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_{k=1,\ldots,m} \left[ (1 - R)P(\theta = \omega_k) + \sum_{i=1}^{R} P(\theta = \omega_k \mid x_i) \right], \quad (3)$$

which combines the expert outputs in terms of a sum. In (2) and (3), P(θ = ω_k | x_i) is the kth class a posteriori probability computed by each of the R classifiers.

In the following discussion, we shall focus on the benevolent fusion strategy represented by (3). It is referred to as benevolent because it is less sensitive to estimation errors than the product rule in (2), as shown in [13]. The sum rule computation is captured schematically in Fig. 1.
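As a concrete illustration, the product rule (2) and the sum rule (3) amount to a few array operations over a matrix of expert posteriors. The sketch below is ours, not the authors' implementation, and all numeric values are hypothetical:

```python
import numpy as np

def product_rule(posteriors, priors):
    """Product fusion, Eq. (2): P(w_k) * prod_i [P(w_k|x_i) / P(w_k)].

    posteriors: (R, m) array; row i holds expert i's posteriors P(w_k|x_i).
    priors:     (m,) array of class priors P(w_k).
    Returns the index j of the winning class.
    """
    scores = priors * np.prod(posteriors / priors, axis=0)
    return int(np.argmax(scores))

def sum_rule(posteriors, priors):
    """Benevolent sum fusion, Eq. (3): (1 - R)P(w_k) + sum_i P(w_k|x_i)."""
    R = posteriors.shape[0]
    scores = (1 - R) * priors + posteriors.sum(axis=0)
    return int(np.argmax(scores))

# Hypothetical example: R = 3 experts, m = 2 classes.
posteriors = np.array([[0.6, 0.4],
                       [0.7, 0.3],
                       [0.3, 0.7]])
priors = np.array([0.5, 0.5])
print(product_rule(posteriors, priors))  # both rules pick class 0 here
print(sum_rule(posteriors, priors))
```

With equal priors and normalized two-class posteriors, the sum rule reduces to comparing the summed posteriors per class, which is why it is less sensitive to a single badly estimated output than the product.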

Although this fusion rule operates directly on the soft decision outputs of the experts, it has recently been shown in [26] that for heavy-tail error distributions the vote fusion rule, which combines multiple expert outputs hardened by the max operation, gives better performance. This combination rule, shown in Fig. 2, also has the advantage that it can be utilized by multiple classifiers that do not compute or provide access to the a posteriori class probabilities.

Let us consider the vote fusion rule in the two-class case. The ith expert will cast vote v(ω_j | x_i) for class ω_j according to

$$v(\omega_j \mid x_i) = \begin{cases} 1 & \text{if } P(\theta = \omega_j \mid x_i) = \max_{k=1,2} P(\theta = \omega_k \mid x_i), \\ 0 & \text{otherwise.} \end{cases} \quad (4)$$

The vote decision rule is then given as

$$\text{assign } \theta \to \omega_j \text{ if } \sum_{i=1}^{R} v(\omega_j \mid x_i) = \max_{k=1,\ldots,m} \sum_{i=1}^{R} v(\omega_k \mid x_i). \quad (5)$$

[Fig. 1. Sum fusion rule.]

[Fig. 2. Vote fusion rule.]

Note that in Eq. (4) we implicitly use a fixed decision threshold for hardening each expert output before fusion. It is therefore conceivable that the fused system performance could be enhanced by allowing these thresholds to be optimized. In the absence of knowledge of the a posteriori probability distributions, this can be done empirically by sampling the multidimensional parametric space of thresholds and evaluating the performance at each sampled point using an independent set of data. Once these thresholds t_i, i = 1, …, R, have been determined, the fusion rule, schematically represented in Fig. 3, derives the optimal decision using (5) with v(ω_1 | x_i) defined as

$$v(\omega_1 \mid x_i) = \begin{cases} 1 & \text{if } P(\theta = \omega_1 \mid x_i) \ge t_i, \\ 0 & \text{otherwise;} \end{cases} \quad (6)$$

v(ω_2 | x_i) is then simply computed as

$$v(\omega_2 \mid x_i) = 1 - v(\omega_1 \mid x_i). \quad (7)$$

In the practical application of this approach described in detail in the next section, instead of the a posteriori class probabilities, we evaluate a dissimilarity measure d(θ = ω_j | x_i), j = 1, 2. Thus, Eq. (6) is replaced by

$$v(\omega_1 \mid x_i) = \begin{cases} 1 & \text{if } d(\theta = \omega_1 \mid x_i) \le t_i, \\ 0 & \text{otherwise.} \end{cases} \quad (8)$$

In the following section, we apply the proposed method to the problem of video shot cut detection to demonstrate its effectiveness.

3. VIDEO SHOT CUT DETECTION

3.1. Introduction

The partitioning of video sequences into shots is considered an integral part of indexing and annotating video [31–40]. Shot boundary detection is often the starting point in constructing a content-based video indexing system. Its primary aim is to remove the temporal redundancy of frames recording the content of a scene from a relatively stable viewpoint. Shot cut detection is a prerequisite for any attempts to exploit the hierarchical nature of digital video: this hierarchy consists of the whole video sequence at the top, which can be broken down into segments, then scenes, followed by shots and, finally, the individual frames. Apart from the individual frames which make up the video sequence, the shot is the lowest denominator within the hierarchical structure.
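In code, the thresholded two-class vote fusion of Eqs. (5), (7), and (8) reduces to a majority over per-expert threshold tests. A minimal sketch of that rule (ours, with hypothetical dissimilarity and threshold values):

```python
def vote_fusion(dissimilarities, thresholds):
    """Two-class thresholded vote rule, Eqs. (5), (7), (8).

    Expert i casts v(w1|x_i) = 1 ("no change") when its dissimilarity
    d_i is at most its threshold t_i (Eq. (8)); otherwise v(w2|x_i) = 1
    (Eq. (7)).  The fused label is the majority vote of Eq. (5).
    Returns 1 for a shot change, 0 otherwise.
    """
    votes_no_change = sum(1 for d, t in zip(dissimilarities, thresholds) if d <= t)
    votes_change = len(dissimilarities) - votes_no_change
    return 1 if votes_change > votes_no_change else 0

# Hypothetical example: R = 5 experts with jointly chosen thresholds.
d = [1.40, 0.20, 2.90, 35.0, 1.10]   # one dissimilarity value per expert
t = [1.31, 0.17, 2.80, 33.0, 1.24]   # the experts' decision thresholds
print(vote_fusion(d, t))  # 1: four of the five experts exceed their thresholds
```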

In [27], the authors described the shot as the fundamental film component, and Picard [28] described the shot as an unbroken sequence of frames from one camera. The moment of change from one shot to another, the shot boundary, can be created in several ways. The simplest of these is the camera cut; Figure 4 shows an example. Frames (a) and (b) belong to one shot and frames (c) and (d) to another. The contents of frames (b) and (c) are unrelated.

There are other effects used for the demarcation of a shot boundary, such as the cross-fade (dissolve) or zoom. While the change for a shot cut takes place between two frames, other shot boundary effects take place over a number of frames. The number of frames over which the shot change occurs depends on the producer of the video sequence.

In this paper, we restrict ourselves to the study of shot cuts only. There are several approaches to the problem, as discussed in [29]. We can define a shot detection method as a process or system that employs a dissimilarity measure over some feature of the video sequence. Shot cuts are deemed to be detected if an adopted dissimilarity measure computed between two consecutive frames exceeds a specified threshold.

For a shot detection method to be successful, it needs to be as accurate as possible. This accuracy is normally measured in terms of the percentage of true shot changes that it is able to detect, as well as the number of false positives. Needless to say, the choice of a threshold directly affects these performance measures.

[Fig. 3. Vote fusion rule with jointly optimised thresholds.]

[Fig. 4. Example of a camera cut/break.]

In the work reported in [29], we studied a selection of these methods and evaluated their individual strengths and weaknesses. While each method had the capability of performing quite well, there was still scope for improvement. In particular, it became clear that different methods performed well in diverse circumstances. In other words, no simple approach outperformed another in all situations. Rather, the superiority of a particular method was data dependent. This immediately suggested the potential benefit of using these various methods together: to fuse them in such a way that the strengths of each method are consolidated and the weaknesses muted.

3.2. Video Shot Cut Experts

Five separate algorithms are used to detect shot changes. These algorithms calculate different features of the video data and can by themselves be used as stand-alone shot boundary detection systems. These methods are:

1. Average Intensity Measurement [AIM]

We implemented this method based on that suggested in [37]. The algorithm computes the average of the intensity values for each component (YUV, RGB, etc.) in the current frame and compares it with that for the following frame. This is then divided by the value of the comparison of the current frame with that of the previous frame.
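The paper does not spell out the comparison operator, so the following sketch is one plausible reading of the description, taking the comparison of two frames as the mean absolute difference of per-component averages:

```python
import numpy as np

def aim_dissimilarity(prev, curr, nxt, eps=1e-6):
    """Average Intensity Measurement (AIM) -- a sketch of one plausible
    reading of the description above, not the authors' exact formula.

    Each frame is an (H, W, C) array.  The next/current comparison is
    normalised by the current/previous one, so a cut between curr and
    nxt yields a large response.
    """
    def compare(a, b):
        # absolute difference of the average value of each component
        return np.abs(a.mean(axis=(0, 1)) - b.mean(axis=(0, 1))).mean()

    return compare(curr, nxt) / (compare(prev, curr) + eps)

# Hypothetical frames: two similar dark frames, then a bright one (a cut).
prev = np.full((4, 4, 3), 10.0)
curr = np.full((4, 4, 3), 12.0)
nxt = np.full((4, 4, 3), 200.0)
print(aim_dissimilarity(prev, curr, nxt))  # large value at the cut
```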

2. Euclidean Distancing [ED]

For this method, we divide the frames into blocks and perform the discrete cosine transform on each block. In [38], the authors observed that a Euclidean distance measure can be used to calculate the similarity between two images by comparing the mean of the DC values for all the blocks in the frame. Thus, we use the DC coefficients from the DCT calculations for each component (luminance and chrominance) of successive frames for the operation.
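Since the DC coefficient of a DCT block is proportional to the block mean, a simplified sketch can skip the full transform and work with block means directly; this is our shortcut, not the authors' implementation:

```python
import numpy as np

def ed_dissimilarity(frame_a, frame_b, block=8):
    """Euclidean Distancing (ED) -- a sketch under simplifying assumptions.

    Instead of a full DCT, we use per-block means (proportional to the
    DC coefficients).  The dissimilarity is the Euclidean distance
    between the two frames' vectors of per-block, per-component values.
    """
    def dc_values(f):
        h, w, c = f.shape
        # average each non-overlapping block-by-block tile per component
        return f[:h - h % block, :w - w % block].reshape(
            h // block, block, w // block, block, c).mean(axis=(1, 3)).ravel()

    return float(np.linalg.norm(dc_values(frame_a) - dc_values(frame_b)))

# Hypothetical frames: identical frames give zero distance.
a = np.zeros((16, 16, 3))
b = np.ones((16, 16, 3))
print(ed_dissimilarity(a, a))  # 0.0
print(ed_dissimilarity(a, b))
```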

3. Histogram Comparison [HC]

The histogram of a frame gives the distribution of the intensities within the frame. By comparing the histograms of successive frames, we can obtain a measure of their similarity. A number of shot cut detection techniques proposed in the literature are based on histogram comparison [39–41]. Histogram comparison methods are quite popular because they are fast. In addition, some researchers prefer such methods because they are motion insensitive. Our implementation is similar to that detailed in [40]; however, we extended it by including color components as well.
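A sketch of a colour histogram comparison; the bin count and the bin-wise absolute difference below are our assumptions, as the paper does not specify them:

```python
import numpy as np

def hc_dissimilarity(frame_a, frame_b, bins=64):
    """Histogram Comparison (HC) -- a sketch.  The comparison is taken
    here as the sum of absolute bin differences over all colour
    components, normalised by the frame size.
    """
    total = 0.0
    for c in range(frame_a.shape[2]):
        ha, _ = np.histogram(frame_a[..., c], bins=bins, range=(0, 256))
        hb, _ = np.histogram(frame_b[..., c], bins=bins, range=(0, 256))
        total += np.abs(ha - hb).sum()
    # normalise by frame size so the measure is resolution independent
    return total / frame_a[..., 0].size

# Hypothetical frames: all-black versus all-white.
a = np.zeros((8, 8, 3))
b = np.full((8, 8, 3), 255.0)
print(hc_dissimilarity(a, a))  # 0.0
print(hc_dissimilarity(a, b))
```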

4. Likelihood Ratio [LH]

This algorithm [42] generates a measure of the likelihood that two corresponding regions are similar. Each region is represented by second-order statistics under the assumption that this property remains constant over the region. We divide the frames into blocks and carry out the likelihood ratio calculation over the blocks.
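One common second-order-statistics form of the block likelihood ratio from the shot detection literature is sketched below; we do not claim it is exactly the variant of [42] used by the authors. Identical blocks give the minimum value 1, and larger values indicate dissimilarity, which matches the scale of the LH response graphs later in the paper:

```python
import numpy as np

def block_likelihood_ratio(block_a, block_b, eps=1e-6):
    """Likelihood ratio between two corresponding blocks, represented
    by their second-order statistics (mean and variance).  A common
    form from the literature, not necessarily the authors' variant.
    """
    mu_a, mu_b = block_a.mean(), block_b.mean()
    var_a, var_b = block_a.var() + eps, block_b.var() + eps
    num = ((var_a + var_b) / 2 + ((mu_a - mu_b) / 2) ** 2) ** 2
    return num / (var_a * var_b)

def lh_dissimilarity(frame_a, frame_b, block=8):
    """Average the per-block likelihood ratios over a frame pair."""
    h, w = frame_a.shape[:2]
    ratios = [block_likelihood_ratio(frame_a[y:y + block, x:x + block],
                                     frame_b[y:y + block, x:x + block])
              for y in range(0, h - block + 1, block)
              for x in range(0, w - block + 1, block)]
    return float(np.mean(ratios))

a = np.arange(64.0).reshape(8, 8)  # hypothetical grayscale frame
print(lh_dissimilarity(a, a))      # 1.0: identical frames give the minimum
```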

5. Motion Estimation / Prediction Error [ME]

In this method, we estimate the next frame in a video sequence based on the motion information in the current frame. Then, we reconstruct the next frame using the motion estimation vectors. Essentially, we predict what the next frame would look like given the information that we have. The prediction error of the reconstructed frame gives us a measure of how far off our prediction is. Here, the motion estimation is similar to that used in current video coding standards. For our implementation, we used the block-based n-step search algorithm for a ±2^n search window as described in [43]. After the motion estimation is done, the motion vectors are used to construct the next frame in the sequence. To obtain the prediction error, the absolute difference between the reconstructed frame and the original frame is calculated and summed.
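A simplified sketch of the motion-compensated prediction error: for brevity we replace the paper's n-step search with a small exhaustive search per block, which is slower but conceptually equivalent within its window. All sizes are hypothetical:

```python
import numpy as np

def me_prediction_error(curr, nxt, block=8, radius=2):
    """Motion-compensated prediction error (ME) -- a simplified sketch.

    Each block of the next frame is predicted by its best-matching
    block in the current frame, searched exhaustively within +/-radius
    pixels; the summed absolute prediction error is returned.  A shot
    cut leaves a large residual that no motion vector can explain.
    """
    h, w = curr.shape[:2]
    total_error = 0.0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            target = nxt[y:y + block, x:x + block]
            best = np.inf
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy <= h - block and 0 <= sx <= w - block:
                        cand = curr[sy:sy + block, sx:sx + block]
                        best = min(best, float(np.abs(target - cand).sum()))
            total_error += best
    return total_error

# Hypothetical frames: a vertical line that moves one pixel to the left.
curr = np.zeros((16, 16)); curr[:, 5] = 1.0
nxt = np.zeros((16, 16)); nxt[:, 4] = 1.0
print(me_prediction_error(curr, nxt))  # 0.0: the motion search absorbs the shift
```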

As mentioned earlier, these methods calculate a dissimilarity measure which we can use to make shot cut decisions. Therefore, the dissimilarity measure is the response of the system when given two consecutive frames as input. We can then plot a response graph for the set of values generated by each method over a whole video sequence (Fig. 5). The rationale behind using a dissimilarity measure is that we expect two consecutive frames belonging to the same shot to have a low value. Conversely, for two consecutive frames of different shots, the dissimilarity measure should be large. Figure 5 shows an example of this: the peaks in the graph are points where the dissimilarity measure values are high, and these correspond to a high probability that these points are shot change boundaries.

Each of these methods employs a global threshold at the final stage of the processing to make the shot change boundary decision. The choice of the thresholds is not easily specified. In view of this, we constructed a receiver operating characteristic (ROC) curve for each method, obtained by setting the thresholds to various possible values.

[Fig. 5. Example of a response graph using the Likelihood Ratio method.]


Table 1 shows the data set used for the experiments. The CARTOON sequence is a collection of cartoon animations. The CHF sequence is a children's program. The DV sequence is a collection of daytime soap operas and documentaries. The RUGBY sequence is a rugby union match. The SKY sequence is a news program, and the SUPERMAN sequence is a clip from the TV show "Superman." The sequences CHF, DV, SKY, and SUPERMAN are what we term "real-world" sequences. We exclude CARTOON from this grouping for obvious reasons, and RUGBY since it is a high-speed, rapidly shot-changing sports program. We will explain our rationale for this division in later sections.

To construct the ROC curves, we calculate the proportion of undetected true shot boundaries p_u against the proportion of incorrectly identified shot boundaries p_f:

$$p_u = \frac{S_u}{S_a}, \quad (9)$$

$$p_f = \frac{S_f}{S_a}, \quad (10)$$

where S_u is the number of undetected true shot boundaries, S_f is the number of falsely identified ones, and S_a is the number of actual shot boundaries.

[Fig. 6. The ROC curves for the video sequences: (a) Cartoon, (b) CHF, (c) DV, (d) Rugby, (e) SKY, (f) Superman.]

We then set the thresholds to different values and plot p_u against p_f. Note that none of the experts require any parameters to be set (apart from the threshold) and, therefore, none need any training. We plot the ROC curves for each of the five algorithms on the video sequences (Fig. 6). With the help of the ROC curves, we can decide on a particular operating point and obtain the corresponding threshold for each expert. For example, in terms of equal error performance, we would choose the threshold that corresponds to the point on the curve nearest to the origin, since we would ideally want to minimize both p_u and p_f.
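Eqs. (9) and (10), together with the nearest-to-origin selection of an equal error operating point, can be sketched as follows; the response and ground-truth values are hypothetical:

```python
import numpy as np

def roc_point(responses, truth, threshold):
    """Compute (p_u, p_f) of Eqs. (9) and (10) for one threshold.

    responses: per-frame-pair dissimilarity values.
    truth:     1 where a true shot boundary lies, else 0.
    A boundary is declared where the response exceeds the threshold.
    """
    detected = responses > threshold
    s_a = truth.sum()                         # actual boundaries
    s_u = ((~detected) & (truth == 1)).sum()  # undetected true boundaries
    s_f = (detected & (truth == 0)).sum()     # falsely identified ones
    return s_u / s_a, s_f / s_a

def equal_error_threshold(responses, truth, candidates):
    """Pick the candidate threshold whose ROC point lies nearest the
    origin, i.e. the (approximate) equal error operating point."""
    pts = [roc_point(responses, truth, t) for t in candidates]
    dists = [pu * pu + pf * pf for pu, pf in pts]
    return candidates[int(np.argmin(dists))]

# Hypothetical response trace with two true boundaries.
responses = np.array([0.1, 2.0, 0.2, 1.8, 0.3])
truth = np.array([0, 1, 0, 1, 0])
print(equal_error_threshold(responses, truth, [0.05, 1.0, 2.5]))
```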

We can observe that there is a variation in the performance of the algorithms over the different sequences. We can expose these variations further by showing the points at which the algorithms operate given the same threshold value, as demonstrated in Fig. 7.

[Fig. 7. Plots of threshold values for the different sequences: (a) AIM, threshold = 2.80; (b) ED, threshold = 1.31; (c) HC, threshold = 0.170; (d) LH, threshold = 1.260; (e) ME, threshold = 33.0.]


For the four "real-world" sequences, the best individual expert would be either the HC method or the LH method. The observation of this inconsistent behavior led other authors to exploit multiple experts in tandem. For example, two methods were used in [44] to create a generalized sequence trace of the video sequence. The two features used were the luminance histogram difference and the standard deviation difference. This trace is defined as the sum of the square root of the difference between two frames for each feature. Then, using a technique based on mathematical morphology, the authors constructed what they termed a morphological Laplacian graph for the sequence. The morphological Laplacian algorithm essentially calculates the difference between the gradient of dilation and the gradient of erosion for the generalized sequence trace, which corresponds to an approximation to the second derivative of the aforementioned trace. The zero crossings on the graph indicated shot boundaries. A threshold was applied to distinguish between zero crossings due to true shot boundaries and noise.

In [45], the authors proposed a two-step shot detection strategy whereby a histogram comparison method was used in the first step and a likelihood ratio method was selectively used in the second step. In their implementation, the histogram comparison results were subject to two thresholds, T_H and T_L, T_H being the higher threshold. If the comparison result was higher than T_H, a cut was declared immediately. If, on the other hand, it was lower than T_H but higher than T_L, a likelihood ratio operation was carried out. If the result of the likelihood ratio was above a set threshold T_R, then a cut was declared.

In another work [46], a shot detection scheme employing two algorithms was described. Using histogram comparison as well as a pixelwise differencing algorithm (similar to AIM), the authors employed a K-means clustering algorithm to classify the results into two clusters. Following this, an elimination step based on a heuristic observation was employed to reduce the number of false positives.

Table 1. Video sequences used in the experiments

Format: QCIF (176 × 144), YUV 4:2:0; frame rate: 25 fps

name        no. of frames   time, min   no. of shot cuts
CARTOON     41750           27.8        256
CHF         36007           24          199
DV          36000           24          268
RUGBY       40490           27          257
SKY         45000           30          289
SUPERMAN    44921           29.9        404
Total                                   1673

[Fig. 8. Example of response graphs to the LH and ME algorithms I: (g) LH, threshold = 1.28; (h) ME, threshold = 32; frames (a)–(f) shown above the graphs.]


[Fig. 9. Example of response graphs to the LH and ME algorithms II: (g) LH, threshold = 1.28; (h) ME, threshold = 32; frames (a)–(f) shown above the graphs.]

The three works mentioned above [44–46] thus do not generalize to the combination of arbitrary experts.

3.3. Experimental Results

The graphs in Figs. 8 and 9 show the responses of the LH and ME algorithms over different sections of the same video sequence. The pairwise frames (a) and (b), (c) and (d), and (e) and (f) correspond to the respective shot boundary peaks A, B, and C on the graphs. The horizontal line shows our optimum equal error threshold for each algorithm. In Fig. 8, we can see that there are two peaks (B and C) above the threshold in LH that are below the threshold in ME. Correspondingly, in Fig. 9, the opposite holds true: there are two peaks (again B and C) above the threshold in ME that are under the threshold in LH.

It is also worth noting that in both examples there are peaks (marked as A) that are below the thresholds of both algorithms. Our manual inspection shows that these peaks are at bona fide shot boundaries. The figures are representative of the situations we would like to improve upon by using experts cooperatively: detecting shot boundaries such as B and C, where not all of the experts are in agreement, as well as using suboptimal threshold positions to detect shot boundaries such as those of type A.

This requires the joint optimization of the thresholds (operating points) for each of the cooperating experts. Since the space of all the operating points is n-dimensional and therefore difficult to explore exhaustively, we optimize the CME (Cooperating Multiple Experts) by sampling the space quite coarsely. To this end, we select the threshold values at five points on the ROC curve for each of the algorithms. These threshold values are taken at the points p1, p2, p3, p4, and p5, as shown in Fig. 10.

[Fig. 10. Threshold values at the points p_n, n = 1, …, 5. (DV sequence.)]


We use a base-N codeword designation to identify our thresholds. Since there are n experts, each with N possible thresholds, this leaves us with N^n possible combinations of threshold values. For our N = 5 and n = 5, this gives us 3125 possible combinations. For example, from Table 2, the value 04322 would mean that we use the results from the AIM method using a threshold value of 1.5, ED using 5.91, HC using 0.170, LH using 1.240, and ME using 33.0.

Each of the individual methods would signal a shot change at values above the given threshold. The CME method would signal a shot change only when the majority of the algorithms signal a shot change.
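The codeword scheme and the CME majority rule can be sketched as follows, using the threshold values of Table 2; the sample responses are hypothetical:

```python
from itertools import product

# Threshold tables per expert, indexed n = 0..4, as in Table 2.
THRESHOLDS = {
    "AIM": [1.5, 1.9, 2.8, 5.15, 13.2],
    "ED":  [0.78, 0.88, 1.31, 2.11, 5.91],
    "HC":  [0.095, 0.111, 0.170, 0.285, 0.452],
    "LH":  [1.105, 1.160, 1.240, 1.560, 2.220],
    "ME":  [25.5, 29.5, 33.0, 39.5, 55.0],
}
EXPERTS = ["AIM", "ED", "HC", "LH", "ME"]

def decode(codeword):
    """Map a base-5 codeword such as '04322' to per-expert thresholds."""
    return {e: THRESHOLDS[e][int(d)] for e, d in zip(EXPERTS, codeword)}

def cme_decision(responses, thresholds):
    """Signal a shot change when the majority of experts exceed their
    thresholds (the CME rule described above)."""
    votes = sum(1 for e in EXPERTS if responses[e] > thresholds[e])
    return votes > len(EXPERTS) // 2

# Enumerate all N^n = 5^5 = 3125 threshold combinations for the search.
all_codewords = ["".join(map(str, combo)) for combo in product(range(5), repeat=5)]
print(len(all_codewords))     # 3125
print(decode("04322")["ED"])  # 5.91
```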

As we wish to find an optimal setting of expert operating points to get the best CME performance, we require training data. We used the DV sequence as our training set; i.e., we took the values of the thresholds from the ROC curves generated from this set. The CME algorithm was then applied to the training set as well as the remaining data sets.

[Fig. 11. Close-up of the CME algorithm against the ROC curves: (a) Cartoon, (b) CHF, (c) DV, (d) Rugby, (e) SKY, (f) Superman.]

For each of the combinations, we calculate values for p_u and p_f. Figure 11 is a plot of the CME algorithm against the ROC curves that we had constructed earlier, showing the region near the origin. Each point on the graph corresponds to a different CME combination.

From Fig. 7, noting that the operating points for a given threshold vary from one sequence to another, we would expect the performance of the CME algorithm to display the same behavior. This is demonstrated in Fig. 11, for example, in the CHF sequence, where the CME cluster is shifted up and to the left of the graph compared to DV and SKY. For the CARTOON and RUGBY sequences, the variation is even more pronounced.

We can also see in Fig. 11 that, in the case of the real-world sequences, a significant number of points from the CME algorithm are nearer to the origin than those of any of the individual experts. Since the aim is to reduce the equal error rate by multiple expert fusion, it is clear that the proposed scheme achieves that objective for these sequences.

Looking more closely at the CARTOON sequence, it is observed that the threshold values for a given oper-

Table 2. Threshold values used in the CME algorithm

n        0       1       2       3       4
AIM      1.5     1.9     2.8     5.15    13.2
ED       0.78    0.88    1.31    2.11    5.91
HC       0.095   0.111   0.170   0.285   0.452
LH       1.105   1.160   1.240   1.560   2.220
ME       25.5    29.5    33.0    39.5    55.0

[Figure: operating points for each sequence (Cartoon, CHF, DV, Rugby, SKY, Superman), plotted as false positives Pf vs. undetected true positives Pu and labeled with threshold values.]

Fig. 12. Operating points.

[Figure: ROC curves of AIM, ED, HC, LH, ME, and CMB for the combined CHF + DV + SKY + Superman data; axes: false positives Pf vs. undetected true positives Pu.]

Fig. 13. Combined results of the “real-world” sequences.



ating point for each algorithm are higher in comparison to those of the other sequences. This is further illustrated in Fig. 12. As such, the CME algorithm is less efficient in this case, with only a few points demonstrating any advantage over the individual experts. In the case of the RUGBY sequence, the performance of the HC algorithm far outstripped the other experts. In addition, the LH and ME algorithms performed especially badly for this particular sequence. Consequently, our experiments demonstrated that the CME algorithm is unable to improve on the performance displayed by HC.

In view of this, we can conclude that the CME algorithm is capable of achieving a greater performance gain in detecting shot changes compared to individual experts when applied to "real-world" sequences.

Since each operating point in the CME algorithm represents a combination of operating points from the individual experts, we still need to find an optimal combination. To do this, we combined the results of the CME algorithms and the individual experts for the four sequences. We then plotted the CME against the ROC curves again (Fig. 13).

Having found the optimal operating point, we apply the combination to the individual sequences again.

From Fig. 14 it is evident that all the points produce better results than any of the individual algorithms. Even though the improvement is not as dramatic for all the sequences (Fig. 14d being a case in point), we have demonstrated that no "tuning" is required to get a good performance. Thus, it is shown that the CME algorithm is also robust towards slight shifts in operating points.

4. CONCLUSIONS

We considered a multiple classifier system which combines the hard decisions of experts by voting. We argued that the individual experts should not set their own decision thresholds. The respective thresholds should be selected jointly, as this allows compensation of the weaknesses of some experts by the relative strengths of the others. We performed the joint optimization of decision thresholds for a multiple expert system by a systematic sampling of the multidimensional decision threshold space. We showed the effectiveness of this approach on the important practical application of video shot cut detection. In this application, five different experts were deployed to express opinions as to whether the visual content of video material had remained the same or had changed shot from one frame

[Figure: four ROC panels, (a) CHF, (b) DV, (c) SKY, (d) Superman; axes: false positives Pf vs. undetected true positives Pu; curves: AIM, ED, HC, LH, ME, CMB, with the selected operating point CME-34100 marked in each panel.]

Fig. 14. Optimal CME operating point.


to another. We demonstrated that the proposed approach significantly increased the true positive shot cut detection rate while reducing the false positive rate (this being the criterion of performance improvement).

ACKNOWLEDGMENTS

The support via EPSRC Grants GR/L61095 and GR/M61320 and via EU Framework V Project Assavid is gratefully acknowledged.

REFERENCES

1. Alexandre, L., Campilho, A., and Kamel, M., Combining Independent and Unbiased Classifiers Using Weighted Average, Proc. ICPR15, vol. 2, IEEE, 2000, no. 9, pp. 495–498.

2. Ali, K. and Pazzani, M., On the Link between Error Correlation and Error Reduction in Decision Tree Ensembles, Technical Report 95-38, University of California at Irvine, 1995.

3. Alkoot, F.M. and Kittler, J., Multiple Expert System Design by Combined Feature Selection and Probability Level Fusion, Proc. Fusion 2000 Conf., Paris, 2000, vol. 7.

4. Bauer, E. and Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants, Machine Learning, 1998, pp. 1–38.

5. Breiman, L., Bagging Predictors, Machine Learning, 1996, vol. 24, pp. 123–140.

6. Dietterich, T., An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization, Machine Learning, 1998, pp. 1–22.

7. Duin, R.P.W. and Tax, D.M.J., Experiments with Classifier Combining Rules, in Multiple Classifier Systems, Kittler, J. and Roli, F., Eds., Springer, 2000, pp. 16–29.

8. Friedrich, C.M., Ensembles of Evolutionary Created Artificial Neural Networks and Nearest Neighbour Classifiers, in Roy, R., Furuhashi, T., and Chawdhry, P.K., Eds., Advances in Soft Computing, Springer, 1998, vol. 6.

9. Hansen, L.K. and Salamon, P., Neural Network Ensembles, IEEE Trans. Pattern Analysis and Machine Intelligence, 1990, vol. 12(10), pp. 993–1001.

10. Hashem, S. and Schmeiser, B., Improving Model Accuracy Using Optimal Linear Combination of Trained Neural Networks, IEEE Trans. Neural Networks, 1995, vol. 6(3), pp. 792–794.

11. Ho, T.K., Hull, J.J., and Srihari, S.N., Decision Combination in Multiple Classifier Systems, IEEE Trans. Pattern Analysis and Machine Intelligence, 1994, vol. 16(1), pp. 66–75.

12. Kittler, J., Combining Classifiers: A Theoretical Framework, Pattern Analysis and Applications, 1998, vol. 1, pp. 18–27.

13. Kittler, J., Hatef, M., Duin, R., and Matas, J., On Combining Classifiers, IEEE Trans. Pattern Analysis and Machine Intelligence, 1998, vol. 20(3), pp. 226–239.

14. Lam, L. and Suen, C., Application of Majority Voting to Pattern Recognition: An Analysis of Its Behaviour and Performance, IEEE Trans. Systems, Man, and Cybernetics, Part A: Systems and Humans, 1997, vol. 27(5), pp. 553–568.

15. Quinlan, J., Bagging, Boosting and C4.5, Proc. 13th National Conf. on Artificial Intelligence, Portland, OR, AAAI, Menlo Park, CA, 1996, vol. 1, pp. 725–730.

16. Rahman, A.F.R. and Fairhurst, M.C., Enhancing Multiple Expert Decision Combination Strategies through Exploitation of a priori Information Sources, IEE Proc. Vision, Image, and Signal Processing, 1999, vol. 146(1), pp. 40–49.

17. Sharkey, A.J.C., On Combining Artificial Neural Nets, Connection Science, 1996, vol. 8(3), pp. 299–314.

18. Skalak, D.B., Prototype Selection for Composite Nearest Neighbor Classifiers, PhD Thesis, Department of Computer Science, Univ. of Massachusetts at Amherst, 1997.

19. Suen, C., Legault, R., Nadal, C., Cheriet, M., and Lam, L., Building a New Generation of Handwriting Recognition Systems, Pattern Recognition Letters, 1993, vol. 14, pp. 303–315.

20. Xu, L., Krzyzak, A., and Suen, C.Y., Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition, IEEE Trans. Systems, Man, and Cybernetics, 1992, vol. 22(3), pp. 418–435.

21. Yu, K., Jiang, X., and Bunke, H., Lipreading: A Classifier Combination Approach, Pattern Recognition Letters, 1997, vol. 18(11–13), pp. 1421–1426.

22. Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., Classification and Regression Trees, Wadsworth, California, 1984.

23. Kittler, J., Combining Classifiers: A Theoretical Framework, Pattern Analysis and Applications, 1998, vol. 1, pp. 18–27.

24. Wolpert, D.H., Stacked Generalization, Neural Networks, 1992, vol. 5, pp. 241–260.

25. Woods, K.S., Bowyer, K., and Kegelmeyer, W.P., Combination of Multiple Classifiers Using Local Accuracy Estimates, Proc. CVPR96, 1996, pp. 391–396.

26. Kittler, J. and Alkoot, F., Relationship of Sum and Vote Fusion Strategies, Proc. Workshop on Multiple Classifier Systems, 2001 (in press).

27. Davenport, G., Smith, T.A., and Pincever, N., Cinematic Primitives for Multimedia, IEEE Computer Graphics and Applications, 1991, pp. 67–74.

28. Picard, R.W., Light-years from Lena: Video and Image Libraries of the Future, Proc. IEEE Int. Conf. Image Processing, 1995, vol. I, pp. 310–317.

29. Yusoff, Y., Christmas, W., and Kittler, J., A Study on Automatic Shot Change Detection, Proc. 3rd European Conf. Multimedia Applications, Services, and Techniques (ECMAST), Hutchison, D. and Schafer, R., Eds., no. 1425 in LNCS, Springer, May 1998, pp. 177–189.

30. Huang, T.S. and Suen, C.Y., Combination of Multiple Experts for the Recognition of Unconstrained Handwritten Numerals, IEEE Trans. Pattern Analysis and Machine Intelligence, 1995, vol. 17, pp. 90–94.

31. Zabih, R., Miller, J., and Mai, K., A Feature-based Algorithm for Detecting and Classifying Scene Breaks, Proc. ACM Multimedia'95, 1995, pp. 189–200.

32. Zabih, R., Miller, J., and Mai, K., A Feature-based Algorithm for Detecting and Classifying Production Effects, ACM Multimedia Systems, 1999, vol. 7, no. 2, pp. 119–128.

33. Hanjalic, A. and Zhang, H., Optimal Shot Boundary Detection Based on Robust Statistical Models, Proc. 6th Int. Conf. on Multimedia Computing and Systems (ICMCS), Florence, IEEE, 1999, vol. 2, pp. 710–714.


34. Sethi, I.K. and Patel, N., A Statistical Approach to Scene Change Detection, Proc. IS&T/SPIE Storage and Retrieval for Image and Video Databases III, 1995, vol. 2420, pp. 329–337.

35. Ngo, C.W., Pong, T.C., and Chin, R.T., Camera Break Detection by Partitioning of 2D Spatiotemporal Images in MPEG Domain, Proc. 6th Int. Conf. on Multimedia Computing and Systems (ICMCS), Florence, IEEE, 1999, vol. 2, pp. 750–755.

36. Kim, H., Park, S.-J., Kim, W.M., and Song, S.M.-H., Processing of Partial Video Data for Detection of Wipes, Proc. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases VII, San Jose, CA, 1999, vol. 3656, pp. 280–289.

37. Hampapur, A., Jain, R., and Weymouth, T., Digital Video Segmentation, Proc. ACM Multimedia'94, ACM Press, 1994, pp. 357–364.

38. Vellaikal, A. and Kuo, C.-C.J., Joint Spatial-Spectral Indexing for Image Retrieval, Proc. IEEE Int. Conf. Image Processing, 1996, pp. 867–870.

39. Nagasaka, A. and Tanaka, Y., Automatic Video Indexing and Full-Video Search for Object Appearances, Visual Database Systems II, 1992, pp. 113–127.

40. Zhang, H., Kankanhalli, A., and Smoliar, S.W., Automatic Partitioning of Full-motion Video, Multimedia Systems, Springer, 1993, vol. 1, pp. 10–28.

41. Lienhart, R., Comparison of Automatic Shot Boundary Detection Algorithms, Proc. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases VII, San Jose, CA, 1999, vol. 3656, pp. 290–301.

42. Kasturi, R. and Jain, R., Eds., Computer Vision: Principles, IEEE Computer Society, 1991.

43. Tekalp, A.M., Digital Video Processing, Prentice-Hall, 1995.

44. Taskiran, C. and Delp, E.J., Video Scene Change Detection Using the Generalized Sequence Trace, Proc. IEEE Int. Conf. Image Processing, 1998, pp. 2961–2964.

45. Dugad, R., Ratakonda, K., and Ahuja, N., Robust Video Shot Change Detection, IEEE Workshop on Multimedia Signal Processing, 1998.

46. Naphade, M.R., Mehrotra, R., Ferman, A.M., Warnick, J., Huang, T.S., and Tekalp, A.M., A High-performance Shot Boundary Detection Algorithm Using Multiple Cues, Proc. IEEE Int. Conf. Image Processing, 1998, vol. 2, pp. 884–887.

Josef Kittler. Graduated from the University of Cambridge in Electrical Engineering in 1971. Obtained his PhD in Pattern Recognition in 1974 and the ScD degree in 1991, both from the University of Cambridge. Professor at the Department of Electronic and Electrical Engineering of Surrey University, in charge of the Center for Vision, Speech and Signal Processing. His current research interests include pattern recognition, image processing, and computer vision. Author of more than 400 papers and coauthor of a book. Member of the Editorial Boards of Pattern Recognition Journal; Image and Vision Computing; Pattern Recognition Letters; Pattern Recognition and Artificial Intelligence; Pattern Analysis and Applications; and Machine Vision and Applications.

Bill Christmas. Obtained his PhD in Mathematics from the University of Surrey. Holds a University Fellowship in Technology Transfer in the Center for Vision, Speech, and Signal Processing at the University of Surrey. After studying Engineering Science at the University of Oxford, he spent some years with the British Broadcasting Corporation as a Research Engineer. He then moved to BP Research International as a Senior Research Engineer, working on research topics that included hardware aspects of parallel processing, real-time image processing, and computer vision. Other scientific interests include integration of machine vision algorithms to create complete applications. Author of more than 20 papers. Currently he is working on projects concerned with region-based video coding and automated, content-based annotation of video and multimedia material.

Yusseri Yusoff. Graduated from the University of Essex in 1997. He is currently a PhD student at the Center for Vision, Speech, and Signal Processing of the University of Surrey, working on video processing.

David Windridge. Obtained his BSc degree in Physics from the University of Durham in 1993 and PhD in Astronomy from the University of Bristol in 1999. He is now a Research Fellow at the Center for Vision, Speech, and Signal Processing, University of Surrey, where he is working on problems in multiple classifier fusion.

Terry Windeatt. Received the BSc degree in Applied Science from the University of Sussex, MSc degree in Electronic Engineering from the University of California, and PhD degree from the University of Surrey. After lecturing in Control Engineering at Kingston University, UK, he worked in the USA on intelligent systems at the Research and Development Departments of General Motors and Xerox Corporation in Rochester, NY (1976–1984). His industrial R&D experience is in modeling/simulation for intelligent automotive and office-copying applications. Now lectures in Machine Intelligence at the Department of Electrical and Electronic Engineering at the University of Surrey. He has worked on various research projects in the Center for Vision, Speech, and Signal Processing, and his current research interests include neural nets, pattern recognition, and computer vision.