Logistic Regression, AdaBoost and Bregman Distances

Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.

Logistic Regression, AdaBoost and Bregman Distances

Michael Collins
AT&T Labs Research, Shannon Laboratory
180 Park Avenue, Room A253, Florham Park, NJ 07932
[email protected]

Robert E. Schapire
AT&T Labs Research, Shannon Laboratory
180 Park Avenue, Room A203, Florham Park, NJ 07932
[email protected]

Yoram Singer
School of Computer Science & Engineering
Hebrew University, Jerusalem 91904, Israel
[email protected]

Abstract. We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with preliminary experimental results.

1 INTRODUCTION

We give a unified account of boosting and logistic regression in which we show that both learning problems can be cast in terms of optimization of Bregman distances.
In our framework, the two problems become extremely similar, the only real difference being in the choice of Bregman distance: unnormalized relative entropy for boosting, and binary relative entropy for logistic regression.

The fact that the two problems are so similar in our framework allows us to design and analyze algorithms for both simultaneously. We are now able to borrow methods from the maximum-entropy literature for logistic regression and apply them to the exponential loss used by AdaBoost, especially convergence-proof techniques. Conversely, we can now easily adapt boosting methods to the problem of minimizing the logistic loss used in logistic regression. The result is a family of new algorithms for both problems together with convergence proofs for the new algorithms as well as AdaBoost.

For both AdaBoost and logistic regression, we attempt to choose the parameters or weights associated with a given family of functions called features or weak hypotheses. AdaBoost works by sequentially updating these parameters one by one, whereas methods for logistic regression, most notably iterative scaling [9, 10], are iterative but update all parameters in parallel on each iteration.

Our first new algorithm is a method for optimizing the exponential loss using parallel updates. It seems plausible that a parallel-update method will often converge faster than a sequential-update method, provided that the number of features is not so large as to make parallel updates infeasible. Some preliminary experiments suggest that this is the case.

Our second algorithm is a parallel-update method for the logistic loss. Although parallel-update algorithms are well known for this function, the updates that we derive are new, and preliminary experiments indicate that these new updates may also be much faster.
Because of the unified treatment we give to the exponential and logistic loss functions, we are able to present and prove the convergence of the algorithms for these two losses simultaneously. The same is true for the other algorithms presented in this paper as well.

We next describe and analyze sequential-update algorithms for the two loss functions. For exponential loss, this algorithm is equivalent to the AdaBoost algorithm of Freund and Schapire [13]. By viewing the algorithm in our framework, we are able to prove that AdaBoost correctly converges to the minimum of the exponential loss function. This is a new result: Although Kivinen and Warmuth [16] and Mason et al. [19] have given convergence proofs for AdaBoost, their proofs depend on assumptions about the given minimization problem which may not hold in all cases. Our proof holds in general without assumptions.

Our unified view leads instantly to a sequential-update algorithm for logistic regression that is only a minor modification of AdaBoost and which is very similar to one proposed by Duffy and Helmbold [12]. Like AdaBoost, this algorithm can be used in conjunction with any classification algorithm, usually called the weak learning algorithm, that can accept a distribution over examples and return a weak hypothesis with low error rate with respect to the distribution. However, this new algorithm provably minimizes the logistic loss rather than the arguably less natural exponential loss used by AdaBoost.

Another potentially important advantage of this new algorithm is that the weights that it places on examples are bounded in $[0, 1]$. This suggests that it may be possible to use the new algorithm in a setting in which the boosting algorithm selects examples to present to the weak learning algorithm by filtering a stream of examples (such as a very large dataset).
As pointed out by Watanabe [22] and Domingo and Watanabe [11], this is not possible with AdaBoost since its weights may become extremely large. They provide a modification of AdaBoost for this purpose in which the weights are truncated at 1. The new algorithm may be a viable and cleaner alternative.


We next describe a parameterized family of iterative algorithms that includes both parallel- and sequential-update algorithms and that also interpolates smoothly between the two extremes. The convergence proof that we give holds for this entire family of algorithms.

Although most of this paper considers only the binary case in which there are just two possible labels associated with each example, it turns out that the multiclass case requires no additional work. That is, all of the algorithms and convergence proofs that we give for the binary case turn out to be directly applicable to the multiclass case without modification.

For comparison, we also describe the generalized iterative scaling algorithm of Darroch and Ratcliff [9]. In rederiving this procedure in our setting, we are able to relax one of the main assumptions usually required by this algorithm.

The paper is organized as follows: Section 2 describes the boosting and logistic regression models as they are usually formulated. Section 3 gives background on optimization using Bregman distances, and Section 4 then describes how boosting and logistic regression can be cast within this framework. Section 5 gives our parallel-update algorithms and proofs of their convergence, while Section 6 gives the sequential-update algorithms and convergence proofs. The parameterized family of iterative algorithms is described in Section 7. The extension to multiclass problems is given in Section 8. In Section 9, we contrast our methods with iterative scaling. In Section 10, we give some preliminary experiments.

Previous work. Variants of our sequential-update algorithms fit into the general family of "arcing" algorithms presented by Breiman [4, 3], as well as Mason et al.'s "AnyBoost" family of algorithms [19]. The information-geometric view that we take also shows that the algorithms we study, including AdaBoost, fit into a family of algorithms described in 1967 by Bregman [2] for satisfying a set of constraints.

Our work is based directly on the general setting of Lafferty, Della Pietra and Della Pietra [18] in which one attempts to solve optimization problems based on general Bregman distances. They gave a method for deriving and analyzing parallel-update algorithms in this setting through the use of auxiliary functions. All of our algorithms and convergence proofs are based on this method.

Our work builds on several previous papers which have compared boosting approaches to logistic regression. Friedman, Hastie and Tibshirani [14] first noted the similarity between the boosting and logistic regression loss functions, and derived the sequential-update algorithm LogitBoost for the logistic loss. However, unlike our algorithm, theirs requires that the weak learner solve least-squares problems rather than classification problems. Another sequential-update algorithm for a different but related problem was proposed by Cesa-Bianchi, Krogh and Warmuth [5].

Duffy and Helmbold [12] gave conditions under which a loss function gives a boosting algorithm. They showed that minimizing logistic loss does lead to a boosting algorithm in the PAC sense, which suggests that our algorithm for this problem, which is very close to theirs, may turn out also to have the PAC boosting property.

Lafferty [17] went further in studying the relationship between logistic regression and the exponential loss through the use of a family of Bregman distances. However, the setting described in his paper apparently cannot be extended to precisely include the exponential loss. The use of Bregman distances that we describe has important differences leading to a natural treatment of the exponential loss and a new view of logistic regression.

Our work builds heavily on that of Kivinen and Warmuth [16] who, along with Lafferty, were the first to make a connection between AdaBoost and information geometry. They showed that the update used by AdaBoost is a form of "entropy projection." However, the Bregman distance that they used differed slightly from the one that we have chosen (normalized relative entropy rather than unnormalized relative entropy) so that AdaBoost's fit in this model was not quite complete; in particular, their convergence proof depended on assumptions that do not hold in general. Kivinen and Warmuth also described updates for general Bregman distances including, as one of their examples, the Bregman distance that we use to capture logistic regression.

2 BOOSTING, LOGISTIC MODELS AND LOSS FUNCTIONS

Let $S = \langle (x_1, y_1), \ldots, (x_m, y_m) \rangle$ be a set of training examples where each instance $x_i$ belongs to a domain or instance space $\mathcal{X}$, and each label $y_i \in \{-1, +1\}$.

We assume that we are also given a set of real-valued functions on $\mathcal{X}$, $h_1, \ldots, h_n$. Following convention in the MaxEnt literature, we call these functions features; in the boosting literature, these would be called weak or base hypotheses.

We study the problem of approximating the $y_i$'s using a linear combination of features. That is, we are interested in the problem of finding a vector of parameters $\lambda \in \mathbb{R}^n$ such that $f_\lambda(x_i) = \sum_{j=1}^n \lambda_j h_j(x_i)$ is a "good approximation" of $y_i$. How we measure the goodness of such an approximation varies with the task that we have in mind.

For classification problems, it is natural to try to match the sign of $f_\lambda(x_i)$ to $y_i$, that is, to attempt to minimize

$$\sum_{i=1}^m [\![\, y_i f_\lambda(x_i) \le 0 \,]\!] \tag{1}$$

where $[\![\,\pi\,]\!]$ is 1 if $\pi$ is true and 0 otherwise. Although minimization of the number of classification errors may be a worthwhile goal, in its most general form, the problem is intractable (see, for instance, [15]). It is therefore often advantageous to instead minimize some other nonnegative loss function. For instance, the boosting algorithm AdaBoost [13, 20] is based on the exponential loss

$$\sum_{i=1}^m \exp\left(-y_i f_\lambda(x_i)\right). \tag{2}$$

It can be verified that Eq. (1) is upper bounded by Eq. (2); however, the latter loss is much easier to work with as demonstrated by AdaBoost. Briefly, on each of a series of rounds, AdaBoost uses an oracle or subroutine called the weak learning algorithm to pick one feature (weak hypothesis) $h_j$, and the associated parameter $\lambda_j$ is then updated. It has been noted by Breiman [3, 4] and various later authors that both of these steps are done in such a way as to (approximately) cause the greatest decrease in the exponential loss. In this paper, we show for the first time that AdaBoost is in fact a provably effective method for finding parameters which minimize the exponential loss (assuming the weak learner always chooses the "best" $h_j$).

We also give an entirely new algorithm for minimizing exponential loss in which, on each round, all of the parameters $\lambda_j$ are updated in parallel rather than one at a time. Our hope is that this parallel-update algorithm will be faster than the sequential-update algorithm; see Section 10 for preliminary experiments in this regard.

Instead of using $f_\lambda$ as a classification rule, we might instead postulate that the $y_i$'s were generated stochastically as a function of the $x_i$'s and attempt to use $f_\lambda(x)$ to estimate the probability of the associated label $y$. A very natural and well-studied way of doing this is to pass $f_\lambda$ through a logistic function, that is, to use the estimate

$$\Pr[y = +1 \mid x] = \frac{1}{1 + e^{-f_\lambda(x)}}.$$

The likelihood of the labels occurring in the sample then is

$$\prod_{i=1}^m \frac{1}{1 + \exp\left(-y_i f_\lambda(x_i)\right)}.$$

Maximizing this likelihood then is equivalent to minimizing the log loss of this model

$$\sum_{i=1}^m \ln\left(1 + \exp\left(-y_i f_\lambda(x_i)\right)\right). \tag{3}$$

Generalized and improved iterative scaling [9, 10] are popular parallel-update methods for minimizing this loss. In this paper, we give an alternative parallel-update algorithm which we compare to iterative scaling techniques in preliminary experiments in Section 10.
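As a concrete illustration of the three losses just defined, the sketch below evaluates Eqs. (1)–(3) on a small hypothetical dataset with two hand-made threshold features (the data, feature definitions, and function names are ours, chosen only for illustration) and checks the stated fact that Eq. (1) is upper bounded by Eq. (2):

```python
import math

# Hypothetical toy data: instances are 2-D points, labels in {-1, +1}.
X = [(1.0, 2.0), (2.0, -1.0), (-1.0, 0.5), (-2.0, -1.5)]
y = [+1, +1, -1, -1]

# Two illustrative features h_1, h_2 (weak hypotheses); any real-valued
# functions of the instance would do.
features = [lambda x: 1.0 if x[0] > 0 else -1.0,
            lambda x: 1.0 if x[1] > 0 else -1.0]

def f(lam, x):
    """The linear combination f_lambda(x) = sum_j lambda_j h_j(x)."""
    return sum(l * h(x) for l, h in zip(lam, features))

def zero_one_loss(lam):          # Eq. (1): number of classification errors
    return sum(1 for xi, yi in zip(X, y) if yi * f(lam, xi) <= 0)

def exp_loss(lam):               # Eq. (2): exponential loss used by AdaBoost
    return sum(math.exp(-yi * f(lam, xi)) for xi, yi in zip(X, y))

def log_loss(lam):               # Eq. (3): log loss of the logistic model
    return sum(math.log(1 + math.exp(-yi * f(lam, xi)))
               for xi, yi in zip(X, y))

for lam in ([0.5, 0.3], [-0.5, 0.3]):
    # Eq. (1) <= Eq. (2): exp(-z) >= 1 whenever z <= 0.
    assert zero_one_loss(lam) <= exp_loss(lam)
    # Similarly ln(1 + exp(-z)) >= ln 2 whenever z <= 0, so the log loss
    # upper bounds Eq. (1) up to a factor of ln 2.
    assert zero_one_loss(lam) * math.log(2) <= log_loss(lam)
```

At $\lambda = \mathbf 0$ every margin is zero, so Eq. (1) counts all $m$ examples, Eq. (2) equals $m$, and Eq. (3) equals $m \ln 2$.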

3 BREGMAN-DISTANCE OPTIMIZATION

In this section, we give background on optimization using Bregman distances. This will form the unifying basis for our study of boosting and logistic regression. The particular set-up that we follow is taken primarily from Lafferty, Della Pietra and Della Pietra [18].

Let $F: \Delta \to \mathbb{R}$ be a continuously differentiable and strictly convex function defined on a closed, convex set $\Delta \subseteq \mathbb{R}^m$. The Bregman distance associated with $F$ is defined for $p, q \in \Delta$ to be

$$B_F(p \,\|\, q) \doteq F(p) - F(q) - \nabla F(q) \cdot (p - q).$$

For instance, when

$$F(p) = \sum_{i=1}^m \left( p_i \ln p_i - p_i \right) \tag{4}$$

$B_F$ is the (unnormalized) relative entropy

$$D_U(p \,\|\, q) = \sum_{i=1}^m \left( p_i \ln\frac{p_i}{q_i} + q_i - p_i \right).$$

It can be shown that, in general, every Bregman distance is nonnegative and is equal to zero if and only if its two arguments are equal.
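These definitions are easy to check numerically. The following sketch (our own illustrative code, not from the paper) implements the general Bregman distance and confirms that the choice of $F$ in Eq. (4) yields the unnormalized relative entropy, along with the nonnegativity property just stated:

```python
import math

def bregman(F, gradF, p, q):
    """B_F(p || q) = F(p) - F(q) - gradF(q) . (p - q)."""
    return F(p) - F(q) - sum(g * (pi - qi)
                             for g, pi, qi in zip(gradF(q), p, q))

# F(p) = sum_i (p_i ln p_i - p_i), Eq. (4); its gradient is (ln p_i)_i.
F = lambda p: sum(pi * math.log(pi) - pi for pi in p)
gradF = lambda p: [math.log(pi) for pi in p]

def unnorm_rel_entropy(p, q):
    """D_U(p || q) = sum_i (p_i ln(p_i / q_i) + q_i - p_i)."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p, q = [0.2, 1.5, 0.7], [0.9, 0.4, 1.1]
assert abs(bregman(F, gradF, p, q) - unnorm_rel_entropy(p, q)) < 1e-12
assert bregman(F, gradF, p, q) >= 0           # nonnegativity
assert abs(bregman(F, gradF, p, p)) < 1e-12   # zero when arguments are equal
```

Note that $\Delta = \mathbb{R}_+^m$ here, so the vectors need not be normalized probability distributions.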

There is a natural optimization problem that can be associated with a Bregman distance, namely, to find the vector $p \in \Delta$ that is closest to a given vector $q_0 \in \Delta$ subject to a set of linear constraints. These constraints are specified by an $m \times n$ matrix $M$ and a vector $\tilde p \in \Delta$. The vectors $p$ satisfying these constraints are those for which $p^{\mathrm T} M = \tilde p^{\mathrm T} M$. Thus, the problem is to find

$$\arg\min_{p \in \mathcal P} B_F(p \,\|\, q_0) \quad \text{where} \quad \mathcal P \doteq \left\{ p \in \Delta : p^{\mathrm T} M = \tilde p^{\mathrm T} M \right\}. \tag{5}$$

The "convex dual" of this problem gives an alternative formulation. Here, the problem is to find the vector of a particular form that is closest to a given vector $\tilde p$. The form of such vectors is defined via the Legendre transform, written $\mathcal L_F(q, v)$ (or simply $\mathcal L(q, v)$ when $F$ is clear from context):

$$\mathcal L_F(q, v) \doteq \arg\min_{p \in \Delta} \left( B_F(p \,\|\, q) + v \cdot p \right).$$

Using calculus, this can be seen to be equivalent to

$$\nabla F\!\left(\mathcal L_F(q, v)\right) = \nabla F(q) - v. \tag{6}$$

For instance, when $B_F$ is unnormalized relative entropy, it can be verified using calculus that

$$\left(\mathcal L_F(q, v)\right)_i = q_i e^{-v_i}. \tag{7}$$

From Eq. (6), it is useful to note that

$$\mathcal L_F\!\left(\mathcal L_F(q, v), w\right) = \mathcal L_F(q, v + w). \tag{8}$$

For a given $m \times n$ matrix $M$ and vector $q_0 \in \Delta$, we consider vectors obtained by taking the Legendre transform of a linear combination of columns of $M$ with the vector $q_0$, that is, vectors in the set

$$\mathcal Q \doteq \left\{ \mathcal L_F(q_0, M\lambda) : \lambda \in \mathbb{R}^n \right\}. \tag{9}$$

The dual optimization problem now can be stated to be the problem of finding

$$\arg\min_{q \in \bar{\mathcal Q}} B_F(\tilde p \,\|\, q)$$

where $\bar{\mathcal Q}$ is the closure of $\mathcal Q$.

The remarkable fact about these two optimization problems is that their solutions are the same, and, moreover, this solution turns out to be the unique point at the intersection of $\mathcal P$ and $\bar{\mathcal Q}$. We take the statement of this theorem from Lafferty, Della Pietra and Della Pietra [18]. The result appears to be due to Csiszár [6, 7] and Topsøe [21]. A proof for the case of (normalized) relative entropy is given by Della Pietra, Della Pietra and Lafferty [10]. See also Csiszár's survey article [8].

Theorem 1 Let $\tilde p$, $q_0$, $M$, $\Delta$, $F$, $B_F$, $\mathcal P$ and $\mathcal Q$ be as above. Assume $B_F(\tilde p \,\|\, q_0) < \infty$. Then there exists a unique $q^\star \in \Delta$ satisfying:

1. $q^\star \in \mathcal P \cap \bar{\mathcal Q}$

2. $B_F(p \,\|\, q) = B_F(p \,\|\, q^\star) + B_F(q^\star \,\|\, q)$ for any $p \in \mathcal P$ and $q \in \bar{\mathcal Q}$

3. $q^\star = \arg\min_{q \in \bar{\mathcal Q}} B_F(\tilde p \,\|\, q)$

4. $q^\star = \arg\min_{p \in \mathcal P} B_F(p \,\|\, q_0)$.

Moreover, any one of these four properties determines $q^\star$ uniquely.

This theorem will be extremely useful in proving the convergence of the algorithms described below. We will show in the next section how boosting and logistic regression can be viewed as optimization problems of the type given in part 3 of the theorem. Then, to prove optimality, we only need to show that our algorithms converge to a point in $\mathcal P \cap \bar{\mathcal Q}$.
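Part 2 of the theorem, the "Pythagorean" identity, can be spot-checked numerically. The sketch below constructs a tiny hypothetical instance of our own devising ($m = 2$, a single constraint column, unnormalized relative entropy), solves for the intersection point $q^\star \in \mathcal P \cap \mathcal Q$ in closed form, and verifies the identity for an arbitrary $p \in \mathcal P$ and $q \in \mathcal Q$; the construction is only meant to illustrate the theorem:

```python
import math

# Tiny hypothetical instance: m = 2, n = 1, unnormalized relative entropy.
q0 = [1.0, 1.0]
M = [+1.0, -1.0]                 # single column; constraint is p1 - p2 = c
p_tilde = [0.3, 0.8]             # the vector ~p defining the constraints
c = p_tilde[0] * M[0] + p_tilde[1] * M[1]    # ~p^T M

def D_U(p, q):                   # unnormalized relative entropy
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def q_of(lam):                   # member of Q: q_i = q0_i exp(-lam M_i), Eq. (7)
    return [q0i * math.exp(-lam * Mi) for q0i, Mi in zip(q0, M)]

# Solve for q* in P ∩ Q: with u = exp(-lam*), q0_1 u - q0_2 / u = c,
# a quadratic in u with a unique positive root.
u = (c + math.sqrt(c * c + 4 * q0[0] * q0[1])) / (2 * q0[0])
q_star = [q0[0] * u, q0[1] / u]
assert abs(q_star[0] * M[0] + q_star[1] * M[1] - c) < 1e-12   # q* in P

# Pythagorean identity (part 2): for p in P and q in Q,
#   D_U(p || q) = D_U(p || q*) + D_U(q* || q).
p = [0.9, 0.9 - c]               # another point satisfying p^T M = c
q = q_of(0.7)                    # arbitrary member of Q
assert abs(D_U(p, q) - (D_U(p, q_star) + D_U(q_star, q))) < 1e-9
```

The identity holds because $D_U(p\|q) - D_U(p\|q^\star) - D_U(q^\star\|q) = \sum_i (p_i - q^\star_i)\ln(q^\star_i/q_i)$, and $\ln(q^\star_i/q_i)$ is a multiple of $M_i$ while $p$ and $q^\star$ agree on the constraints.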

4 BOOSTING AND LOGISTIC REGRESSION REVISITED

We return now to the boosting and logistic regression problems outlined in Section 2, and show how these can be cast in the form of the optimization problems outlined above.

Recall that for boosting, our goal is to find $\lambda$ such that

$$\sum_{i=1}^m \exp\left(-y_i \sum_{j=1}^n \lambda_j h_j(x_i)\right) \tag{10}$$

is minimized, or, more precisely, if the minimum is not attained at a finite $\lambda$, then we seek a procedure for finding a sequence $\lambda_1, \lambda_2, \ldots$ which causes this function to converge to its infimum. For shorthand, we call this the ExpLoss problem.

To view this problem in the form given in Section 3, we let $\tilde p = \mathbf 0$ and $q_0 = \mathbf 1$ (the all 0's and all 1's vectors). We let $M_{ij} = y_i h_j(x_i)$, from which it follows that $(M\lambda)_i = \sum_{j=1}^n \lambda_j y_i h_j(x_i)$. The space $\Delta = \mathbb{R}_+^m$. Finally, we take $F$ to be as in Eq. (4) so that $B_F$ is the unnormalized relative entropy.

As noted earlier, in this case, $\mathcal L_F$ is as given in Eq. (7). In particular, this means that

$$\mathcal Q = \left\{ q \in \mathbb{R}_+^m : q_i = \exp\left(-y_i \sum_{j=1}^n \lambda_j h_j(x_i)\right), \; \lambda \in \mathbb{R}^n \right\}.$$

Furthermore, it is trivial to see that

$$D_U(\mathbf 0 \,\|\, q) = \sum_{i=1}^m q_i \tag{11}$$

so that $D_U(\mathbf 0 \,\|\, \mathcal L_F(q_0, M\lambda))$ is equal to Eq. (10). Thus, minimizing $D_U(\mathbf 0 \,\|\, q)$ over $q \in \bar{\mathcal Q}$ is equivalent to minimizing Eq. (10). By Theorem 1, this is equivalent to finding $q \in \bar{\mathcal Q}$ satisfying the constraints

$$\sum_{i=1}^m q_i M_{ij} = \sum_{i=1}^m q_i y_i h_j(x_i) = 0 \tag{12}$$

for $j = 1, \ldots, n$.
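A quick numeric sanity check of this reduction, using a small made-up matrix $M$ of our own: $D_U(\mathbf 0 \,\|\, \mathcal L_F(q_0, M\lambda))$ reproduces the exponential loss of Eq. (10), and the constraints of Eq. (12) coincide with stationarity of that loss (since $\partial/\partial\lambda_j$ of Eq. (10) is $-\sum_i q_i M_{ij}$):

```python
import math

# Made-up matrix with entries M_ij = y_i h_j(x_i).
M = [[+1.0, -0.5],
     [-1.0, +0.5],
     [+0.5, +1.0]]
lam = [0.4, -0.2]
m, n = len(M), len(M[0])

def v(lmbda, i):
    return sum(lmbda[j] * M[i][j] for j in range(n))

def exploss(lmbda):                       # Eq. (10)
    return sum(math.exp(-v(lmbda, i)) for i in range(m))

# q = L_F(q0, M lam) with q0 = 1: q_i = exp(-(M lam)_i) by Eq. (7),
# and D_U(0 || q) = sum_i q_i by Eq. (11), identical to Eq. (10).
q = [math.exp(-v(lam, i)) for i in range(m)]
assert abs(exploss(lam) - sum(q)) < 1e-12

# The constraints of Eq. (12) are exactly stationarity of Eq. (10):
# d(ExpLoss)/d lam_j = -sum_i q_i M_ij, checked by finite differences.
eps = 1e-6
for j in range(n):
    bumped = list(lam)
    bumped[j] += eps
    num_grad = (exploss(bumped) - exploss(lam)) / eps
    assert abs(num_grad + sum(q[i] * M[i][j] for i in range(m))) < 1e-4
```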

Logistic regression can be reduced to an optimization problem of this form in nearly the same way. Recall that here our goal is to find $\lambda$ (or a sequence of $\lambda$'s) which minimize

$$\sum_{i=1}^m \ln\left(1 + \exp\left(-y_i \sum_{j=1}^n \lambda_j h_j(x_i)\right)\right). \tag{13}$$

For shorthand, we call this the LogLoss problem. We define $\tilde p$ and $M$ exactly as for exponential loss. The vector $q_0$ is still constant, but now is defined to be $(1/2)\mathbf 1$, and the space $\Delta$ is now restricted to be $[0, 1]^m$. These are minor differences, however. The only important difference is in the choice of the function $F$, namely,

$$F(p) = \sum_{i=1}^m \left( p_i \ln p_i + (1 - p_i)\ln(1 - p_i) \right).$$

The resulting Bregman distance is the binary relative entropy

$$D_B(p \,\|\, q) = \sum_{i=1}^m \left( p_i \ln\frac{p_i}{q_i} + (1 - p_i)\ln\frac{1 - p_i}{1 - q_i} \right).$$

Trivially,

$$D_B(\mathbf 0 \,\|\, q) = -\sum_{i=1}^m \ln\left(1 - q_i\right). \tag{14}$$

For this choice of $F$, it can be verified using calculus that

$$\left(\mathcal L_F(q, v)\right)_i = \frac{q_i e^{-v_i}}{1 - q_i + q_i e^{-v_i}} \tag{15}$$

so that

$$\mathcal Q = \left\{ q \in [0, 1]^m : q_i = \sigma\!\left(y_i \sum_{j=1}^n \lambda_j h_j(x_i)\right), \; \lambda \in \mathbb{R}^n \right\}$$

where $\sigma(x) \doteq (1 + e^x)^{-1}$. Thus, $D_B(\mathbf 0 \,\|\, \mathcal L_F(q_0, M\lambda))$ is equal to Eq. (13) so minimizing $D_B(\mathbf 0 \,\|\, q)$ over $q \in \bar{\mathcal Q}$ is equivalent to minimizing Eq. (13). As before, this is the same as finding $q \in \bar{\mathcal Q}$ satisfying the constraints in Eq. (12).
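The same kind of sanity check works for this reduction (again with a small made-up matrix of our own): with $q_0 = (1/2)\mathbf 1$, the Legendre transform of Eq. (15) gives $q_i = \sigma((M\lambda)_i)$, and Eq. (14) then recovers the log loss of Eq. (13):

```python
import math

# Made-up matrix with entries M_ij = y_i h_j(x_i).
M = [[+1.0, -0.5],
     [-1.0, +0.5],
     [+0.5, +1.0]]
lam = [0.4, -0.2]
sigma = lambda x: 1.0 / (1.0 + math.exp(x))   # sigma(x) = (1 + e^x)^(-1)

def v(i):
    return sum(lam[j] * M[i][j] for j in range(len(lam)))

# LogLoss, Eq. (13)
logloss = sum(math.log(1 + math.exp(-v(i))) for i in range(len(M)))

# Bregman formulation: q0 = 1/2 in Eq. (15) gives q_i = sigma((M lam)_i),
# and Eq. (14) gives D_B(0 || q) = -sum_i ln(1 - q_i).
q = [sigma(v(i)) for i in range(len(M))]
assert abs(logloss + sum(math.log(1 - qi) for qi in q)) < 1e-12
```

The check works because $1 - \sigma(x) = (1 + e^{-x})^{-1}$, so $-\ln(1 - q_i) = \ln(1 + e^{-(M\lambda)_i})$ term by term.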

5 PARALLEL OPTIMIZATION METHODS

In this section, we describe a new algorithm for the ExpLoss and LogLoss problems using an iterative method in which all weights $\lambda_j$ are updated on each iteration. The algorithm is shown in Fig. 1. The algorithm can be used with any function $F$ satisfying certain conditions described below; in particular, we will see that it can be used with the choices of $F$ given in Section 4. Thus, this is really a single algorithm that can be used for both loss-minimization problems by setting the parameters appropriately. Note that, without loss of generality, we assume in this section that for all instances $i$, $\sum_{j=1}^n |M_{ij}| \le 1$.

The algorithm is very simple. On each iteration, the vector $\delta_t$ is computed as shown and added to the parameter vector $\lambda_t$. We assume for all our algorithms that the inputs are such that infinite-valued updates never occur.

This algorithm is new for both minimization problems. Optimization methods for ExpLoss, notably AdaBoost, have generally involved updates of one feature at a time. Parallel-update methods for LogLoss are well known (see, for example, [9, 10]). However, our updates take a different form from the usual updates derived for logistic models.

Parameters: $\Delta \subseteq \mathbb{R}_+^m$; $F: \Delta \to \mathbb{R}$ satisfying Assumptions 1 and 2; $q_0 \in \Delta$ such that $B_F(\mathbf 0 \,\|\, q_0) < \infty$
Input: Matrix $M \in [-1, 1]^{m \times n}$ where, for all $i$, $\sum_{j=1}^n |M_{ij}| \le 1$
Output: $\lambda_1, \lambda_2, \ldots$ such that
$$\lim_{t \to \infty} B_F\!\left(\mathbf 0 \,\|\, \mathcal L_F(q_0, M\lambda_t)\right) = \inf_{\lambda \in \mathbb{R}^n} B_F\!\left(\mathbf 0 \,\|\, \mathcal L_F(q_0, M\lambda)\right).$$
Let $\lambda_1 = \mathbf 0$.
For $t = 1, 2, \ldots$:
• $q_t = \mathcal L_F(q_0, M\lambda_t)$
• For $j = 1, \ldots, n$:
$$W_t^+(j) = \sum_{i:\, \mathrm{sign}(M_{ij}) = +1} q_{t,i}\, |M_{ij}|, \qquad W_t^-(j) = \sum_{i:\, \mathrm{sign}(M_{ij}) = -1} q_{t,i}\, |M_{ij}|,$$
$$\delta_{t,j} = \frac{1}{2} \ln\left(\frac{W_t^+(j)}{W_t^-(j)}\right)$$
• Update parameters: $\lambda_{t+1} = \lambda_t + \delta_t$

Figure 1: The parallel-update optimization algorithm.
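A direct transcription of Fig. 1 for the two losses might look as follows in Python. This is a sketch under the assumptions stated in the figure, in particular $\sum_j |M_{ij}| \le 1$ for every row and no infinite-valued updates; the helper names and the toy matrix are ours:

```python
import math

def parallel_update(M, loss="exp", T=100):
    """Sketch of the Fig. 1 parallel-update algorithm for ExpLoss ("exp",
    q0 = 1) and LogLoss ("log", q0 = 1/2). Assumes sum_j |M_ij| <= 1 for
    every row i and that W+/W- never vanishes (no infinite updates)."""
    m, n = len(M), len(M[0])
    lam = [0.0] * n
    for _ in range(T):
        # q_t = L_F(q0, M lam): Eq. (7) for ExpLoss, Eq. (15) for LogLoss.
        v = [sum(lam[j] * M[i][j] for j in range(n)) for i in range(m)]
        if loss == "exp":
            q = [math.exp(-vi) for vi in v]
        else:
            q = [1.0 / (1.0 + math.exp(vi)) for vi in v]
        # All deltas are computed from the same q_t, then applied at once:
        # this is what makes the update "parallel".
        for j in range(n):
            Wplus = sum(q[i] * M[i][j] for i in range(m) if M[i][j] > 0)
            Wminus = sum(-q[i] * M[i][j] for i in range(m) if M[i][j] < 0)
            lam[j] += 0.5 * math.log(Wplus / Wminus)   # delta_{t,j}
    return lam

def exp_loss(M, lam):
    return sum(math.exp(-sum(lam[j] * M[i][j] for j in range(len(lam))))
               for i in range(len(M)))

def log_loss(M, lam):
    return sum(math.log(1 + math.exp(-sum(lam[j] * M[i][j]
               for j in range(len(lam))))) for i in range(len(M)))

# Made-up matrix with both signs in every column and row sums of
# absolute values at most 1.
M = [[+0.50, -0.50],
     [-0.50, +0.25],
     [+0.25, +0.50]]
for name, fn in (("exp", exp_loss), ("log", log_loss)):
    lam = parallel_update(M, name, T=100)
    # The loss never increases, and the first round strictly decreases it.
    assert fn(M, lam) < fn(M, [0.0, 0.0])
```

The monotone decrease asserted here is exactly what Theorem 3 below guarantees for this update.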

A useful point is that the distribution $q_{t+1}$ is a simple function of the previous distribution $q_t$. By Eq. (8),

$$q_{t+1} = \mathcal L_F(q_0, M\lambda_{t+1}) = \mathcal L_F(q_0, M\lambda_t + M\delta_t) = \mathcal L_F\!\left(\mathcal L_F(q_0, M\lambda_t), M\delta_t\right) = \mathcal L_F(q_t, M\delta_t). \tag{16}$$

This gives

$$q_{t+1,i} = q_{t,i}\, e_{t,i} \qquad \text{and} \qquad q_{t+1,i} = \frac{q_{t,i}\, e_{t,i}}{1 - q_{t,i} + q_{t,i}\, e_{t,i}}, \quad \text{where } e_{t,i} \doteq \exp\left(-\sum_{j=1}^n \delta_{t,j} M_{ij}\right), \tag{17}$$

for ExpLoss and LogLoss respectively.

We will prove next that the algorithm given in Fig. 1 converges to optimality for either loss. We prove this abstractly for any matrix $M$ and vector $q_0$, and for any function $F$ satisfying the following assumptions:

Assumption 1 For any $v \in \mathbb{R}^m$, $q \in \Delta$,

$$B_F\!\left(\mathbf 0 \,\|\, \mathcal L_F(q, v)\right) - B_F(\mathbf 0 \,\|\, q) \le \sum_{i=1}^m q_i \left( e^{-v_i} - 1 \right).$$

Assumption 2 For any $c < \infty$, the set $\left\{ q \in \Delta : B_F(\mathbf 0 \,\|\, q) \le c \right\}$ is bounded.

We will show later that the choices of $F$ given in Section 4 satisfy these assumptions which will allow us to prove convergence for ExpLoss and LogLoss.

To prove convergence, we use the auxiliary-function technique of Della Pietra, Della Pietra and Lafferty [10]. Very roughly, the idea of the proof is to derive a nonnegative lower bound called an auxiliary function on how much the loss decreases on each iteration. Since the loss never increases and is lower bounded by zero, the auxiliary function must converge to zero. The final step is to show that when the auxiliary function is zero, the constraints defining the set $\mathcal P$ must be satisfied, and therefore, by Theorem 1, we must have converged to optimality.

More formally, we define an auxiliary function for a sequence $q_1, q_2, \ldots$ and matrix $M$ to be a continuous function $A: \Delta \to \mathbb{R}$ satisfying the two conditions:

$$B_F(\mathbf 0 \,\|\, q_{t+1}) - B_F(\mathbf 0 \,\|\, q_t) \le A(q_t) \le 0 \tag{18}$$

and

$$A(q) = 0 \implies q^{\mathrm T} M = \mathbf 0. \tag{19}$$

Before proving convergence of specific algorithms, we prove the following lemma which shows, roughly, that if a sequence has an auxiliary function, then the sequence converges to the optimum point $q^\star$. Thus, proving convergence of a specific algorithm reduces to simply finding an auxiliary function.

Lemma 2 Let $A$ be an auxiliary function for $q_1, q_2, \ldots$ and matrix $M$. Assume the $q_t$'s lie in a compact subspace of $\bar{\mathcal Q}$ where $\mathcal Q$ is as in Eq. (9); in particular, this will be the case if Assumption 2 holds and $B_F(\mathbf 0 \,\|\, q_1) < \infty$. Then

$$\lim_{t \to \infty} q_t = q^\star \doteq \arg\min_{q \in \bar{\mathcal Q}} B_F(\mathbf 0 \,\|\, q).$$

Proof: By condition (18), $B_F(\mathbf 0 \,\|\, q_t)$ is a nonincreasing sequence which is bounded below by zero. Therefore, the sequence of differences $B_F(\mathbf 0 \,\|\, q_{t+1}) - B_F(\mathbf 0 \,\|\, q_t)$ must converge to zero. By condition (18), this means that $A(q_t)$ must also converge to zero. Because we assume that the $q_t$'s lie in a compact space, the sequence of $q_t$'s must have a subsequence converging to some point $\hat q \in \Delta$. By continuity of $A$, we have $A(\hat q) = 0$. Therefore, $\hat q \in \mathcal P$ by condition (19), where $\mathcal P$ is as in Eq. (5). On the other hand, $\hat q$ is the limit of a sequence of points in $\bar{\mathcal Q}$ so $\hat q \in \bar{\mathcal Q}$. Thus, $\hat q \in \mathcal P \cap \bar{\mathcal Q}$ so $\hat q = q^\star$ by Theorem 1.

This argument and the uniqueness of $q^\star$ show that the $q_t$'s have only a single limit point $q^\star$. Suppose that the entire sequence did not converge to $q^\star$. Then we could find an open set $U$ containing $q^\star$ such that $\{q_1, q_2, \ldots\} \setminus U$ contains infinitely many points and therefore has a limit point which must be in the closed set $\Delta \setminus U$ and so must be different from $q^\star$. This, we have already argued, is impossible. Therefore, the entire sequence converges to $q^\star$.

We can now apply this lemma to prove the convergence of the algorithm of Fig. 1.

Theorem 3 Let � satisfy Assumptions 1 and 2, and assumethat

�C� 2 0 � q0 3 D F . Let the sequences 1 �� 2 ��� ��� andq1 � q2 � ��� � be generated by the algorithm of Fig. 1. Then

lim����� q � � arg minq, B � � 2 0 � q 3

5

whereA

is as in Eq. (9). That is,

lim����� �C� 2 0 � �M � 6 q0 3 � inf$ , ! � �C� 2 0 � �

M 6 q0 3 �Proof: Let �) � � � .�

:sign � <�� * � 1

# � � � ) � �) � � � .�:sign � <�� * � 1

# � � � ) �so that

��� ) � �) � � � and ��� ) � �) � � � . We claim that

the function� � � � ��.)+*

1

" � �) � � �� �) � � $ 2

is an auxiliary function for � 1 ��� 2 � ��� � . Clearly,�

is continu-ous and nonpositive.

Let $s_{ij} = \mathrm{sign}(M_{ij})$. We can upper bound the change in $B(\mathbf{0}\,\|\,q_t)$ on round $t$ by $A(q_t)$ as follows:

$$B(\mathbf{0}\,\|\,q_{t+1}) - B(\mathbf{0}\,\|\,q_t) = B(\mathbf{0}\,\|\,\mathcal{L}_{q_t}(M\delta_t)) - B(\mathbf{0}\,\|\,q_t) \quad (20)$$
$$\le \sum_{i=1}^m q_{t,i}\left[\exp\left(-\sum_{j=1}^n \delta_{t,j} M_{ij}\right) - 1\right] \quad (21)$$
$$= \sum_{i=1}^m q_{t,i}\left[\exp\left(-\sum_{j=1}^n \delta_{t,j}\, s_{ij}\,|M_{ij}|\right) - 1\right]$$
$$\le \sum_{i=1}^m q_{t,i}\sum_{j=1}^n |M_{ij}|\left(e^{-\delta_{t,j} s_{ij}} - 1\right) \quad (22)$$
$$= \sum_{j=1}^n\left(W_{t,j}^+\, e^{-\delta_{t,j}} + W_{t,j}^-\, e^{\delta_{t,j}} - W_{t,j}^+ - W_{t,j}^-\right) \quad (23)$$
$$= -\sum_{j=1}^n\left(\sqrt{W_{t,j}^+} - \sqrt{W_{t,j}^-}\right)^2 = A(q_t). \quad (24)$$

Eqs. (20) and (21) follow from Eq. (16) and Assumption 1, respectively. Eq. (22) uses the fact that, for any $x_j$'s and for $p_j \ge 0$ with $\sum_j p_j \le 1$, we have

$$\exp\left(\sum_j p_j x_j\right) = \exp\left(\sum_j p_j x_j + \left(1 - \sum_j p_j\right)\cdot 0\right) \le \sum_j p_j e^{x_j} + \left(1 - \sum_j p_j\right) = 1 + \sum_j p_j\left(e^{x_j} - 1\right) \quad (25)$$

by Jensen's inequality applied to the convex function $e^x$. Eq. (23) uses the definitions of $W_{t,j}^+$ and $W_{t,j}^-$, and Eq. (24) uses our choice of $\delta_t$ (indeed, $\delta_t$ was chosen specifically to minimize Eq. (23)).
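Since Eq. (25) is the workhorse of this and the later proofs, a quick numeric spot-check may be reassuring. The snippet below is our own sanity check, not part of the paper; it verifies the bound on random inputs satisfying $p_j \ge 0$, $\sum_j p_j \le 1$.

```python
import numpy as np

# Spot-check of Eq. (25): for p_j >= 0 with sum_j p_j <= 1,
#   exp(sum_j p_j x_j) <= 1 + sum_j p_j (exp(x_j) - 1),
# by Jensen's inequality applied to e^x, putting the leftover mass at 0.
rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.uniform(0.0, 1.0, size=5)
    p /= p.sum() * rng.uniform(1.0, 3.0)   # now sum(p) <= 1
    x = rng.normal(scale=2.0, size=5)
    assert np.exp(p @ x) <= 1.0 + p @ (np.exp(x) - 1.0) + 1e-12
```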

If $A(q) = 0$, then for all $j$, $W_j^+(q) = W_j^-(q)$, that is,

$$0 = W_j^+(q) - W_j^-(q) = \sum_{i=1}^m q_i\, s_{ij}\,|M_{ij}| = \sum_{i=1}^m q_i M_{ij}.$$

Thus, $A$ is an auxiliary function for $q_1, q_2, \ldots$, and the theorem now follows immediately from Lemma 2.

To apply this theorem to the ExpLoss and LogLoss problems, we only need to verify that Assumptions 1 and 2 are satisfied. For ExpLoss, Assumption 1 holds with equality. For LogLoss,

$$D_B(\mathbf{0}\,\|\,\mathcal{L}_q(v)) - D_B(\mathbf{0}\,\|\,q) = \sum_{i=1}^m \ln\left(\frac{1 - q_i}{1 - \mathcal{L}_q(v)_i}\right) = \sum_{i=1}^m \ln\left(1 - q_i + q_i e^{-v_i}\right) \le \sum_{i=1}^m\left(-q_i + q_i e^{-v_i}\right).$$

The first and second equalities use Eqs. (14) and (15), respectively. The final inequality uses $1 + x \le e^x$ for all $x$.

Assumption 2 holds trivially for LogLoss since $\Delta = [0,1]^m$ is bounded. For ExpLoss, if $D_U(\mathbf{0}\,\|\,q) \le c$ then

$$\sum_{i=1}^m q_i \le c,$$

which clearly defines a bounded subset of $\mathbb{R}_+^m$.
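To make the parallel-update algorithm of Fig. 1 concrete for the ExpLoss case (where $q^0 = \mathbf{1}$ and $q_{t,i} = e^{-(M\lambda_t)_i}$), here is a small NumPy sketch. The smoothing constant `eps`, which guards the logarithm when $W^+_{t,j}$ or $W^-_{t,j}$ vanishes, is our addition and not part of the paper's algorithm.

```python
import numpy as np

def parallel_update_exploss(M, n_rounds=100, eps=1e-12):
    """Parallel-update minimization of ExpLoss(lam) = sum_i exp(-(M lam)_i).

    M is the m x n matrix with M[i, j] = y_i * h_j(x_i); each row is
    assumed to satisfy sum_j |M[i, j]| <= 1, as the algorithm requires.
    """
    m, n = M.shape
    lam = np.zeros(n)
    for _ in range(n_rounds):
        q = np.exp(-(M @ lam))                # q_{t,i}
        W_plus = (q[:, None] * np.abs(M) * (M > 0)).sum(axis=0)
        W_minus = (q[:, None] * np.abs(M) * (M < 0)).sum(axis=0)
        delta = 0.5 * np.log((W_plus + eps) / (W_minus + eps))
        lam += delta                          # all coordinates updated at once
    return lam
```

By Theorem 3 the ExpLoss value is nonincreasing over rounds, which gives an easy way to sanity-check the implementation.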

6 SEQUENTIAL ALGORITHMS

In this section, we describe another algorithm for the same minimization problems described in Section 4. However, unlike the algorithm of Section 5, the one that we present now only updates the weight of one feature at a time. While the parallel-update algorithm may give faster convergence when there are not too many features, the sequential-update algorithm can be used when there are a very large number of features, using an oracle for selecting which feature to update next. For instance, AdaBoost, which is essentially equivalent to the sequential-update algorithm for ExpLoss, uses an assumed weak learning algorithm to select a weak hypothesis, i.e., one of the features. The sequential algorithm that we present for LogLoss can be used in exactly the same way. The algorithm is shown in Fig. 2.

Theorem 4 Given the assumptions of Theorem 3, the algorithm of Fig. 2 converges to optimality in the sense of Theorem 3.

Proof: For this theorem, we use the auxiliary function

$$A(q) = \sqrt{\left(\sum_{i=1}^m q_i\right)^2 - \max_j\left(\sum_{i=1}^m q_i M_{ij}\right)^2}\ -\ \sum_{i=1}^m q_i.$$

Parameters: (same as in Fig. 1)
Input: Matrix $M \in [-1, +1]^{m\times n}$
Output: (same as in Fig. 1)
Let $\lambda_1 = \mathbf{0}$
For $t = 1, 2, \ldots$:
- $q_t = \mathcal{L}_{q^0}(M\lambda_t)$
- $j_t = \arg\max_j \left|\sum_{i=1}^m q_{t,i} M_{ij}\right|$
- $r_t = \sum_{i=1}^m q_{t,i} M_{i j_t}$
- $Z_t = \sum_{i=1}^m q_{t,i}$
- $\alpha_t = \frac{1}{2}\ln\left(\dfrac{Z_t + r_t}{Z_t - r_t}\right)$
- $\delta_{t,j} = \alpha_t$ if $j = j_t$, $0$ else
- Update parameters: $\lambda_{t+1} = \lambda_t + \delta_t$

Figure 2: The sequential-update optimization algorithm.

This function is clearly continuous and nonpositive. We have that

$$B(\mathbf{0}\,\|\,q_{t+1}) - B(\mathbf{0}\,\|\,q_t) \le \sum_{i=1}^m q_{t,i}\left[\exp\left(-\sum_{j=1}^n \delta_{t,j} M_{ij}\right) - 1\right] = \sum_{i=1}^m q_{t,i}\left(e^{-\alpha_t M_{i j_t}} - 1\right) \quad (26)$$
$$\le \sum_{i=1}^m q_{t,i}\left[\frac{1 + M_{i j_t}}{2}\, e^{-\alpha_t} + \frac{1 - M_{i j_t}}{2}\, e^{\alpha_t} - 1\right] \quad (27)$$
$$= \frac{Z_t + r_t}{2}\, e^{-\alpha_t} + \frac{Z_t - r_t}{2}\, e^{\alpha_t} - Z_t \quad (28)$$
$$= \sqrt{Z_t^2 - r_t^2} - Z_t = A(q_t), \quad (29)$$

where Eq. (27) uses the convexity of $e^{-\alpha_t x}$, and Eq. (29) uses our choice of $\alpha_t$ (as before, we chose $\alpha_t$ to minimize the bound in Eq. (28)).

If $A(q) = 0$ then

$$0 = \max_j\left|\sum_{i=1}^m q_i M_{ij}\right|,$$

so $\sum_i q_i M_{ij} = 0$ for all $j$. Thus, $A$ is an auxiliary function for $q_1, q_2, \ldots$, and the theorem follows immediately from Lemma 2.

As mentioned above, this algorithm is essentially equivalent to AdaBoost, specifically, the version of AdaBoost first presented by Freund and Schapire [13]. In AdaBoost, on each iteration, a distribution $D_t$ over the training examples is computed and the weak learner seeks a weak hypothesis with low error with respect to this distribution. The algorithm presented in this section assumes that the space of weak hypotheses consists of the features $h_1, \ldots, h_n$, and that the weak learner always succeeds in selecting the feature with lowest error (or, more accurately, with error farthest from $1/2$). Translating to our notation, the weight $D_t(i)$ assigned to example $(x_i, y_i)$ by AdaBoost is exactly equal to $q_{t,i}/Z_t$, and the weighted error of the $t$-th weak hypothesis is equal to

$$\frac{1}{2}\left(1 - \frac{r_t}{Z_t}\right).$$

Theorem 4 then is the first proof that AdaBoost always converges to the minimum of the exponential loss (assuming an idealized weak learner of the form above). Note that when $q^\star \ne \mathbf{0}$, this theorem also tells us the exact form of $\lim q_t$. However, we do not know what the limiting behavior of $q_t$ is when $q^\star = \mathbf{0}$, nor do we know about the limiting behavior of the parameters $\lambda_t$ (whether or not $q^\star = \mathbf{0}$).

We have also presented in this section a new algorithm for logistic regression. In fact, this algorithm is the same as one given by Duffy and Helmbold [12] except for the choice of $\alpha_t$. In practical terms, very little work would be required to alter an existing learning system based on AdaBoost so that it uses logistic loss rather than exponential loss: the only difference is in the manner in which $q_t$ is computed from $\lambda_t$. We can even do this for systems based on "confidence-rated" boosting [20] in which $j_t$ and $\alpha_t$ are chosen together on each round to minimize Eq. (26) rather than an approximation of this expression as used in the algorithm of Fig. 2. (Note that the proof of Theorem 4 can easily be modified to prove the convergence of such an algorithm using the same auxiliary function.)
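A minimal sketch of the sequential-update algorithm of Fig. 2 for ExpLoss, i.e., the idealized AdaBoost discussed above. The "oracle" here simply scans all columns for the largest $|r_t|$; all names are ours.

```python
import numpy as np

def sequential_update_exploss(M, n_rounds=100):
    """Sequential-update minimization of ExpLoss; essentially AdaBoost with
    an idealized weak learner that returns the best feature each round.
    Requires M[i, j] in [-1, +1]."""
    m, n = M.shape
    lam = np.zeros(n)
    for _ in range(n_rounds):
        q = np.exp(-(M @ lam))
        r = q @ M                           # r_j = sum_i q_i M_ij
        j = int(np.argmax(np.abs(r)))       # feature chosen this round
        Z = q.sum()
        alpha = 0.5 * np.log((Z + r[j]) / (Z - r[j]))
        lam[j] += alpha                     # only one coordinate changes
    return lam
```

As with the parallel version, Theorem 4 guarantees that the exponential loss is nonincreasing round over round, which can be used as a quick correctness check.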

7 A PARAMETERIZED FAMILY OF ITERATIVE ALGORITHMS

In previous sections, we described separate parallel- and sequential-update algorithms. In this section, we describe a parameterized family of algorithms that includes the parallel-update algorithm of Section 5 as well as a sequential-update algorithm that is different from the one in Section 6. This family of algorithms also includes other algorithms that may be more appropriate than either in certain situations, as we explain below.

The algorithm, which is shown in Fig. 3, is similar to the parallel-update algorithm of Fig. 1. On each round, the quantities $W_{t,j}^+$ and $W_{t,j}^-$ are computed as before, and the vector $\delta_t$ is computed as it was in Fig. 1. Now, however, this vector $\delta_t$ is not added directly to $\lambda_t$. Instead, another vector $a_t$ is selected which provides a "scaling" of the features. This vector is chosen to maximize a measure of progress while restricted to belong to the set $\mathcal{A}_M$. The allowed form of these scaling vectors is given by the set $\mathcal{A} \subseteq \mathbb{R}_+^n$, a parameter of the algorithm; $\mathcal{A}_M$ is the restriction of $\mathcal{A}$ to those vectors $a$ satisfying the constraint that for all $i$,

$$\sum_{j=1}^n a_j\,|M_{ij}| \le 1.$$

The parallel-update algorithm of Fig. 1 is obtained by choosing $\mathcal{A} = \{\mathbf{1}\}$ and assuming that $\sum_j |M_{ij}| \le 1$ for all $i$. (Equivalently, we can make no such assumption, and choose $\mathcal{A} = \{c\mathbf{1} : c > 0\}$.)


Parameters: (same as in Fig. 1)
  $\mathcal{A} \subseteq \mathbb{R}_+^n$
Input: Matrix $M \in \mathbb{R}^{m\times n}$ satisfying the condition that $\forall j\ \exists a \in \mathcal{A}_M$ for which $a_j > 0$, where
  $\mathcal{A}_M = \left\{a \in \mathcal{A} : \forall i\ \sum_j a_j |M_{ij}| \le 1\right\}$
Output: (same as in Fig. 1)
Let $\lambda_1 = \mathbf{0}$
For $t = 1, 2, \ldots$:
- $q_t = \mathcal{L}_{q^0}(M\lambda_t)$
- For $j = 1, \ldots, n$:
    $W_{t,j}^+ = \sum_{i:\,\mathrm{sign}(M_{ij})=+1} q_{t,i}\,|M_{ij}|$
    $W_{t,j}^- = \sum_{i:\,\mathrm{sign}(M_{ij})=-1} q_{t,i}\,|M_{ij}|$
    $\delta_{t,j} = \frac{1}{2}\ln\left(\dfrac{W_{t,j}^+}{W_{t,j}^-}\right)$
- $a_t = \arg\max_{a\in\mathcal{A}_M} \sum_{j=1}^n a_j\left(\sqrt{W_{t,j}^+} - \sqrt{W_{t,j}^-}\right)^2$
- $\delta'_{t,j} = a_{t,j}\,\delta_{t,j}$
- Update parameters: $\lambda_{t+1} = \lambda_t + \delta'_t$

Figure 3: A parameterized family of iterative optimization algorithms.

We can obtain a sequential-update algorithm by choosing $\mathcal{A}$ to be the set of unit vectors (i.e., with one component equal to 1 and all others equal to 0), and assuming that $M_{ij} \in [-1, +1]$ for all $i, j$. The update then becomes

$$\delta'_{t,j} = \begin{cases}\delta_{t,j} & \text{if } j = j_t\\ 0 & \text{else}\end{cases} \qquad\text{where}\qquad j_t = \arg\max_j\left(\sqrt{W_{t,j}^+} - \sqrt{W_{t,j}^-}\right)^2.$$

Another interesting case is when we assume that $\sum_j M_{ij}^2 \le 1$ for all $i$. It is then natural to choose

$$\mathcal{A} = \left\{a \in \mathbb{R}_+^n : \|a\|_2 \le 1\right\},$$

which ensures that $\mathcal{A}_M = \mathcal{A}$. Then the maximization over $\mathcal{A}_M$ can be solved analytically, giving the update

$$\delta'_{t,j} = \frac{b_j\,\delta_{t,j}}{\|b\|_2}, \qquad\text{where } b_j = \left(\sqrt{W_{t,j}^+} - \sqrt{W_{t,j}^-}\right)^2.$$

(This idea generalizes easily to the case in which $\sum_j |M_{ij}|^p \le 1$ and $\|a\|_{p'} \le 1$ for any dual norms $p$ and $p'$.)

A final case is when we do not restrict the scaling vectors at all, i.e., we choose $\mathcal{A} = \mathbb{R}_+^n$. In this case, the maximization problem that must be solved to choose each $a_t$ is a linear programming problem with $n$ variables and $m$ constraints.

We now prove the convergence of this entire family of algorithms.

Theorem 5 Given the assumptions of Theorem 3, the algorithm of Fig. 3 converges to optimality in the sense of Theorem 3.

Proof: We use the auxiliary function

$$A(q) = -\max_{a\in\mathcal{A}_M}\sum_{j=1}^n a_j\left(\sqrt{W_j^+(q)} - \sqrt{W_j^-(q)}\right)^2,$$

where $W_j^+$ and $W_j^-$ are as in Theorem 3. This function is continuous and nonpositive. We can bound the change in $B(\mathbf{0}\,\|\,q_t)$ using the same technique given in Theorem 3:

$$B(\mathbf{0}\,\|\,q_{t+1}) - B(\mathbf{0}\,\|\,q_t) \le \sum_{i=1}^m q_{t,i}\left[\exp\left(-\sum_{j=1}^n \delta'_{t,j} M_{ij}\right) - 1\right]$$
$$= \sum_{i=1}^m q_{t,i}\left[\exp\left(-\sum_{j=1}^n a_{t,j}\,|M_{ij}|\,\delta_{t,j}\, s_{ij}\right) - 1\right]$$
$$\le \sum_{i=1}^m q_{t,i}\sum_{j=1}^n a_{t,j}\,|M_{ij}|\left(e^{-\delta_{t,j} s_{ij}} - 1\right)$$
$$= \sum_{j=1}^n a_{t,j}\left(W_{t,j}^+\, e^{-\delta_{t,j}} + W_{t,j}^-\, e^{\delta_{t,j}} - W_{t,j}^+ - W_{t,j}^-\right)$$
$$= -\sum_{j=1}^n a_{t,j}\left(\sqrt{W_{t,j}^+} - \sqrt{W_{t,j}^-}\right)^2 = A(q_t).$$

Finally, if $A(q) = 0$ then

$$\max_{a\in\mathcal{A}_M}\sum_{j=1}^n a_j\left(\sqrt{W_j^+(q)} - \sqrt{W_j^-(q)}\right)^2 = 0.$$

Since for every $j$ there exists $a \in \mathcal{A}_M$ with $a_j > 0$, this implies $W_j^+(q) = W_j^-(q)$ for all $j$, i.e., $\sum_i q_i M_{ij} = 0$. Applying Lemma 2 completes the theorem.

8 MULTICLASS PROBLEMS

In this section, we show how all of our results can be extended to the multiclass case. Because of the generality of the preceding results, we will see that no new algorithms need be devised and no new convergence proofs need be proved for this case. Rather, all of the preceding algorithms and proofs can be directly applied to the multiclass case.

In the multiclass case, the label set $\mathcal{Y}$ has cardinality $k$. Each feature is of the form $h_j : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$. In logistic regression, we use a model

$$\Pr[\ell\,|\,x] = \frac{e^{f_\lambda(x,\ell)}}{\sum_{\ell'\in\mathcal{Y}} e^{f_\lambda(x,\ell')}} = \frac{1}{1 + \sum_{\ell'\ne\ell} e^{f_\lambda(x,\ell') - f_\lambda(x,\ell)}} \quad (30)$$

where $f_\lambda(x,\ell) = \sum_{j=1}^n \lambda_j h_j(x,\ell)$. The loss on a training set then is

$$\sum_{i=1}^m \ln\left(1 + \sum_{\ell\ne y_i} e^{f_\lambda(x_i,\ell) - f_\lambda(x_i,y_i)}\right). \quad (31)$$

We transform this into our framework as follows. Let

$$\mathcal{B} = \{(i,\ell) : 1 \le i \le m,\ \ell \in \mathcal{Y} - \{y_i\}\}.$$

The vectors $p$, $q$, etc. that we work with are in $\mathbb{R}_+^{\mathcal{B}}$. That is, they are $m(k-1)$-dimensional and are indexed by pairs in $\mathcal{B}$. Let $\bar{q}_i$ denote $\sum_{\ell\ne y_i} q_{i,\ell}$. The convex function $\Phi$ that we use for this case is

$$\Phi(q) = \sum_{i=1}^m\left[\sum_{\ell\ne y_i} q_{i,\ell}\ln q_{i,\ell} + (1 - \bar{q}_i)\ln(1 - \bar{q}_i)\right],$$

which is defined over the space

$$\Delta = \left\{q \in \mathbb{R}_+^{\mathcal{B}} : \forall i,\ \bar{q}_i \le 1\right\}.$$

The resulting Bregman distance is

$$B_\Phi(p\,\|\,q) = \sum_{i=1}^m\left[\sum_{\ell\ne y_i} p_{i,\ell}\ln\left(\frac{p_{i,\ell}}{q_{i,\ell}}\right) + (1 - \bar{p}_i)\ln\left(\frac{1 - \bar{p}_i}{1 - \bar{q}_i}\right)\right].$$

Clearly,

$$B_\Phi(\mathbf{0}\,\|\,q) = -\sum_{i=1}^m \ln(1 - \bar{q}_i).$$

It can be shown that

$$\mathcal{L}_q(v)_{i,\ell} = \frac{q_{i,\ell}\, e^{-v_{i,\ell}}}{1 - \bar{q}_i + \sum_{\ell'\ne y_i} q_{i,\ell'}\, e^{-v_{i,\ell'}}}.$$

Assumption 1 can be verified by noting that

$$B_\Phi(\mathbf{0}\,\|\,\mathcal{L}_q(v)) - B_\Phi(\mathbf{0}\,\|\,q) = \sum_{i=1}^m \ln\left(\frac{1 - \bar{q}_i}{1 - \overline{\mathcal{L}_q(v)}_i}\right) = \sum_{i=1}^m\ln\left(1 - \bar{q}_i + \sum_{\ell\ne y_i} q_{i,\ell}\, e^{-v_{i,\ell}}\right) \quad (32)$$
$$\le \sum_{i=1}^m\left(-\bar{q}_i + \sum_{\ell\ne y_i} q_{i,\ell}\, e^{-v_{i,\ell}}\right) = \sum_{(i,\ell)\in\mathcal{B}} q_{i,\ell}\left(e^{-v_{i,\ell}} - 1\right).$$

Now let $M_{(i,\ell),j} = h_j(x_i, y_i) - h_j(x_i, \ell)$, and let $q^0_{i,\ell} = 1/k$. Plugging in these definitions gives that $B_\Phi(\mathbf{0}\,\|\,\mathcal{L}_{q^0}(M\lambda))$ is equal to Eq. (31). Thus, the algorithms of Sections 5, 6 and 7 can all be used to solve this minimization problem, and the corresponding convergence proofs are also directly applicable.
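The reduction just described is easy to check numerically. The sketch below is our own illustration (with `H[i, l, j]` playing the role of $h_j(x_i,\ell)$); it builds $M$ and confirms that the loss of Eq. (31), computed through $M$, coincides with the negative log-likelihood of the model of Eq. (30).

```python
import numpy as np

def multiclass_matrix(H, y):
    """Rows indexed by pairs (i, l) with l != y_i, grouped by i;
    M[(i, l), j] = h_j(x_i, y_i) - h_j(x_i, l).

    H has shape (m, k, n) with H[i, l, j] = h_j(x_i, l)."""
    m, k, _ = H.shape
    rows = [H[i, y[i]] - H[i, l]
            for i in range(m) for l in range(k) if l != y[i]]
    return np.array(rows)                    # shape (m*(k-1), n)

# Check the reduction: Eq. (31) computed through M equals the negative
# log-likelihood of the softmax model of Eq. (30).
rng = np.random.default_rng(0)
m, k, n = 20, 3, 4
H = rng.normal(size=(m, k, n))
y = rng.integers(0, k, size=m)
lam = rng.normal(size=n)

v = (multiclass_matrix(H, y) @ lam).reshape(m, k - 1)
loss_via_M = np.log1p(np.exp(-v).sum(axis=1)).sum()

F = H @ lam                                  # f_lambda(x_i, l)
nll = -(F[np.arange(m), y] - np.log(np.exp(F).sum(axis=1))).sum()
assert np.allclose(loss_via_M, nll)
```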

There are several multiclass versions of AdaBoost. AdaBoost.M2 [13] (a special case of AdaBoost.MR [20]) is based on the loss function

$$\sum_{(i,\ell)\in\mathcal{B}} \exp\left(f_\lambda(x_i,\ell) - f_\lambda(x_i,y_i)\right). \quad (33)$$

For this loss, we can use a similar setup, except for the choice of $\Phi$. We instead use

$$\Phi(q) = \sum_{(i,\ell)\in\mathcal{B}}\left(q_{i,\ell}\ln q_{i,\ell} - q_{i,\ell}\right)$$

for $q \in \Delta = \mathbb{R}_+^{\mathcal{B}}$. In fact, this is actually the same $\Phi$ used for (binary) AdaBoost. We have merely changed the index set to $\mathcal{B}$. Thus, as before,

$$B_\Phi(\mathbf{0}\,\|\,q) = \sum_{(i,\ell)\in\mathcal{B}} q_{i,\ell} \qquad\text{and}\qquad \mathcal{L}_q(v)_{i,\ell} = q_{i,\ell}\, e^{-v_{i,\ell}}.$$

Choosing $M$ as we did for multiclass logistic regression and $q^0 = \mathbf{1}$, we have that $B_\Phi(\mathbf{0}\,\|\,\mathcal{L}_{q^0}(M\lambda))$ is equal to the loss in Eq. (33). We can thus use the preceding algorithms to solve this multiclass problem as well. In particular, the sequential-update algorithm gives AdaBoost.M2.

AdaBoost.MH [20] is another multiclass version of AdaBoost. For AdaBoost.MH, we replace $\mathcal{B}$ by the index set

$$\{1, \ldots, m\}\times\mathcal{Y},$$

and for each example $i$ and label $\ell\in\mathcal{Y}$, we define

$$\tilde{y}_i(\ell) = \begin{cases}+1 & \text{if } \ell = y_i\\ -1 & \text{if } \ell \ne y_i.\end{cases}$$

The loss function for AdaBoost.MH is

$$\sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} \exp\left(-\tilde{y}_i(\ell)\, f_\lambda(x_i,\ell)\right). \quad (34)$$

We now let $M_{(i,\ell),j} = \tilde{y}_i(\ell)\, h_j(x_i,\ell)$ and use again the same $\Phi$ as in binary AdaBoost with $q^0 = \mathbf{1}$ to obtain this multiclass version of AdaBoost.

9 A COMPARISON TO ITERATIVE SCALING

In this section, we describe the generalized iterative scaling (GIS) procedure of Darroch and Ratcliff [9] for comparison to our algorithms. We largely follow the description of GIS given by Berger, Della Pietra and Della Pietra [1] for the multiclass case. To make the comparison as stark as possible, we present GIS in our notation and prove its convergence using the methods developed in previous sections. In doing so, we are also able to relax one of the key assumptions traditionally used in studying GIS.

We adopt the notation and set-up used for multiclass logistic regression in Section 8. (To our knowledge, there is no analog of GIS for the exponential loss, so we only consider the case of logistic loss.) We also extend this notation by defining $q_{i,y_i} = 1 - \bar{q}_i$ so that $q_{i,\ell}$ is now defined for all $\ell \in \mathcal{Y}$. Moreover, it can be verified that $q_{i,\ell} = \Pr[\ell\,|\,x_i]$ as defined in Eq. (30) if $q = \mathcal{L}_{q^0}(M\lambda)$.

In GIS, the following assumptions regarding the features are usually made:

$$\forall i, \ell, j :\ h_j(x_i,\ell) \ge 0 \qquad\text{and}\qquad \forall i, \ell :\ \sum_{j=1}^n h_j(x_i,\ell) = 1.$$

In this section, we prove that GIS converges with the second condition replaced by a milder one, namely, that

$$\forall i, \ell :\ \sum_{j=1}^n h_j(x_i,\ell) \le 1.$$

Since, in the multiclass case, a constant can be added to all features $h_j$ without changing the model or loss function, and since the features can be scaled by any constant, the two assumptions we consider clearly can be made to hold without loss of generality. The improved iterative scaling algorithm of Della Pietra, Della Pietra and Lafferty [10] also requires only these milder assumptions but is much more complicated to implement, requiring a numerical search (such as Newton-Raphson) for each feature on each iteration.

GIS works much like the parallel-update algorithm of Section 5 with $\Phi$, $M$ and $q^0$ as defined for multiclass logistic regression in Section 8. The only difference is in the computation of the vector of updates $\delta_t$, for which GIS requires direct access to the features $h_j$. Specifically, in GIS, $\delta_t$ is defined to be

$$\delta_{t,j} = \ln\left(\frac{H_j}{H_j(q_t)}\right)$$

where

$$H_j = \sum_{i=1}^m h_j(x_i, y_i), \qquad H_j(q) = \sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\, h_j(x_i,\ell).$$

Clearly, these updates are quite different from the updates described in this paper.
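For concreteness, GIS for the multiclass logistic model can be sketched as follows. The max-subtraction inside the softmax and the `eps` guard are ours, added only for numerical safety; the update itself is $\delta_j = \ln(H_j / H_j(q))$ as above.

```python
import numpy as np

def gis(H, y, n_rounds=200, eps=1e-12):
    """Generalized iterative scaling for the multiclass logistic model.

    H has shape (m, k, n) with H[i, l, j] = h_j(x_i, l).  The features must
    be nonnegative with sum_j H[i, l, j] <= 1 (the relaxed condition)."""
    m, _, n = H.shape
    lam = np.zeros(n)
    Hj = H[np.arange(m), y].sum(axis=0)            # H_j = sum_i h_j(x_i, y_i)
    for _ in range(n_rounds):
        F = H @ lam                                 # f_lambda(x_i, l), shape (m, k)
        P = np.exp(F - F.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)           # q_{i,l} = Pr[l | x_i], Eq. (30)
        Hj_q = np.einsum('il,ilj->j', P, H)         # H_j(q)
        lam += np.log((Hj + eps) / (Hj_q + eps))    # GIS update delta_j
    return lam
```

Theorem 6 below guarantees that these updates drive the logistic loss, equivalently the negative log-likelihood, downward.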

Using more notation from Sections 5 and 8, we can reformulate $H_j(q)$ within our framework as follows:

$$H_j(q) = \sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\, h_j(x_i,\ell) = \sum_{i=1}^m h_j(x_i,y_i) - \sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\left(h_j(x_i,y_i) - h_j(x_i,\ell)\right)$$
$$= H_j - \sum_{(i,\ell)\in\mathcal{B}} q_{i,\ell}\, M_{(i,\ell),j} = H_j - \left(W_j^+(q) - W_j^-(q)\right). \quad (35)$$

We can now prove the convergence of these updates using the usual auxiliary function method.

Theorem 6 Let $\Phi$, $M$ and $q^0$ be as above. Then the modified GIS algorithm described above converges to optimality in the sense of Theorem 3.

Proof: We will show that

$$A(q) = -D_U\left(\langle H_1, \ldots, H_n\rangle \,\big\|\, \langle H_1(q), \ldots, H_n(q)\rangle\right) = -\sum_{j=1}^n\left[H_j\ln\left(\frac{H_j}{H_j(q)}\right) - H_j + H_j(q)\right] \quad (36)$$

is an auxiliary function for the vectors $q_1, q_2, \ldots$ computed by GIS. Clearly, $A$ is continuous, and the usual nonnegativity properties of unnormalized relative entropy imply that $A(q) \le 0$ with equality if and only if $H_j = H_j(q)$ for all $j$. From Eq. (35), $H_j = H_j(q)$ if and only if $W_j^+(q) = W_j^-(q)$. Thus, $A(q) = 0$ implies that the constraints $q^{\mathrm{T}} M = \mathbf{0}$ hold, as in the proof of Theorem 3. All that remains to be shown is that

$$B_\Phi(\mathbf{0}\,\|\,\mathcal{L}_q(M\delta)) - B_\Phi(\mathbf{0}\,\|\,q) \le A(q) \quad (37)$$

where

$$\delta_j = \ln\left(\frac{H_j}{H_j(q)}\right).$$

We introduce the notation

$$\Delta_i(\ell) = \sum_{j=1}^n \delta_j\, h_j(x_i,\ell),$$

and then rewrite the gain as follows using Eq. (32):

$$B_\Phi(\mathbf{0}\,\|\,\mathcal{L}_q(M\delta)) - B_\Phi(\mathbf{0}\,\|\,q) = \sum_{i=1}^m\ln\left(q_{i,y_i} + \sum_{\ell\ne y_i} q_{i,\ell}\exp\left(-\sum_{j=1}^n \delta_j\, M_{(i,\ell),j}\right)\right)$$
$$= -\sum_{i=1}^m\Delta_i(y_i) + \sum_{i=1}^m\ln\left(e^{\Delta_i(y_i)}\left[q_{i,y_i} + \sum_{\ell\ne y_i} q_{i,\ell}\, e^{\Delta_i(\ell) - \Delta_i(y_i)}\right]\right). \quad (38)$$

Plugging in definitions, the first term of Eq. (38) can be written as

$$-\sum_{i=1}^m\Delta_i(y_i) = -\sum_{j=1}^n\ln\left(\frac{H_j}{H_j(q)}\right)\sum_{i=1}^m h_j(x_i,y_i) = -\sum_{j=1}^n H_j\ln\left(\frac{H_j}{H_j(q)}\right). \quad (39)$$

Next we derive an upper bound on the second term of Eq. (38):

$$\sum_{i=1}^m\ln\left(e^{\Delta_i(y_i)}\left[q_{i,y_i} + \sum_{\ell\ne y_i} q_{i,\ell}\, e^{\Delta_i(\ell)-\Delta_i(y_i)}\right]\right) = \sum_{i=1}^m\ln\left(q_{i,y_i}\, e^{\Delta_i(y_i)} + \sum_{\ell\ne y_i} q_{i,\ell}\, e^{\Delta_i(\ell)}\right) = \sum_{i=1}^m\ln\left(\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\, e^{\Delta_i(\ell)}\right)$$



Figure 4: The training logistic loss on data generated by a noisy hyperplane with many (left) or few (right) relevant features.

$$\le \sum_{i=1}^m\left(\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\, e^{\Delta_i(\ell)} - 1\right) \quad (40)$$
$$= \sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\left[\exp\left(\sum_{j=1}^n h_j(x_i,\ell)\,\delta_j\right) - 1\right] \quad (41)$$
$$\le \sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\sum_{j=1}^n h_j(x_i,\ell)\left(e^{\delta_j} - 1\right) \quad (42)$$
$$= \sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\sum_{j=1}^n h_j(x_i,\ell)\left[\frac{H_j}{H_j(q)} - 1\right] \quad (43)$$
$$= \sum_{j=1}^n\left[\frac{H_j}{H_j(q)} - 1\right]\sum_{i=1}^m\sum_{\ell\in\mathcal{Y}} q_{i,\ell}\, h_j(x_i,\ell) = \sum_{j=1}^n\left(H_j - H_j(q)\right). \quad (44)$$

Eq. (40) follows from the log bound $\ln x \le x - 1$. Eq. (42) uses Eq. (25) and our assumption on the form of the $h_j$'s. Eq. (43) follows from our definition of the update $\delta$.

Finally, combining Eqs. (36), (38), (39) and (44) gives Eq. (37), completing the proof.

It is clear that the differences between GIS and the updates given in this paper stem from Eq. (38), which is derived from $\ln(\alpha + \beta) = c + \ln(e^{-c}(\alpha + \beta))$, with $c = \Delta_i(y_i)$ on the $i$'th term in the sum. This choice of $c$ effectively means that the log bound is taken at a different point ($\ln(\alpha + \beta) = c + \ln(e^{-c}(\alpha + \beta)) \le c + e^{-c}(\alpha + \beta) - 1$). In this more general case, the bound is exact at $\alpha + \beta = e^c$; hence, varying $c$ varies where the bound is taken, and thereby varies the updates.

10 EXPERIMENTS

In this section, we briefly describe some experiments using synthetic data. These experiments are preliminary and are only intended to suggest the possibility of these algorithms' having practical value. More systematic experiments are clearly needed using both real-world and synthetic data, and comparing the new algorithms to other commonly used procedures.

We first tested how effective the methods are at minimizing the logistic loss on the training data. In the first experiment, we generated data using a very noisy hyperplane. More specifically, we first generated a random hyperplane in 100-dimensional space represented by a vector $u \in \mathbb{R}^{100}$ (chosen uniformly at random from the unit sphere). We then chose 300 points $x \in \mathbb{R}^{100}$, where each point is normally distributed $x \sim N(\mathbf{0}, I)$. We next assigned a label $y$ to each point depending on whether it fell above or below the chosen hyperplane, i.e., $y = \mathrm{sign}(u\cdot x)$. After each label was chosen, we perturbed each point $x$ by adding to it a random amount $\varepsilon$, where $\varepsilon \sim N(\mathbf{0}, 0.8\, I)$. This had the effect of causing the labels of points near the separating hyperplane to be more noisy than points that are farther from it. The features were identified with coordinates of $x$.

We ran the parallel- and sequential-update algorithms of Sections 5 and 6 (denoted "par" and "seq" in the figures) on this data. We also ran the sequential-update algorithm that is a special case of the parameterized family described in Section 7 (denoted "seq2"). Finally, we ran the iterative scaling algorithm described in Section 9 ("i.s.").

The results of this experiment are shown on the left of Fig. 4, which shows a plot of the logistic loss on the training set for each of the four methods as a function of the number of iterations. (The loss has been normalized to be 1 when $\lambda = \mathbf{0}$.) All of our methods do very well in comparison to iterative scaling. The parallel-update method is clearly the best, followed closely by the second sequential-update algorithm. The parallel-update method can be as much as 30 times faster (in terms of number of iterations) than iterative scaling.

On the right of Fig. 4 are shown the results of a similar experiment in which all but four of the components of $u$ were forced to be zero. In other words, there were only four relevant variables or features. In this experiment, the sequential-update algorithms, which perform a kind of feature selection, initially have a significant advantage over the parallel-update algorithm, but are eventually overtaken.

Figure 5: The test misclassification error on data generated by a noisy hyperplane with Boolean features.

In the last experiment, we tested how effective the new competitors of AdaBoost are at minimizing the test misclassification error. In this experiment, we chose a separating hyperplane $u$ as in the first experiment. Now, however, we chose 1000 points $x$ uniformly at random from the Boolean hypercube $\{-1, +1\}^{100}$. The labels were computed as before. After the labels were chosen, we flipped each coordinate of each point $x$ independently with probability $0.05$. This noise model again has the effect of causing examples near the decision surface to be noisier than those far from it.

For this experiment, we used the parallel- and sequential-update algorithms of Sections 5 and 6 (denoted "par" and "seq"). In both cases, we used variants based on exponential loss ("exp") and logistic loss ("log"). (In this case, the sequential-update algorithms of Sections 6 and 7 are identical.)

Fig. 5 shows a plot of the classification error on a separate test set of 5000 examples. There is not a very large difference in the performance of the exponential and logistic variants of the algorithms. However, the parallel-update variants start out doing much better, although eventually all of the methods converge to roughly the same performance level.

ACKNOWLEDGMENTS

Many thanks to Manfred Warmuth for first teaching us about Bregman distances and for many comments on an earlier draft. Thanks also to Nigel Duffy, David Helmbold and Raj Iyer for helpful discussions and suggestions. Some of this research was done while Yoram Singer was at AT&T Labs.

References

[1] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[2] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. U.S.S.R. Computational Mathematics and Mathematical Physics, 7(1):200-217, 1967.
[3] Leo Breiman. Arcing the edge. Technical Report 486, Statistics Department, University of California at Berkeley, 1997.
[4] Leo Breiman. Prediction games and arcing classifiers. Technical Report 504, Statistics Department, University of California at Berkeley, 1997.
[5] Nicolò Cesa-Bianchi, Anders Krogh, and Manfred K. Warmuth. Bounds on approximate steepest descent for likelihood maximization in exponential families. IEEE Transactions on Information Theory, 40(4):1215-1220, July 1994.
[6] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146-158, 1975.
[7] I. Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. The Annals of Probability, 12:768-793, 1984.
[8] I. Csiszár. Maxent, mathematics, and information theory. In Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods, pages 35-50, 1995.
[9] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470-1480, 1972.
[10] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):1-13, April 1997.
[11] Carlos Domingo and Osamu Watanabe. Scaling up a boosting-based learner via adaptive sampling. In Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000.
[12] Nigel Duffy and David Helmbold. Potential boosters? In Advances in Neural Information Processing Systems 11, 1999.
[13] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.
[14] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, to appear.
[15] Klaus-U. Höffgen and Hans-U. Simon. Robust trainability of single neurons. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 428-439, July 1992.
[16] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 134-144, 1999.
[17] John Lafferty. Additive models, boosting and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 125-133, 1999.
[18] John D. Lafferty, Stephen Della Pietra, and Vincent Della Pietra. Statistical learning algorithms based on Bregman distances. In Proceedings of the Canadian Workshop on Information Theory, 1997.
[19] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[20] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, December 1999.
[21] F. Topsøe. Information theoretical optimization techniques. Kybernetika, 15:7-17, 1979.
[22] Osamu Watanabe. From computational learning theory to discovery science. In Proceedings of the 26th International Colloquium on Automata, Languages and Programming, pages 134-148, 1999.
