Advances in Data Analysis and Classification manuscript No. (will be inserted by the editor)

A Review on Consistency and Robustness Properties of Support Vector Machines for Heavy-Tailed Distributions

Arnout Van Messem · Andreas Christmann

Received: date / Accepted: date

Abstract Support vector machines (SVMs) belong to the class of modern statistical

machine learning techniques and can be described as M-estimators with a Hilbert norm

regularization term for functions. SVMs are consistent and robust for classification and

regression purposes if based on a Lipschitz continuous loss and a bounded continu-

ous kernel with a dense reproducing kernel Hilbert space. For regression, one of the

conditions used is that the output variable Y has a finite first absolute moment. This

assumption, however, excludes heavy-tailed distributions. Recently, the applicability of

SVMs was enlarged to these distributions by considering shifted loss functions. In this

review paper, we briefly describe the approach of SVMs based on shifted loss functions

and list some properties of such SVMs. Then, we prove that SVMs based on a bounded

continuous kernel and on a convex and Lipschitz continuous, but not necessarily dif-

ferentiable, shifted loss function have a bounded Bouligand influence function for all

distributions, even for heavy-tailed distributions including extreme value distributions

and Cauchy distributions. SVMs are thus robust in this sense. Our result covers the

important loss functions ε-insensitive for regression and pinball for quantile regression,

which were not covered by earlier results on the influence function. We demonstrate

the usefulness of SVMs even for heavy-tailed distributions by applying SVMs to a sim-

ulated data set with Cauchy errors and to a data set of large fire insurance claims of

Copenhagen Re.

Keywords Regularized empirical risk minimization · support vector machines · consistency · robustness · Bouligand influence function · heavy tails.

Mathematics Subject Classification (2000) 68Q32 · 62G35 · 62G08 · 62F35 · 68T10.

A. Van Messem
Department of Mathematics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
Phone: +32-2-6293498, Fax: +32-2-6293495
E-mail: [email protected]

A. Christmann
Department of Mathematics, University of Bayreuth, 95440 Bayreuth, Germany
E-mail: [email protected]


1 Introduction and Overview

Support vector machines belong to the large class of techniques from modern sta-

tistical machine learning theory. Informally, SVMs are M-estimators with a Hilbert

norm regularization term for functions. SVMs are used both for classification and

regression purposes. Non-parametric regression aims to estimate a functional relation-

ship between an X -valued input random variable X and a Y-valued output random

variable Y , assuming that the joint distribution P of (X, Y ) is (almost) completely

unknown. Common choices of input and output spaces are X × Y = Rd × {−1, +1} in the binary classification case and X × Y = Rd × R in the regression case. In order to model this relationship one typically assumes that one has a training data set

D = ((x1, y1), . . . , (xn, yn)) ∈ (X × Y)n consisting of observations from independent

and identically distributed (i.i.d.) random variables (Xi, Yi), i = 1, . . . , n, which all

have the same distribution P. The goal is to build a function f : X → Y based on these

observations which will assign to each risk vector x a prediction f(x) which hopefully

will be a good approximation of the observed output y.

The goal of this article is to review some properties of SVMs and to show that SVMs

can even successfully deal with heavy-tailed conditional distributions of the response

variable Y given x. Therefore, we will mainly consider regression. Of course, the case of

classification where the output space Y is just a set of a finite number of real numbers

is a special case and thus classification is covered by the results we will show. The

problem of heavy tails is however not present in classification because the conditional

distribution of the response variable Y given x has then obviously a bounded support.

We will now motivate why support vector machines are defined as they are, for

the precise definition we refer to Definition 1, and describe the main goals of SVMs.

We refer to Vapnik (1998), Cristianini and Shawe-Taylor (2000), Scholkopf and Smola

(2002), and Steinwart and Christmann (2008b) for textbooks on SVMs and related

topics. To formalize the aim of estimating the predictor function f , we call a function

L : X × Y × R → [0,∞) a loss function (or just loss) if L is measurable. The loss

function assesses the quality of a prediction f(x) for an observed output value y by

L(x, y, f(x)). We follow the convention that the smaller L(x, y, f(x)) is the better the

prediction is. We will also assume that L(x, y, y) = 0 for all y ∈ Y, since practitioners

usually argue that the loss is zero, if the forecast f(x) equals the observed value y. We

call L (or L?) convex, continuous, Lipschitz continuous, or differentiable, if L (or L?)

has this property with respect to its third argument. E.g., L is Lipschitz continuous if

there exists a constant |L|1 ∈ (0,∞) such that, for all (x, y) ∈ X ×Y and all t1, t2 ∈ R,

|L(x, y, t1) − L(x, y, t2)| ≤ |L|1 |t1 − t2| .

Some popular loss functions for regression are the ε-insensitive loss

L(x, y, t) := max{0, |y − t| − ε}   for some ε > 0,

Huber’s loss

L(x, y, t) := 0.5(y − t)²  if |y − t| ≤ α,   and   L(x, y, t) := α|y − t| − 0.5α²  if |y − t| > α,

for some α > 0, and the logistic loss

L(x, y, t) := −ln( 4 exp(y − t) / (1 + exp(y − t))² )

for regression. The pinball loss

L(x, y, t) := (τ − 1)(y − t)  if y − t < 0,   and   L(x, y, t) := τ(y − t)  if y − t ≥ 0,

where τ > 0, is useful for quantile regression and often used in econometrics, see e.g.,

Koenker (2005). Plots of these loss functions can be seen in Figure 1. A classical loss

function for classification purposes is the hinge loss function

L(x, y, t) = max{0, 1− yt}, (x, y, t) ∈ X × {−1, +1} ×R.

All five loss functions mentioned above are convex and Lipschitz continuous, but

only the logistic loss is twice continuously (Frechet-)differentiable. Note that both the

ε-insensitive loss and the pinball loss are not (Frechet-)differentiable.


Fig. 1 The plot shows some commonly used loss functions for regression: ε-insensitive loss with ε = 0.5, pinball loss with τ = 0.7, Huber loss with α = 0.5, and logistic loss.
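
To make these definitions concrete, the following short R snippet (our own illustration; R is also the software used for the numerical examples in Section 4) implements the four regression losses of Figure 1 as well as the hinge loss. The parameter values simply mirror those used in the figure.

# Loss functions as defined above; r = y - t denotes the residual y - f(x).
eps_insensitive <- function(r, eps = 0.5) pmax(0, abs(r) - eps)
huber           <- function(r, alpha = 0.5) ifelse(abs(r) <= alpha,
                                                   0.5 * r^2,
                                                   alpha * abs(r) - 0.5 * alpha^2)
logistic        <- function(r) -log(4 * exp(r) / (1 + exp(r))^2)
pinball         <- function(r, tau = 0.7) ifelse(r < 0, (tau - 1) * r, tau * r)
hinge           <- function(y, t) pmax(0, 1 - y * t)   # classification, y in {-1, +1}

# Reproduce the shape of Figure 1 on a grid of residuals.
r <- seq(-3, 3, length.out = 301)
matplot(r, cbind(eps_insensitive(r), pinball(r), huber(r), logistic(r)),
        type = "l", xlab = "y - t", ylab = "loss")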

Of course, in order to assess the quality of a predictor f it is not sufficient to only

know the value L(x, y, f(x)) for a particular choice of (x, y), but in fact we need to

quantify how small the function (x, y) ↦ L(x, y, f(x)) is. There are various ways to do

this, but a common choice in statistical learning theory is to consider the expected loss

of f , also called the L-risk, defined by

RL,P(f) := EP L(X, Y, f(X)) .

The learning goal is then to find a function f∗ that (approximately) achieves the

smallest possible risk, i.e., the Bayes risk

R∗L,P := inf{RL,P(f) ; f : X → R measurable} .   (1)

However, since the distribution P is unknown, the risk RL,P(f) is also unknown and

consequently we cannot compute f∗. To resolve this problem, we can replace P by

the empirical distribution D = (1/n) ∑_{i=1}^n δ_{(x_i,y_i)} corresponding to the data set D. Here

δ_{(x_i,y_i)} denotes the Dirac distribution in (x_i, y_i). This way we obtain the empirical

L-risk

RL,D(f) := (1/n) ∑_{i=1}^n L(x_i, y_i, f(x_i)) .
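
As a small illustration (our own sketch, not part of the original text): for the regression losses above, which depend on (x, y, f(x)) only through the residual y − f(x), the empirical L-risk is simply an average loss, e.g. in R:

# Empirical L-risk RL,D(f): average loss over the data set, for residual-based losses.
empirical_risk <- function(loss, x, y, f) mean(loss(y - f(x)))

# Example with the eps-insensitive loss from the snippet above and the zero predictor:
# empirical_risk(eps_insensitive, x = runif(100), y = rnorm(100), f = function(x) 0 * x)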

Although RL,D(f) can be considered as an approximation of RL,P(f) for each

single f , solving inf_{f:X→R} RL,D(f) will in general not result in a good approximate

minimizer of RL,P( · ) due to the effect of overfitting, in which the learning method will

model the data D too closely and thus will have a poor performance on future data.

One way to reduce the danger of overfitting is to not consider all measurable func-

tions f : X → R but to choose a smaller, but still reasonably rich, class F of functions

that is assumed to contain a good approximation of the solution of (1). Instead of then

minimizing RL,D( · ) over all measurable functions, one minimizes only over F , i.e.,

one solves

inf_{f∈F} RL,D(f) .   (2)

This approach, called empirical risk minimization (ERM), often tends to produce approximate solutions of

R∗L,P,F := inf_{f∈F} RL,P(f) .   (3)

There are, however, two serious issues with ERM. The first one is that our knowledge

of the distribution P is often not good enough to identify a set F such that a solution of

(3) is a reasonably good approximation of a solution of (1). Or, the other way around,

we usually cannot guarantee that the model error R∗L,P,F −R∗L,P is sufficiently small.

The second problem is that solving (2) might be computationally infeasible.

A first step of SVMs to make the optimization problem computationally feasible is

to use a convex loss function, because it is easy to show that then the risk functional

f ↦ RL,P(f) is convex. If we further assume that the set F of functions over which we

will have to optimize is also convex, the learning method defined by (2) will become a

convex optimization problem.

A second step of SVMs towards computational feasibility is to consider only a

very specific set of functions, namely the reproducing kernel Hilbert space (RKHS) H of some measurable kernel k : X × X → R. Every RKHS is uniquely defined by its

kernel k and vice versa. The kernel can be used to describe all functions contained

in H. Moreover, the value k(x, x′) can often be interpreted as a measure of similarity

or dissimilarity between the two risk vectors x and x′. The function Φ : X → H,

Φ(x) := k( · , x) is called the canonical feature map. The reproducing property states

that f(x) = 〈f, Φ(x)〉H for all f ∈ H and all x ∈ X . A kernel k is called bounded, if

‖k‖∞ := sup{ √k(x, x) : x ∈ X } < ∞ .

A third step of SVMs towards computational feasibility – and also to uniqueness of

the SVM – is to use a special Hilbert norm regularization term which we will describe

now. It is well-known that the sum of a convex function and of a strictly convex function

over a convex set is strictly convex. Let us fix an RKHS H over the input space X and

denote the norm in H by ‖ · ‖H. The regularization term λ ‖f‖2H also serves to reduce

the danger of overfitting. It will penalize rather complex functions f which model the

output values in the training set D too closely, since these functions will have a large

RKHS norm. The term

RregL,P,λ(f) := RL,P(f) + λ ‖f‖²H

is called the regularized L-risk.

Definition 1 Let X be the input space, Y ⊂ R be the output space, L : X×Y×R→ R

be a loss function, H be a reproducing kernel Hilbert space of functions from X to R,

and λ > 0 be a constant. For a probability distribution P on X ×Y, a support vector

machine is defined as the minimizer

fL,P,λ := arg inf_{f∈H} RL,P(f) + λ ‖f‖²H .   (4)

The empirical SVM will be denoted by

fL,D,λ := arg inf_{f∈H} RL,D(f) + λ ‖f‖²H = arg inf_{f∈H} (1/n) ∑_{i=1}^n L(x_i, y_i, f(x_i)) + λ ‖f‖²H .   (5)

The reasons to consider convex losses are fourfold. Given some weak assumptions,

SVMs based on a convex loss function have at least the following desirable properties,

see e.g., Vapnik (1998), Cristianini and Shawe-Taylor (2000), Scholkopf and Smola

(2002), and Steinwart and Christmann (2008b) for details. (i) The objective function

in (4) becomes convex in f and a support vector machine fL,D,λ exists, is unique,

and depends continuously on the data points in basically all situations of interest. The

SVM is therefore the solution of a well-posed optimization problem in Hadamard’s

sense. Moreover, this minimizer is of the form

fL,D,λ = ∑_{i=1}^n α_i k(x_i, · ) ,   (6)

where k is the kernel corresponding to H and the αi ∈ R, i = 1 . . . n, are suitable

coefficients. We see that the minimizer fL,D,λ is thus a weighted sum of (at most) n

kernel functions k(xi, · ), where the weights αi are data-dependent. A consequence of

(6) is that the SVM fL,D,λ is contained in a finite dimensional space, even if the space

H itself is considerably larger. This observation makes it possible to consider even

infinite dimensional spaces H such as the one corresponding to the popular Gaussian

radial basis function (RBF) kernel defined by

kRBF(x, x′) = exp(−γ⁻² ‖x − x′‖²) ,   x, x′ ∈ Rd ,

where γ is a positive constant called the width. (ii) SVMs are under very weak assump-

tions L-risk consistent, i.e., for suitable null-sequences (λn) with λn > 0 we have that


RL,P(fL,D,λn) converges in probability to the Bayes risk R∗L,P for n →∞. (iii) SVMs

have good statistical robustness properties, if k is continuous and bounded and if L is

Lipschitz continuous with respect to its third argument, see Christmann and Steinwart

(2004, 2007), Christmann and Van Messem (2008), and Christmann et al. (2009). In

a nutshell, statistical robustness implies that the support vector machine fL,P,λ only

varies in a smooth and bounded manner if P changes slightly in the set M1 of all prob-

ability measures on X ×Y. (iv) There exist efficient numerical algorithms to determine

the vector of the weights¹ α = (α1, . . . , αn) in the empirical representation (6) and

therefore also the SVM fL,D,λ even for large and high-dimensional data sets D. From

a numerical point of view, the vector α is usually computed as a solution of the convex

dual problem derived from a Lagrange approach.
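
To make the representation (6) concrete, here is a minimal base-R sketch (our own code, not the authors'); the weight vector alpha is a placeholder, since in practice it is returned by the solver of the convex dual problem.

# Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / gamma^2), as defined above.
k_rbf <- function(x1, x2, gamma = 1) exp(-sum((x1 - x2)^2) / gamma^2)

# Evaluate f(x) = sum_i alpha_i k(x_i, x) as in (6); X is the n x d matrix of training inputs.
f_svm <- function(x, X, alpha, gamma = 1)
  sum(alpha * apply(X, 1, function(xi) k_rbf(xi, x, gamma)))

# Toy usage with made-up weights.
X     <- matrix(rnorm(20), ncol = 2)   # 10 training points in R^2
alpha <- rnorm(10)                     # placeholder weights
f_svm(c(0, 0), X, alpha, gamma = 1)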

The rest of the paper is organized as follows. In Section 2 we introduce the simple

concept of shifted loss functions L? and we explain their importance to investigate

statistical properties of support vector machines for all probability measures P, because

no moment assumptions on the conditional distribution of Y given x are needed to

guarantee the existence of SVMs. We will summarize some recently derived facts on

the loss L?, on the risk RL?,P(fL?,P,λ), and on consistency and robustness properties

of SVMs based on L?. Section 3 contains an important result, namely that the SVM

fL?,P,λ has a bounded Bouligand influence function under weak assumptions, even

for non-Frechet-differentiable loss functions L?. Hence the SVM fL?,P,λ is robust in

this sense. Section 4 illustrates the robustness properties of the SVM for heavy-tailed

distributions by giving some numerical examples and Section 5 contains a discussion.

The proof of the theoretical result of Section 3, together with some facts on Bouligand-

derivatives, is given in the Appendix.

2 SVMs based on shifted loss functions

The goal of this section is twofold. (i) We describe why SVMs for response variables

with heavy tails need a careful consideration. (ii) We show that SVMs based on shifted

loss functions L? are defined for all distributions and that they are identical to SVMs

based on unshifted loss functions for all data sets. This section reviews some results

derived on SVMs by Christmann and Steinwart (2008), Christmann and Van Messem

(2008), Steinwart and Christmann (2008a,b), and Christmann et al. (2009).

Although many of the results hold in a more general setting, we will make in this

review the following assumptions for the rest of the paper for three reasons: (i) these

assumptions are weak enough to cover many support vector machines used in practice,

(ii) it avoids listing special assumptions for each result, and (iii) these assumptions

are independent of the data set, i.e., they can really be checked.

General Assumption:

Let n ∈ N, X be a complete separable metric space (e.g., X = Rd), Y ⊂ R be a

non-empty and closed set (e.g., Y = R or Y = {−1, +1}), and P be a probability

distribution on X ×Y equipped with its Borel σ-algebra. Let L : X ×Y ×R→ [0,∞) be

¹ The vector α is in general not unique. This is easy to see from (6) for the case that two data points are identical, say (x_i, y_i) = (x_j, y_j) for some pair of indices i ≠ j. However, the SVM fL,D,λ exists and is unique under the very weak assumptions we will list in the next section.


a convex and Lipschitz continuous loss function and denote its shifted loss function

L? : X × Y ×R→ R by

L?(x, y, t) := L(x, y, t)− L(x, y, 0) .

Let k : X ×X → R be a continuous and bounded kernel with reproducing kernel Hilbert

space H of functions f : X → R. The canonical feature map is denoted by Φ : X → H,

where Φ(x) := k( · , x) for x ∈ X . Let λ ∈ (0,∞) and (λ_n)_{n∈N} be a null-sequence of

real numbers with λ_n ∈ (0,∞) and λ_n² n → ∞ for n → ∞.

Since L(x, y, t) ∈ [0,∞), it follows from the definition that −∞ < L?(x, y, t) < ∞.

Because we assumed Y ⊂ R to be closed, the probability measure P can be split up

into the marginal distribution PX on X and the conditional probability P(y|x) on Y.

It has been shown that the L-risk RL,P(f) is finite, if

∫_X |f(x)| dPX(x) < ∞   and   EP|Y| = ∫_X ∫_Y |y| dP(y|x) dPX(x) < ∞ ,

where the first condition is noted in short as f ∈ L1(PX). The latter condition, which

is one of the assumptions made for consistency and robustness proofs of SVMs for

an unbounded output set Y, unfortunately excludes heavy-tailed distributions such as

many stable distributions, including Cauchy distributions, as well as many extreme

value distributions which occur in financial or actuarial problems.

To enlarge the applicability of SVMs to heavy-tailed distributions, it has been

proposed to shift the loss L(x, y, t) downwards by the amount of L(x, y, 0) ∈ [0,∞), see

e.g., Huber (1967) for an early use of this trick for M-estimators (without regularization

term) in robust statistics. For brevity, we will call this the “L?-trick”. It was shown

that many important results on the SVM fL,P,λ using the traditional non-negative loss

function L also hold true for the SVM

fL?,P,λ := arg inf_{f∈H} RL?,P(f) + λ ‖f‖²H ,

based on the shifted loss function. Here,

RL?,P(f) := EP L?(X, Y, f(X))

denotes the L?-risk of f . Moreover, it was shown that

fL?,P,λ = fL,P,λ if fL,P,λ exists,

which has the great advantage that no new algorithms are needed to compute fL?,D,λ

because the empirical SVM fL,D,λ exists for all data sets D = ((x_1, y_1), . . . , (x_n, y_n)) ∈ (X × Y)^n. It was also shown that the L?-risk RL?,P(f) is finite for all f ∈ L1(PX),

no matter whether the moment condition EP|Y | < ∞ is fulfilled or violated. The

advantage of fL?,P,λ over fL,P,λ is therefore that fL?,P,λ is still well-defined and useful

for heavy-tailed conditional distributions of Y given x, for which the first absolute

moment EP|Y | is infinite. In particular, these results show that even in the case of

heavy-tailed distributions, the forecasts

fL?,D,λ_n(x) = fL,D,λ_n(x)

are consistent and robust with respect to the influence function and the maxbias, if a

bounded, continuous kernel and a convex, Lipschitz continuous loss function are used.


In this respect, the combination of the Gaussian RBF kernel with the ε-insensitive loss

function or Huber’s loss function for regression purposes or with the pinball loss for

quantile regression yields SVMs with good consistency and robustness properties.
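
The L?-trick is easy to see in code. The following R sketch (ours, not part of the original text) shows for the ε-insensitive loss that the shifted empirical objective differs from the unshifted one only by a constant which does not depend on the predictions, so that both objectives are minimized by the same function; this is precisely why fL?,D,λ = fL,D,λ for every data set.

# Shifted eps-insensitive loss: L?(x, y, t) = L(x, y, t) - L(x, y, 0).
L       <- function(y, t, eps = 0.1) pmax(0, abs(y - t) - eps)
L_shift <- function(y, t, eps = 0.1) L(y, t, eps) - L(y, 0, eps)

set.seed(1)
y <- rcauchy(200)    # heavy-tailed outputs: E|Y| does not exist
t <- rnorm(200)      # arbitrary predictions f(x_i)

# The two empirical objectives differ by mean(L(y, 0)), a constant independent of t:
mean(L(y, t)) - mean(L_shift(y, t))
mean(L(y, 0))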

Let us now briefly summarize these results. First of all, there exists a strong re-

lationship between the loss L and its shifted version L? in terms of convexity and

Lipschitz continuity. If L is Lipschitz continuous, L? is also Lipschitz continuous, and

both Lipschitz constants are equal, i.e., |L|1 = |L?|1, and if L is (strictly) convex, then

L? is (strictly) convex. This yields for a convex loss L that the mapping

f ↦ RL?,P(f) + λ ‖f‖²H ,   f ∈ H ,

is a strictly convex function because it is the sum of the convex risk functional RL?,P

and the strictly convex mapping f ↦ λ ‖f‖²H.

Furthermore, for L being a Lipschitz continuous loss, bounds for the risks RL?,P(f)

and RregL?,P,λ(f), for all f ∈ H, as well as useful upper bounds for ‖fL?,P,λ‖²H have been obtained.

Let us now assume that the kernel k is additionally bounded. Then both the supremum norm ‖fL?,P,λ‖∞ and the risk RL?,P(fL?,P,λ) are finite. If the partial Frechet- and Bouligand-derivatives² of L and L? exist for (x, y) ∈ X × Y, then, for all t ∈ R,

∇F_3 L?(x, y, t) = ∇F_3 L(x, y, t)   and   ∇B_3 L?(x, y, t) = ∇B_3 L(x, y, t) .   (7)

Furthermore, for L a Lipschitz continuous loss function, the mapping

RL?,P : L∞(PX) → R

is well-defined and continuous. As a consequence, the function f ↦ RregL?,P,λ(f) is

continuous, since both mappings f ↦ RL?,P(f) and f ↦ λ ‖f‖²H are continuous. If on

the other hand L is a (strictly) convex loss, then

RL?,P : H → [−∞,∞]

is (strictly) convex and

RregL?,P,λ : H → [−∞,∞]

is strictly convex. Also the Lipschitz continuity of L? can be related to the Lipschitz

continuity of its risk. For L a Lipschitz continuous loss and P a probability measure

on X × Y it has been shown that

|RL?,P(f)−RL?,P(g)| ≤ |L|1 · ‖f − g‖L1(PX) , f, g ∈ L∞(PX) .

The fact that for L a Lipschitz continuous loss we have RL?,P(f) /∈ {−∞, +∞} for

all f ∈ L1(PX) as well as RregL?,P,λ(f) > −∞ for all f ∈ L1(PX) ∩H ensures that the

optimization problem to determine fL?,P,λ is well-posed.

The following properties of support vector machines based on a shifted loss function

were derived under our General Assumption.

(i) For all λ > 0 there exists an SVM fL?,P,λ and it is unique.

² See Appendix A.1

(ii) If L is even Frechet-differentiable, then the mapping

g : (X × Y)n → H, g(D) = fL,D,λ

is continuous. Therefore, the solution depends in a continuous manner on the input

data. The uniqueness and existence of the solution, together with the continuity

of this mapping ensures that the SVM is the solution of a well-posed problem in

Hadamard’s sense.

(iii) Assume that the RKHS H is dense in L1(µ) for all distributions µ on X . Then

SVMs are consistent in the sense

RL?,P(fL?,D,λ_n) → R∗L?,P in probability,   n → ∞,   (8)

i.e., the L?-risk of the SVM fL?,D,λ_n converges to the Bayes risk which is defined

as the smallest possible risk over all measurable functions f : X → R. That the

universal consistency in (8) is valid for SVMs is somewhat astonishing at first glance

because fL?,D,λ_n is evaluated by minimizing a regularized empirical risk over the

RKHS H, whereas the Bayes risk is defined as the minimal non-regularized risk

over the broader set of all measurable functions f : X → R. To make this universal

consistency achievable, we made of course the denseness assumption on H.

(iv) If the null-sequence (λ_n)_{n∈N} of regularization parameters fulfills λ_n^{2+δ} n → ∞ for

some δ > 0, then the convergence in (8) even holds P∞-almost surely.

(v) In general, it is unclear whether this convergence of the risks also implies the con-

vergence of fL?,D,λ_n to a minimizer f∗L?,P corresponding to the Bayes risk R∗L?,P.

However, given some weak additional assumptions, this holds true for example for

the pinball loss function.

Before we list some statistical robustness properties of SVMs, we first recall the

definition of Hampel’s influence function which is a cornerstone of robust statistics.

Define the function T : M1(X × Y) → H as T (P) := fL?,P,λ. For M1(Z) the set

of distributions on some measurable space (Z,B(Z)) and H an RKHS, the influence

function (Hampel, 1968, 1974) of T : M1(Z) → H at z ∈ Z for a distribution P ∈ M1(Z) is defined as

IF(z; T, P) = lim_{ε↓0} ( T((1 − ε)P + εδ_z) − T(P) ) / ε ,

if the limit exists. In this approach of robust statistics, a statistical method T (P) is

called robust if its influence function (IF) is bounded.

(vi) The bias of the SVM increases at most linearly with the radius ε ∈ [0, 1] of a

mixture contamination neighbourhood around P. More precisely, it holds

‖fL?,(1−ε)P+εQ,λ − fL?,P,λ‖∞ ≤ ‖k‖∞ ‖fL?,(1−ε)P+εQ,λ − fL?,P,λ‖H ≤ c · ε ,

where c = c(L?, k) is some known finite constant and Q a distribution that lies in

the mixture contamination neighbourhood.

(vii) The influence function IF(z; T, P) for SVMs based on a convex and Lipschitz con-

tinuous shifted loss function exists and is bounded if the first two partial Frechet-

derivatives of the loss function L are continuous and bounded. These results hold

for all probability measures P on X ×Y and for all z := (x, y) ∈ X ×Y. It is easily

verifiable that the assumptions on L are fulfilled for, e.g., the logistic loss function

for regression in combination with the Gaussian RBF kernel.


Unfortunately, the previously mentioned conditions on the existence of partial

Frechet-derivatives of the loss function are not fulfilled for some losses that are often

used in practice, such as the ε-insensitive loss or the pinball loss. In the next section

we will see how this gap can be filled.

3 Bouligand influence function for SVMs

Recently, the Bouligand influence function (BIF) was introduced as a new measure in

robust statistics, see Christmann and Van Messem (2008). The Bouligand influence

function is in particular useful to study robustness properties of statistical functionals

which are defined as minimizers of non-Frechet-differentiable objective functions. A

special case are support vector machines based on the ε-insensitive loss or on the

pinball loss. The Bouligand influence function of the mapping T : M1(X × Y) → H for a distribution P in the direction of a distribution Q ≠ P is the special Bouligand-derivative³ (if it exists)

lim_{ε↓0} ‖ T((1 − ε)P + εQ) − T(P) − ε BIF(Q; T, P) ‖H / ε = 0 .

The BIF thus measures the impact of an infinitesimal small amount of contamination

of the original distribution P in the direction of Q on the quantity of interest T (P).

Therefore, a bounded BIF of the function T is desirable.

The only new theoretical result we will prove in this review paper is the following

result about the robustness of SVMs. The theorem states that, given some conditions,

support vector machines are robust in the sense of the Bouligand influence function,

even when they are based on heavy-tailed distributions.

Theorem 2 (Bouligand influence function) Let X be a complete separable

normed linear space⁴ and H be an RKHS of a bounded, continuous kernel k. Let L

be a convex, Lipschitz continuous loss function with Lipschitz constant |L|1 ∈ (0,∞).

Let the partial Bouligand-derivatives ∇B_3 L(x, y, · ) and ∇B_{3,3} L(x, y, · ) be measurable and bounded by

κ1 := sup_{(x,y)} ‖∇B_3 L(x, y, · )‖∞ ∈ (0,∞) ,   κ2 := sup_{(x,y)} ‖∇B_{3,3} L(x, y, · )‖∞ < ∞ .   (9)

Let P and Q ≠ P be probability measures on X × Y, δ1 > 0, δ2 > 0,

Nδ1(fL?,P,λ) := {f ∈ H : ‖f − fL?,P,λ‖H < δ1} ,

and λ > κ2 ‖k‖³∞ / 2. Define G : (−δ2, δ2) × Nδ1(fL?,P,λ) → H,

G(ε, f) := 2λf + E_{(1−ε)P+εQ} ∇B_3 L?(X, Y, f(X))Φ(X) ,   (10)

and assume that ∇B_2 G(0, fL?,P,λ) is strong. Then the Bouligand influence function BIF(Q; T, P) of T(P) := fL?,P,λ exists, is bounded, and equals

S⁻¹( EP ∇B_3 L?(X, Y, fL?,P,λ(X))Φ(X) ) − S⁻¹( EQ ∇B_3 L?(X, Y, fL?,P,λ(X))Φ(X) ) ,   (11)

where S := ∇B_2 G(0, fL?,P,λ) : H → H is given by

S( · ) = 2λ idH( · ) + EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X)) 〈Φ(X), · 〉H Φ(X) .

³ See Appendix A.1
⁴ E.g., X ⊂ Rd closed. By definition of the Bouligand-derivative, X has to be a normed linear space.

Note that the Bouligand influence function of the SVM only depends on Q via the

second term in (11). For the ε-insensitive loss we have (κ1, κ2) = (1, 0) and for the

pinball loss (κ1, κ2) = (max{1− τ, τ}, 0).
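
Although not stated explicitly above, the following simplification is immediate from Theorem 2 for losses with κ2 = 0, such as the ε-insensitive and the pinball loss (the short calculation is ours): in this case S = 2λ idH, so (11) reduces to

BIF(Q; T, P) = (2λ)⁻¹ ( EP ∇B_3 L?(X, Y, fL?,P,λ(X))Φ(X) − EQ ∇B_3 L?(X, Y, fL?,P,λ(X))Φ(X) ) ,

and, because ‖Φ(x)‖H ≤ ‖k‖∞ and ‖∇B_3 L?(x, y, · )‖∞ ≤ κ1, we obtain the uniform bound ‖BIF(Q; T, P)‖H ≤ κ1 ‖k‖∞ / λ for all distributions P and Q.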

4 Numerical considerations

To illustrate our robustness result, we include two numerical examples. The first exam-

ple treats some simulated data, based upon a Cauchy distribution, the second (real-life)

example is an insurance data set with extreme values. For both examples we used R (R

Development Core Team, 2009) for the numerical calculations.

4.1 Numerical example for simulated data

In the first example we try to predict the function f(x) = 50 sin(x/20) cos(x/10) + x

with an SVM. For this purpose, n = 1000 data points xi from a uniform distribution

U(−100, 100) have been generated. The corresponding output values yi were generated

by yi = f(xi)+εi, where εi were pseudo-random numbers from a Cauchy distribution.

The SVM-regression was done by using the ε-insensitive loss function and the Gaussian

RBF kernel as defined in the introduction. The computation was executed via the R

function svm from the library e1071. The hyperparameters (λ, ε, γ) of the SVM have

been determined by minimizing the L?-risk via a three dimensional grid search over

17 × 12 × 17 = 3468 knots, where λ is the regularization parameter of the SVM, ε is

the parameter of the ε-insensitive loss function, and γ is the parameter of the Gaussian

RBF kernel. The grid search resulted in the choice (λ, ε, γ) = (2⁻⁹/n, 2⁻⁸, 2⁻⁴).
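
The following R sketch (our own illustration, not the authors' original script) indicates how such a fit can be reproduced. The random seed as well as the translation of the regularization parameter λ and the kernel width γ into the cost and gamma arguments of the svm function from e1071 are our reading and are not stated in the paper.

library(e1071)

set.seed(1)                                  # hypothetical seed, not reported in the paper
n  <- 1000
f0 <- function(x) 50 * sin(x / 20) * cos(x / 10) + x
x  <- runif(n, -100, 100)
y  <- f0(x) + rcauchy(n)                     # Cauchy distributed errors (heavy tails)

lambda <- 2^-9 / n                           # hyperparameters found by the grid search above
eps    <- 2^-8
gam    <- 2^-4

# Assumed conversions: cost C = 1/(2 n lambda) and e1071's gamma = 1/gam^2,
# since e1071 parametrizes the RBF kernel as exp(-gamma * |x - x'|^2).
fit <- svm(x = matrix(x, ncol = 1), y = y, type = "eps-regression", kernel = "radial",
           cost = 1 / (2 * n * lambda), epsilon = eps, gamma = 1 / gam^2)

x.grid <- seq(-100, 100, length.out = 500)
plot(x, y, ylim = c(-150, 150), pch = ".")   # zoomed y-axis as in the lower sub-plot of Figure 2
lines(x.grid, f0(x.grid), lwd = 2)                                       # true function
lines(x.grid, predict(fit, matrix(x.grid, ncol = 1)), lty = 2, lwd = 2)  # fitted SVM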

Figure 2 shows the fitted SVM. The upper sub-plot clearly shows that there are

extreme values in the data set which was intended because we simulated error terms

from a Cauchy distribution, but the SVM fits nevertheless the pattern set by the

majority of the data points quite well. Due to these extreme values, a relatively small

systematic bias of the SVM would not be visible. Therefore, we zoomed in on the

interesting part of the y-axis to get a better view of the plot, as shown in the lower

sub-plot of Figure 2. In this graph we see that there is almost no bias and that the

SVM neatly covers the true function. Hence, this numerical example confirms our

theoretical robustness result of a bounded Bouligand influence function: the SVM is a

good approximation of the true function even for heavy-tailed distributions with large

extreme values, if the sample size is large enough.

4.2 Numerical example for large fire insurance claims in Denmark

For the second example we used the data set ‘danish’ from the R package evir. This set

consists of 2167 fire insurance claims over 1 million Danish Krone (DKK) during the pe-

riod from Thursday, January 3rd, 1980 until Monday, December 31st, 1990. The claims

are total figures, i.e., they include damage to buildings, furniture and personal property

as well as loss of profits. The data were supplied by Mette Rytgaard of Copenhagen Re.


Fig. 2 The upper sub-plot shows all data points, including all extreme values, as well as the true function and SVM. The difference between both functions is hardly visible. The lower sub-plot zooms in on the y-axis. It shows that the bias between the SVM and the true function is almost invisible in this example despite the extreme values.

These data form an irregular time series. The plot of the data shows that there really is

a time effect which we will use for our SVM purposes. We have done both classical least

squares regression as well as nonparametric conditional quantile regression by means of

SVMs. Time was the only explanatory variable. The SVM-regression was done by using

the pinball loss for different τ -values. We chose τ ∈ {0.50, 0.75, 0.90, 0.99, 0.995} since

we are interested in big claims. We used the Gaussian RBF kernel. The computations

were done with the function kqr from the package kernlab (Karatzoglou et al., 2004)

in R.
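
A minimal sketch of such a fit is given below (our own code, not the authors'); the values of the regularization constant C and of the kernel parameter sigma are illustrative placeholders, since the paper does not report the hyperparameters used.

library(evir)      # contains the 'danish' fire insurance claims
library(kernlab)   # contains kqr() for kernel based quantile regression

data(danish)
claims <- as.numeric(danish)
time   <- as.numeric(attr(danish, "times"))  # numeric time index of the claims

taus <- c(0.50, 0.75, 0.90, 0.99, 0.995)
fits <- lapply(taus, function(tau)
  kqr(x = matrix(time, ncol = 1), y = claims, tau = tau,
      C = 1, kernel = "rbfdot", kpar = list(sigma = 1)))  # C and sigma are placeholders

ls.fit <- lm(claims ~ time)                  # classical least squares fit for comparison

plot(time, claims, ylim = c(0, 40), pch = ".")
abline(ls.fit, lty = 3)
t.grid <- matrix(seq(min(time), max(time), length.out = 200), ncol = 1)
for (m in fits) lines(t.grid[, 1], predict(m, t.grid))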

Figure 3 shows the fitted curves for the Danish data set. The least squares fit is

shown as a dotted line, the SVM-quantiles are drawn as solid curves. The upper sub-

plot shows the data with 3 remarkable large extreme claims, these being the claims

of over 100 million DKK. However, the SVM-quantiles appear to follow the bulk of

the points, and do not seem to be much attracted by these extremes. This is in good

agreement to our theoretical result stating that SVMs are robust. Due to the extremes,

the conditional quantile curves fitted by SVMs and the LS-fit are hardly distinguishable

in the upper plot. Therefore we zoomed in on the y-axis to get a better view. On


the second sub-plot, we can clearly see all curves, but no longer the extreme data

points. The third sub-plot zooms even further, showing only the 90%-quantile and

lower quantiles, as well as the least squares regression line. We see that the LS fit lies

above the 50% SVM-quantile curve (and in this case often even above the 75% SVM-

quantile curve) because it is more attracted towards the extreme values and hence is

less robust.

The Danish data set is well-known in the literature on extreme value theory and

is therefore used as a benchmark data set in the R package evir developed by Alexan-

der McNeil and Alec Stephenson. To demonstrate that the residuals of an SVM for

nonparametric median regression based on the pinball loss function with τ = 0.5 have

a distribution close to an extreme value distribution, we fitted a generalized Pareto

(GPD) distribution by the maximum likelihood method to the residuals, see Pickands

(1975) and Hosking and Wallis (1989) for details. The computations were done with the

function gpd from the evir package. Figure 4 shows that a GPD distribution, which

is well-known to have heavy tails, actually offers a good fit for these residuals. The

ML estimates for the two parameters of the GPD distribution are 0.498 (0.149) for the

shape parameter and 7.824 (1.377) for the scale parameter.
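
The following sketch (ours) indicates how such a GPD fit can be carried out with the gpd function; the median fit, its hyperparameters, and the threshold 0 are our own illustrative choices and are not reported in the paper.

library(evir)
library(kernlab)

data(danish)
claims  <- as.numeric(danish)
time    <- matrix(as.numeric(attr(danish, "times")), ncol = 1)
med.fit <- kqr(x = time, y = claims, tau = 0.5, C = 1,
               kernel = "rbfdot", kpar = list(sigma = 1))    # illustrative hyperparameters
res     <- as.numeric(claims - predict(med.fit, time))       # residuals of the median fit

gpd.fit <- gpd(res, threshold = 0, method = "ml")   # ML fit of the GPD to the exceedances of 0
gpd.fit$par.ests   # shape (xi) and scale (beta) estimates
gpd.fit$par.ses    # corresponding standard errors
plot(gpd.fit)      # interactive menu with the four diagnostic plots of Figure 4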

5 Discussion

Support vector machines are successful statistical methods for classification and re-

gression purposes to analyze even large data sets with unknown, complex, and high-

dimensional dependency structures. Recently, consistency and robustness results for

SVMs were derived for unbounded output spaces, given that the moment condition

EP|Y | < ∞ is fulfilled. However, this assumption excludes heavy-tailed distributions

such as some stable distributions, including the Cauchy distribution, as well as many

extreme value distributions which occur in financial or actuarial problems.

By shifting the loss function, it is possible to enlarge the applicability of support

vector machines to situations where the output space Y is unbounded, e.g., Y = R or

Y = [0,∞), without the need for the above mentioned moment condition on the con-

ditional distribution of Y given x. Under rather weak conditions which do not depend

on the data or on the actual distribution P, and are thus easy to verify, support vector

machines based on shifted loss functions exist, are unique, depend in a continuous man-

ner on the data set, are risk consistent and have good statistical robustness properties.

These conditions are satisfied, if the researcher uses a convex Lipschitz continuous loss

function, a bounded continuous kernel with a dense reproducing kernel Hilbert space,

and a null-sequence of positive regularization parameters λ_n such that λ_n² n converges

to infinity. Even for regression purposes with an unbounded output space, support vec-

tor machines are robust with respect to many statistical aspects even for heavy-tailed

distributions like extreme value distributions or Cauchy distributions.

Appendix: Mathematical facts and proofs

A.1 Some facts on Bouligand-derivatives

We recall some facts on Bouligand-derivatives and strong approximation of functions, because these notions are used to investigate robustness properties for SVMs for nonsmooth loss functions in Theorem 2. Let E1, E2, W, and Z be normed linear spaces, and let us consider


Fig. 3 All data points, including the extreme values, are visible in the upper sub-plot. The SVM-quantiles seem to follow the mass of the data points and are not attracted to the extremes. The lower sub-plots zoom in on different scales of the y-axis. Here both the LS regression (dotted line) and the SVM-quantiles are distinguishable. There seems to be almost no attraction towards the extreme values for the SVM based quantile curves.


Fig. 4 Plots to check whether the residuals of the median regression based on SVMs with the pinball loss function have an approximate generalized Pareto distribution. Upper left: excess distribution; Upper right: tail of the underlying distribution; Lower left: scatterplot of the residuals; Lower right: qqplot of the residuals.

neighbourhoods N(x0) of x0 in E1, N(y0) of y0 in E2, and N(w0) of w0 in W. Let F and G be functions from N(x0) × N(y0) to Z, h1 and h2 functions from N(w0) to Z, f a function from N(x0) to Z and g a function from N(y0) to Z. A function f approximates F in x at (x0, y0), written as f ∼x F at (x0, y0), if

F(x, y0) − f(x) = o(x − x0) .

Similarly, g ∼y F at (x0, y0) if F(x0, y) − g(y) = o(y − y0). A function h1 strongly approximates h2 at w0, written as h1 ≈ h2 at w0, if for each ε > 0 there exists a neighbourhood N(w0) of w0 such that whenever w and w′ belong to N(w0),

‖(h1(w) − h2(w)) − (h1(w′) − h2(w′))‖ ≤ ε ‖w − w′‖ .

A function f strongly approximates F in x at (x0, y0), written as f ≈x F at (x0, y0), if for each ε > 0 there exist neighbourhoods N(x0) of x0 and N(y0) of y0 such that whenever x and x′ belong to N(x0) and y belongs to N(y0) we have

‖(F(x, y) − f(x)) − (F(x′, y) − f(x′))‖ ≤ ε ‖x − x′‖ .

Strong approximation amounts to requiring h1 − h2 to have a strong Frechet-derivative of 0 at w0, though neither h1 nor h2 is assumed to be differentiable in any sense. A similar definition is made for strong approximation in y. We define strong approximation for functions of several groups of variables, for example G ≈(x,y) F at (x0, y0), by replacing W by E1 × E2 and making the obvious substitutions. Note that one has both f ≈x F and g ≈y F at (x0, y0) exactly if f(x) + g(y) ≈(x,y) F at (x0, y0).

Recall that a function f : E1 → Z is called positive homogeneous if

f(αx) = αf(x) ∀α ≥ 0, ∀x ∈ E1 .

Following Robinson (1987) we can now define the Bouligand-derivative. Given a function f from an open subset U of a normed linear space E1 into another normed linear space Z, we say that f is Bouligand-differentiable at a point x0 ∈ U, if there exists a positive homogeneous function ∇Bf(x0) : U → Z such that f(x0 + h) = f(x0) + ∇Bf(x0)(h) + o(h), which can be rewritten as

lim_{h→0} ‖f(x0 + h) − f(x0) − ∇Bf(x0)(h)‖_Z / ‖h‖_E1 = 0 .

Let F : E1 × E2 → Z, and suppose that F has a partial Bouligand-derivative⁵ ∇B_1 F(x0, y0) with respect to x at (x0, y0). We say ∇B_1 F(x0, y0) is strong if

F(x0, y0) + ∇B_1 F(x0, y0)(x − x0) ≈x F at (x0, y0) .

If the Frechet-derivative exists, then the Bouligand-derivative exists and both are equal, but the converse is in general not true.
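
As an illustration (this worked example is ours and not part of the original text), the partial Bouligand-derivative of the pinball loss of Section 1 with respect to its third argument follows directly from the definition:

∇B_3 L(x, y, t)(h) = −τ h  if y − t > 0 ,   ∇B_3 L(x, y, t)(h) = (1 − τ) h  if y − t < 0 ,   ∇B_3 L(x, y, t)(h) = max{(1 − τ)h, −τ h}  if y = t .

Each case is positively homogeneous in h and bounded in absolute value by max{1 − τ, τ} |h|, and the map t ↦ ∇B_3 L(x, y, t) is constant on {t < y} and on {t > y}; this is consistent with the values (κ1, κ2) = (max{1 − τ, τ}, 0) stated after Theorem 2.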

A.2 Proof of Theorem 2

The proof of Theorem 2 is based upon the following implicit function theorem for Bouligand-derivatives, see Robinson (1991, Cor. 3.4). For a function f from a metric space (E1, d_E1) to another metric space (E2, d_E2), we define

δ(f, E1) = inf{ d_E2(f(x1), f(x2)) / d_E1(x1, x2) | x1 ≠ x2; x1, x2 ∈ E1 } .

Theorem 3 Let E2 be a Banach space and E1 and Z be normed linear spaces. Let x0 and y0 be points of E1 and E2, respectively, and let N(x0) be a neighborhood of x0 and N(y0) be a neighborhood of y0. Suppose that G is a function from N(x0) × N(y0) to Z with G(x0, y0) = 0. In particular, for some φ and each y ∈ N(y0), G( · , y) is assumed to be Lipschitz continuous on N(x0) with modulus φ. Assume that G has partial Bouligand-derivatives with respect to x and y at (x0, y0), and that: (i) ∇B_2 G(x0, y0)( · ) is strong. (ii) ∇B_2 G(x0, y0)(y − y0) lies in a neighborhood of 0 ∈ Z, ∀y ∈ N(y0). (iii) δ(∇B_2 G(x0, y0), N(y0) − y0) =: d0 > 0. Then for each ξ > d0⁻¹ φ there are neighborhoods U of x0 and V of y0, and a function f∗ : U → V satisfying (a) f∗(x0) = y0. (b) f∗ is Lipschitz continuous on N(x0) with modulus ξ. (c) For each x ∈ U, f∗(x) is the unique solution in V of G(x, y) = 0. (d) The function f∗ is Bouligand-differentiable at x0 with

∇Bf∗(x0)(u) = (∇B_2 G(x0, y0))⁻¹ ( −∇B_1 G(x0, y0)(u) ) .

The following consequence of the open mapping theorem, see Lax (2002, p. 170), is also needed for the proof of Theorem 2.

Theorem 4 Let E1 and E2 be Banach spaces, A : E1 → E2 be a bounded, linear, and bijective function. Then the inverse A⁻¹ : E2 → E1 is a bounded linear function.

Furthermore, we need the following well-known inequalities

‖f‖∞ ≤ ‖k‖∞ ‖f‖H   and   ‖Φ(x)‖∞ ≤ ‖k‖∞ ‖Φ(x)‖H ≤ ‖k‖²∞   (12)

for all f ∈ H and all x ∈ X . These follow from the reproducing property and the fact that ‖Φ(x)‖H = √k(x, x).

Remark 5 In order to prove Theorem 2, we will show that, under the same assumptions, we have:

(i) For some χ and each f ∈ Nδ1(fL?,P,λ), G( · , f) is Lipschitz continuous on (−δ2, δ2) with Lipschitz constant χ.

(ii) G has partial Bouligand-derivatives with respect to ε and f at (0, fL?,P,λ).

(iii) ∇B_2 G(0, fL?,P,λ)(h − fL?,P,λ) lies in a neighborhood of 0 ∈ H, ∀h ∈ Nδ1(fL?,P,λ).

(iv) d0 := inf_{h1,h2 ∈ Nδ1(fL,P,λ)−fL,P,λ ; h1 ≠ h2} ‖∇B_2 G(0, fL,P,λ)(h1) − ∇B_2 G(0, fL,P,λ)(h2)‖H / ‖h1 − h2‖H > 0.

(v) For each ξ > d0⁻¹ χ there exist constants δ3, δ4 > 0, a neighborhood Nδ3(fL?,P,λ) := {f ∈ H; ‖f − fL?,P,λ‖H < δ3}, and a function f∗ : (−δ4, δ4) → Nδ3(fL?,P,λ) satisfying
(v.1) f∗(0) = fL?,P,λ.
(v.2) f∗ is Lipschitz continuous on (−δ4, δ4) with Lipschitz constant |f∗|1 = ξ.
(v.3) For each ε ∈ (−δ4, δ4), f∗(ε) is the unique solution of G(ε, f) = 0 in Nδ3(fL?,P,λ).
(v.4) ∇Bf∗(0)(u) = (∇B_2 G(0, fL?,P,λ))⁻¹ ( −∇B_1 G(0, fL?,P,λ)(u) ), u ∈ (−δ4, δ4).

The function f∗ is the same as the one in Theorem 3.

⁵ Partial Bouligand-derivatives of f are denoted by ∇B_1 f, ∇B_2 f, ∇B_{2,2} f := ∇B_2(∇B_2 f), etc.

The main idea of the proof of Theorem 2 is to define the map G : R × H → H by (10). For ε < 0 the integration is w.r.t. a signed measure. The H-valued expectation that is used in the definition of G is well-defined for all ε ∈ (−δ2, δ2) and all f ∈ Nδ1(fL,P,λ), since κ1 ∈ (0,∞) by (9) and ‖Φ(x)‖∞ ≤ ‖k‖²∞ < ∞ by (12). Bouligand-derivatives fulfil a chain rule and Frechet-differentiability implies Bouligand-differentiability. For ε ∈ [0, 1] we thus obtain

G(ε, f) = ∂RregL?,(1−ε)P+εQ,λ / ∂H (f) = ∇B_2 RregL?,(1−ε)P+εQ,λ(f) .   (13)

The map f ↦ RregL?,(1−ε)P+εQ,λ(f) is convex and continuous for all ε ∈ [0, 1] and therefore equation (13) shows that G(ε, f) = 0 if and only if f = fL?,(1−ε)P+εQ,λ for such ε. Hence

G(0, fL?,P,λ) = 0 .   (14)

By using the steps in Remark 5, we will show that Theorem 3 is applicable for G and that there exists a Bouligand-differentiable function ε ↦ fε defined on a small interval (−δ2, δ2) for some δ2 > 0 satisfying G(ε, fε) = 0 for all ε ∈ (−δ2, δ2). Due to the existence of this function, we obtain that BIF(Q; T, P) = ∇Bfε(0). Now we can give the proof of our main result.

Proof of Theorem 2. In Section 2 we have seen that fL?,P,λ exists and is unique. The assumption that G(0, fL?,P,λ) = 0 is valid by (14). We will now prove the results of Remark 5 parts (i) to (v).

Remark 5 part (i). For a fixed f ∈ H, take ε1, ε2 ∈ (−δ2, δ2). Using ‖k‖∞ < ∞ and (14) we obtain

|E_{(1−ε1)P+ε1Q} ∇B_3 L?(X, Y, f(X))Φ(X) − E_{(1−ε2)P+ε2Q} ∇B_3 L?(X, Y, f(X))Φ(X)|
= |(ε1 − ε2) E_{Q−P} ∇B_3 L?(X, Y, f(X))Φ(X)|
≤ |ε1 − ε2| ∫ |∇B_3 L?(x, y, f(x))Φ(x)| d|Q − P|(x, y)
≤ |ε1 − ε2| ∫ sup_{(x,y)∈X×Y} |∇B_3 L?(x, y, f(x))| sup_{x∈X} |Φ(x)| d|Q − P|(x, y)
≤ |ε1 − ε2| ‖Φ(x)‖∞ sup_{(x,y)∈X×Y} ‖∇B_3 L?(x, y, · )‖∞ ∫ d|Q − P|(x, y)
≤ 2 ‖k‖²∞ sup_{(x,y)∈X×Y} ‖∇B_3 L?(x, y, · )‖∞ |ε1 − ε2|
= 2 ‖k‖²∞ κ1 |ε1 − ε2| < ∞ .

Remark 5 part (ii). We obtain

∇B_1 G(ε, f) = ∇B_1 ( E_{(1−ε)P+εQ} ∇B_3 L?(X, Y, f(X))Φ(X) )
= ∇B_1 ( EP ∇B_3 L?(X, Y, f(X))Φ(X) + ε E_{Q−P} ∇B_3 L?(X, Y, f(X))Φ(X) )
= E_{Q−P} ∇B_3 L?(X, Y, f(X))Φ(X)
= EQ ∇B_3 L?(X, Y, f(X))Φ(X) − EP ∇B_3 L?(X, Y, f(X))Φ(X) .   (15)

These expectations exist due to (12) and (9). We also have

∇B_2 G(0, fL?,P,λ)(h) + o(h)
= G(0, fL?,P,λ + h) − G(0, fL?,P,λ)
= 2λh + EP ∇B_3 L?(X, Y, (fL?,P,λ(X) + h(X)))Φ(X) − EP ∇B_3 L?(X, Y, fL?,P,λ(X))Φ(X)
= 2λh + EP ( ∇B_3 L?(X, Y, (fL?,P,λ(X) + h(X))) − ∇B_3 L?(X, Y, fL?,P,λ(X)) ) Φ(X) .

Since the term ∇B_3 L?(X, Y, (fL?,P,λ(X) + h(X))) − ∇B_3 L?(X, Y, fL?,P,λ(X)) is bounded due to (12), (9), and ‖k‖∞ < ∞, this expectation exists. Using 〈Φ(x), · 〉H ∈ H for all x ∈ X , we get

∇B_2 G(0, fL?,P,λ)( · ) = 2λ idH( · ) + EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X))〈Φ(X), · 〉H Φ(X) .   (16)

Note that EP ∇B_{3,3} L?(X, Y, f(X)) = ∇B_3 EP ∇B_3 L?(X, Y, f(X)), because by using the definition of the Bouligand-derivative, we obtain for the partial Bouligand-derivatives

∇B_3 EP ∇B_3 L?(X, Y, f(X)) − EP ∇B_{3,3} L?(X, Y, f(X))
= EP ( ∇B_3 L?(X, Y, (f(X) + h(X))) − ∇B_3 L?(X, Y, f(X)) ) − EP ∇B_{3,3} L?(X, Y, f(X)) + o(h)
= EP ( ∇B_3 L?(X, Y, (f(X) + h(X))) − ∇B_3 L?(X, Y, f(X)) − ∇B_{3,3} L?(X, Y, f(X)) ) + o(h)
= o(h) ,   h ∈ H .

Remark 5 part (iii). Let Nδ1(fL?,P,λ) be a δ1-neighborhood of fL?,P,λ. Since H is an RKHS and therefore a vector space, we have for all h ∈ Nδ1(fL?,P,λ) that ‖fL?,P,λ − h − 0‖H ≤ δ1 and thus h − fL?,P,λ ∈ Nδ1(0) ⊂ H. Note that ∇B_2 G(0, fL?,P,λ)( · ) as in (16) is a mapping from H to H. For ξ := h − fL?,P,λ we have that ‖ξ‖H ≤ δ1 and the reproducing property yields

∇B_2 G(0, fL?,P,λ)(ξ) = 2λξ + EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X)) ξ Φ(X) .

By using (12) and (9) we get

‖2λξ + EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X)) ξ Φ(X) − 0‖H
≤ 2λ ‖ξ‖H + ‖EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X)) ξ Φ(X)‖H
≤ 2λ ‖ξ‖H + sup_{(x,y)∈X×Y} ‖∇B_{3,3} L?(x, y, · )‖∞ ‖ξ‖∞ ‖Φ(x)‖∞
≤ 2λ ‖ξ‖H + κ2 ‖ξ‖H ‖k‖³∞
≤ (2λ + κ2 ‖k‖³∞) δ1 ,

and thus ∇B_2 G(0, fL?,P,λ)(h − fL?,P,λ) is in a neighborhood of 0 ∈ H, for all h ∈ Nδ1(fL?,P,λ).

Remark 5 part (iv). Due to (16) we have to prove that

d0 := inf_{f1 ≠ f2} ‖2λ(f1 − f2) + EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X))(f1 − f2)Φ(X)‖H / ‖f1 − f2‖H > 0 .

For f1 ≠ f2, the assumption λ > κ2 ‖k‖³∞ / 2, (12), and (9) yield that

‖2λ(f1 − f2) + EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X))(f1 − f2)Φ(X)‖H / ‖f1 − f2‖H
≥ ( ‖2λ(f1 − f2)‖H − ‖EP ∇B_{3,3} L?(X, Y, fL?,P,λ(X))(f1 − f2)Φ(X)‖H ) / ‖f1 − f2‖H
≥ 2λ − κ2 ‖k‖³∞ > 0 ,

which gives the assertion.

Remark 5 part (v). The assumptions of Robinson’s implicit function theorem, see Theorem 3, are valid for G due to the results of Remark 5 parts (i) to (iv) and the assumption that ∇B_2 G(0, fL?,P,λ) is strong. This gives part (v).

The result of Theorem 2 now follows from inserting (15) and (16) into Remark 5 part (v.4). Due to (9) and (7) it is clear that S is bounded. The linearity of S follows from its definition and the inverse of S does exist by Theorem 3. If necessary we can restrict the range of S to S(H) to obtain a bijective function S∗ : H → S(H) with S∗(f) = S(f) for all f ∈ H. Theorem 4 then gives us that S⁻¹ is also bounded and linear. This yields the existence of a bounded BIF specified by (11). □

References

Christmann, A. and Steinwart, I. (2004). On robust properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, 1007–1034.

Christmann, A. and Steinwart, I. (2007). Consistency and robustness of kernel based regression in convex minimization. Bernoulli, 13, 799–819.

Christmann, A. and Steinwart, I. (2008). Consistency of kernel based quantile regression. Appl. Stoch. Models Bus. Ind., 24, 171–183.

Christmann, A. and Van Messem, A. (2008). Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9, 915–936.

Christmann, A., Van Messem, A., and Steinwart, I. (2009). On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2, 311–327.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.

Hampel, F. R. (1968). Contributions to the theory of robust estimation. Unpublished Ph.D. thesis, Dept. of Statistics, University of California, Berkeley.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383–393.

Hosking, J. and Wallis, J. (1989). Parameter and quantile estimation for the generalized Pareto distribution. Technometrics, 29, 339–349.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the 5th Berkeley Symposium, 1, 221–233.

Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20.

Koenker, R. (2005). Quantile Regression. Cambridge University Press, New York.

Lax, P. D. (2002). Functional Analysis. Wiley & Sons, New York.

Pickands, J. (1975). Statistical inference using extreme order statistics. Ann. Statist., 3, 119–131.

R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Robinson, S. M. (1987). Local structure of feasible sets in nonlinear programming, Part III: Stability and sensitivity. Mathematical Programming Study, 30, 45–66.

Robinson, S. M. (1991). An implicit-function theorem for a class of nonsmooth functions. Mathematics of Operations Research, 16, 292–309.

Scholkopf, B. and Smola, A. J. (2002). Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, Massachusetts.

Steinwart, I. and Christmann, A. (2008a). How SVMs can estimate quantiles and the median. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, Massachusetts.

Steinwart, I. and Christmann, A. (2008b). Support Vector Machines. Springer, New York.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley & Sons, New York.