Advances in Data Analysis and Classification manuscript No. (will be inserted by the editor)

A Review on Consistency and Robustness Properties of Support Vector Machines for Heavy-Tailed Distributions
Arnout Van Messem · Andreas Christmann
Received: date / Accepted: date
Abstract Support vector machines (SVMs) belong to the class of modern statistical
machine learning techniques and can be described as M-estimators with a Hilbert norm
regularization term for functions. SVMs are consistent and robust for classification and
regression purposes if based on a Lipschitz continuous loss and a bounded continu-
ous kernel with a dense reproducing kernel Hilbert space. For regression, one of the
conditions used is that the output variable Y has a finite first absolute moment. This
assumption, however, excludes heavy-tailed distributions. Recently, the applicability of
SVMs was enlarged to these distributions by considering shifted loss functions. In this
review paper, we briefly describe the approach of SVMs based on shifted loss functions
and list some properties of such SVMs. Then, we prove that SVMs based on a bounded
continuous kernel and on a convex and Lipschitz continuous, but not necessarily dif-
ferentiable, shifted loss function have a bounded Bouligand influence function for all
distributions, even for heavy-tailed distributions including extreme value distributions
and Cauchy distributions. SVMs are thus robust in this sense. Our result covers the
important loss functions ε-insensitive for regression and pinball for quantile regression,
which were not covered by earlier results on the influence function. We demonstrate
the usefulness of SVMs even for heavy-tailed distributions by applying SVMs to a sim-
ulated data set with Cauchy errors and to a data set of large fire insurance claims of
Copenhagen Re.
Keywords Regularized empirical risk minimization · support vector machines · consistency · robustness · Bouligand influence function · heavy tails.
Mathematics Subject Classification (2000) 68Q32 · 62G35 · 62G08 · 62F35 · 68T10.
A. Van Messem
Department of Mathematics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
Phone: +32-2-6293498, Fax: +32-2-6293495
E-mail: [email protected]

A. Christmann
Department of Mathematics, University of Bayreuth, 95440 Bayreuth, Germany
E-mail: [email protected]
1 Introduction and Overview
Support vector machines belong to the large class of techniques from modern sta-
tistical machine learning theory. Informally, SVMs are M-estimators with a Hilbert
norm regularization term for functions. SVMs are used both for classification and
regression purposes. Non-parametric regression aims to estimate a functional relation-
ship between an X -valued input random variable X and a Y-valued output random
variable Y , assuming that the joint distribution P of (X, Y ) is (almost) completely
unknown. Common choices of input and output spaces are X × Y = Rd × {−1, +1} in the binary classification case and X × Y = Rd × R in the regression case. In order to model this relationship one typically assumes that one has a training data set
D = ((x1, y1), . . . , (xn, yn)) ∈ (X × Y)n consisting of observations from independent
and identically distributed (i.i.d.) random variables (Xi, Yi), i = 1, . . . , n, which all
have the same distribution P. The goal is to build a function f : X → Y based on these
observations which will assign to each risk vector x a prediction f(x) which hopefully
will be a good approximation of the observed output y.
The goal of this article is to review some properties of SVMs and to show that SVMs
can even successfully deal with heavy-tailed conditional distributions of the response
variable Y given x. Therefore, we will mainly consider regression. Of course, the case of
classification where the output space Y is just a set of a finite number of real numbers
is a special case and thus classification is covered by the results we will show. The
problem of heavy tails is however not present in classification because the conditional
distribution of the response variable Y given x has then obviously a bounded support.
We will now motivate why support vector machines are defined as they are, for
the precise definition we refer to Definition 1, and describe the main goals of SVMs.
We refer to Vapnik (1998), Cristianini and Shawe-Taylor (2000), Schölkopf and Smola
(2002), and Steinwart and Christmann (2008b) for textbooks on SVMs and related
topics. To formalize the aim of estimating the predictor function f , we call a function
L : X × Y × R → [0,∞) a loss function (or just loss) if L is measurable. The loss
function assesses the quality of a prediction f(x) for an observed output value y by
L(x, y, f(x)). We follow the convention that the smaller L(x, y, f(x)) is the better the
prediction is. We will also assume that L(x, y, y) = 0 for all y ∈ Y, since practitioners
usually argue that the loss is zero, if the forecast f(x) equals the observed value y. We
call L (or L⋆) convex, continuous, Lipschitz continuous, or differentiable, if L (or L⋆) has this property with respect to its third argument. E.g., L is Lipschitz continuous if there exists a constant |L|₁ ∈ (0,∞) such that, for all (x, y) ∈ X × Y and all t₁, t₂ ∈ R,

|L(x, y, t₁) − L(x, y, t₂)| ≤ |L|₁ |t₁ − t₂| .

Some popular loss functions for regression are the ε-insensitive loss

L(x, y, t) := max{0, |y − t| − ε} for some ε > 0 ,

Huber's loss

L(x, y, t) := 0.5 (y − t)² if |y − t| ≤ α, and α |y − t| − 0.5 α² if |y − t| > α,

for some α > 0, and the logistic loss

L(x, y, t) := − ln( 4 exp(y − t) / (1 + exp(y − t))² )

for regression. The pinball loss

L(x, y, t) := (τ − 1)(y − t) if y − t < 0, and τ (y − t) if y − t ≥ 0,

where τ > 0, is useful for quantile regression and often used in econometrics, see e.g., Koenker (2005). Plots of these loss functions can be seen in Figure 1. A classical loss function for classification purposes is the hinge loss function

L(x, y, t) = max{0, 1 − y t} , (x, y, t) ∈ X × {−1, +1} × R.
All five loss functions mentioned above are convex and Lipschitz continuous, but only the logistic loss is twice continuously (Fréchet-)differentiable. Note that neither the ε-insensitive loss nor the pinball loss is (Fréchet-)differentiable.
Fig. 1 The plot shows some commonly used loss functions for regression: ε-insensitive loss with ε = 0.5, pinball loss with τ = 0.7, Huber loss with α = 0.5, and logistic loss.
Of course, in order to assess the quality of a predictor f it is not sufficient to only
know the value L(x, y, f(x)) for a particular choice of (x, y), but in fact we need to
quantify how small the function (x, y) ↦ L(x, y, f(x)) is. There are various ways to do
this, but a common choice in statistical learning theory is to consider the expected loss
of f , also called the L-risk, defined by
RL,P(f) := EPL(X, Y, f(X)) .
The learning goal is then to find a function f∗ that (approximately) achieves the
smallest possible risk, i.e., the Bayes risk
R∗L,P := inf{RL,P(f) ; f : X → R measurable} . (1)
However, since the distribution P is unknown, the risk RL,P(f) is also unknown and
consequently we cannot compute f∗. To resolve this problem, we can replace P by
the empirical distribution D = (1/n) Σ_{i=1}^n δ(xi,yi) corresponding to the data set D. Here δ(xi,yi) denotes the Dirac distribution in (xi, yi). This way we obtain the empirical L-risk

RL,D(f) := (1/n) Σ_{i=1}^n L(xi, yi, f(xi)) .
Although RL,D(f) can be considered as an approximation of RL,P(f) for each
single f, solving inf_{f:X→R} RL,D(f) will in general not result in a good approximate
minimizer of RL,P( · ) due to the effect of overfitting, in which the learning method will
model the data D too closely and thus will have a poor performance on future data.
One way to reduce the danger of overfitting is to not consider all measurable func-
tions f : X → R but to choose a smaller, but still reasonably rich, class F of functions
that is assumed to contain a good approximation of the solution of (1). Instead of then
minimizing RL,D( · ) over all measurable functions, one minimizes only over F , i.e.,
one solves
inf_{f∈F} RL,D(f) . (2)

This approach, called empirical risk minimization (ERM), often tends to produce approximate solutions of

R∗L,P,F := inf_{f∈F} RL,P(f) . (3)
There are, however, two serious issues with ERM. The first one is that our knowledge
of the distribution P is often not good enough to identify a set F such that a solution of
(3) is a reasonably good approximation of a solution of (1). Or, the other way around,
we usually cannot guarantee that the model error R∗L,P,F −R∗L,P is sufficiently small.
The second problem is that solving (2) might be computationally infeasible.
A first step of SVMs to make the optimization problem computationally feasible is
to use a convex loss function, because it is easy to show that then the risk functional
f ↦ RL,P(f) is convex. If we further assume that the set F of functions over which we
will have to optimize is also convex, the learning method defined by (2) will become a
convex optimization problem.
A second step of SVMs towards computational feasibility is to consider only a very specific set of functions, namely the reproducing kernel Hilbert space (RKHS) H of some measurable kernel k : X × X → R. Every RKHS is uniquely defined by its
kernel k and vice versa. The kernel can be used to describe all functions contained
in H. Moreover, the value k(x, x′) can often be interpreted as a measure of similarity
or dissimilarity between the two risk vectors x and x′. The function Φ : X → H,
Φ(x) := k( · , x) is called the canonical feature map. The reproducing property states that f(x) = 〈f, Φ(x)〉_H for all f ∈ H and all x ∈ X . A kernel k is called bounded, if

‖k‖∞ := sup{ √k(x, x) : x ∈ X } < ∞ .
A third step of SVMs towards computational feasibility – and also to uniqueness of
the SVM – is to use a special Hilbert norm regularization term which we will describe
now. It is well-known that the sum of a convex function and of a strictly convex function
over a convex set is strictly convex. Let us fix an RKHS H over the input space X and
denote the norm in H by ‖ · ‖_H. The regularization term λ ‖f‖²_H also serves to reduce
the danger of overfitting. It will penalize rather complex functions f which model the
output values in the training set D too closely, since these functions will have a large
RKHS norm. The term

R^reg_{L,P,λ}(f) := RL,P(f) + λ ‖f‖²_H

is called the regularized L-risk.
Definition 1 Let X be the input space, Y ⊂ R be the output space, L : X×Y×R→ R
be a loss function, H be a reproducing kernel Hilbert space of functions from X to R,
and λ > 0 be a constant. For a probability distribution P on X ×Y, a support vector
machine is defined as the minimizer
fL,P,λ := arg inf_{f∈H} RL,P(f) + λ ‖f‖²_H . (4)

The empirical SVM will be denoted by

fL,D,λ := arg inf_{f∈H} RL,D(f) + λ ‖f‖²_H = arg inf_{f∈H} (1/n) Σ_{i=1}^n L(xi, yi, f(xi)) + λ ‖f‖²_H . (5)
The reasons to consider convex losses are fourfold. Given some weak assumptions,
SVMs based on a convex loss function have at least the following desirable properties,
see e.g., Vapnik (1998), Cristianini and Shawe-Taylor (2000), Schölkopf and Smola
(2002), and Steinwart and Christmann (2008b) for details. (i) The objective function
in (4) becomes convex in f and a support vector machine fL,D,λ exists, is unique,
and depends continuously on the data points in basically all situations of interest. The
SVM is therefore the solution of a well-posed optimization problem in Hadamard’s
sense. Moreover, this minimizer is of the form

fL,D,λ = Σ_{i=1}^n αi k(xi, · ) , (6)

where k is the kernel corresponding to H and the αi ∈ R, i = 1, . . . , n, are suitable coefficients. We see that the minimizer fL,D,λ is thus a weighted sum of (at most) n kernel functions k(xi, · ), where the weights αi are data-dependent. A consequence of (6) is that the SVM fL,D,λ is contained in a finite dimensional space, even if the space H itself is considerably larger. This observation makes it possible to consider even infinite dimensional spaces H such as the one corresponding to the popular Gaussian radial basis function (RBF) kernel defined by

kRBF(x, x′) = exp(−γ⁻² ‖x − x′‖²) , x, x′ ∈ Rd ,
where γ is a positive constant called the width. (ii) Under very weak assumptions, SVMs are L-risk consistent, i.e., for suitable null-sequences (λn) with λn > 0 we have that
RL,P(fL,D,λn) converges in probability to the Bayes risk R∗L,P for n →∞. (iii) SVMs
have good statistical robustness properties, if k is continuous and bounded and if L is
Lipschitz continuous with respect to its third argument, see Christmann and Steinwart
(2004, 2007), Christmann and Van Messem (2008), and Christmann et al. (2009). In
a nutshell, statistical robustness implies that the support vector machine fL,P,λ only
varies in a smooth and bounded manner if P changes slightly in the set M1 of all prob-
ability measures on X ×Y. (iv) There exist efficient numerical algorithms to determine
the vector of the weights1 α = (α1, . . . , αn) in the empirical representation (6) and
therefore also the SVM fL,D,λ even for large and high-dimensional data sets D. From
a numerical point of view, the vector α is usually computed as a solution of the convex
dual problem derived from a Lagrange approach.
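The representation (6) also invites a simple primal alternative to the dual approach mentioned in (iv): since fL,D,λ = Σ_i αi k(xi, · ), the regularized empirical risk (5) can be minimized over α ∈ Rⁿ directly. The following Python sketch (our own toy implementation with our own step sizes, not one of the efficient solvers referred to above) does this by subgradient descent for the ε-insensitive loss with a Gaussian RBF kernel:

```python
import numpy as np

# Toy primal solver: optimize the regularized empirical risk (5) over the
# coefficient vector alpha of the representer form (6), keeping the best
# iterate.  Illustrative only; names, defaults and step sizes are our own.

def rbf_kernel_matrix(x, gamma):
    """K_ij = exp(-gamma^-2 (x_i - x_j)^2) for one-dimensional inputs x."""
    d = x[:, None] - x[None, :]
    return np.exp(-(d / gamma) ** 2)

def reg_risk(alpha, K, y, lam, eps):
    """Regularized empirical risk (5): mean eps-insensitive loss + lam ||f||_H^2."""
    f = K @ alpha
    return np.mean(np.maximum(0.0, np.abs(y - f) - eps)) + lam * alpha @ K @ alpha

def fit_svm(K, y, lam=1e-2, eps=0.05, steps=2000, lr=5e-3):
    n = len(y)
    alpha = np.zeros(n)
    best_alpha, best = alpha.copy(), reg_risk(alpha, K, y, lam, eps)
    for _ in range(steps):
        r = K @ alpha - y
        s = np.sign(r) * (np.abs(r) > eps)       # a.e. derivative of the loss in f
        alpha = alpha - lr * (K @ s / n + 2.0 * lam * (K @ alpha))
        j = reg_risk(alpha, K, y, lam, eps)
        if j < best:
            best_alpha, best = alpha.copy(), j
    return best_alpha
```

On a toy sample, e.g. xi equally spaced in [−2, 2] with yi = sin(2 xi), a few thousand iterations should drive the regularized risk well below its value at α = 0; in practice one would of course use the dual solvers of (iv) instead.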
The rest of the paper is organized as follows. In Section 2 we introduce the simple concept of shifted loss functions L⋆ and explain their importance for investigating statistical properties of support vector machines for all probability measures P, because no moment assumptions on the conditional distribution of Y given x are needed to guarantee the existence of SVMs. We will summarize some recently derived facts on the loss L⋆, on the risk RL⋆,P(fL⋆,P,λ), and on consistency and robustness properties of SVMs based on L⋆. Section 3 contains an important result, namely that the SVM fL⋆,P,λ has a bounded Bouligand influence function under weak assumptions, even for non-Fréchet-differentiable loss functions L⋆. Hence the SVM fL⋆,P,λ is robust in this sense. Section 4 illustrates the robustness properties of the SVM for heavy-tailed distributions by giving some numerical examples and Section 5 contains a discussion. The proof of the theoretical result of Section 3, together with some facts on Bouligand-derivatives, is given in the Appendix.
2 SVMs based on shifted loss functions
The goal of this section is twofold. (i) We describe why SVMs for response variables
with heavy tails need a careful consideration. (ii) We show that SVMs based on shifted
loss functions L⋆ are defined for all distributions and that they are identical to SVMs
based on unshifted loss functions for all data sets. This section reviews some results
derived on SVMs by Christmann and Steinwart (2008), Christmann and Van Messem
(2008), Steinwart and Christmann (2008a,b), and Christmann et al. (2009).
Although many of the results hold in a more general setting, we will in this review make the following assumptions for the rest of the paper, for three reasons: (i) these assumptions are weak enough to cover many support vector machines used in practice, (ii) they avoid listing special assumptions for each result, and (iii) these assumptions are independent of the data set, i.e., they can really be checked.
General Assumption:
Let n ∈ N, X be a complete separable metric space (e.g., X = Rd), Y ⊂ R be a
non-empty and closed set (e.g., Y = R or Y = {−1, +1}), and P be a probability
distribution on X × Y equipped with its Borel σ-algebra. Let L : X × Y × R → [0,∞) be a convex and Lipschitz continuous loss function and denote its shifted loss function L⋆ : X × Y × R → R by

L⋆(x, y, t) := L(x, y, t) − L(x, y, 0) .

Let k : X × X → R be a continuous and bounded kernel with reproducing kernel Hilbert space H of functions f : X → R. The canonical feature map is denoted by Φ : X → H, where Φ(x) := k( · , x) for x ∈ X . Let λ ∈ (0,∞) and (λn)n∈N be a null-sequence of real numbers with λn ∈ (0,∞) and λn² n → ∞ for n → ∞.

¹ The vector α is in general not unique. This is easy to see from (6) for the case that two data points are identical, say (xi, yi) = (xj, yj) for some pair of indices i ≠ j. However, the SVM fL,D,λ exists and is unique under the very weak assumptions we will list in the next section.
Since L(x, y, t) ∈ [0,∞), it follows from the definition that −∞ < L⋆(x, y, t) < ∞. Because we assumed Y ⊂ R to be closed, the probability measure P can be split up into the marginal distribution PX on X and the conditional probability P(y|x) on Y. It has been shown that the L-risk RL,P(f) is finite, if

∫_X |f(x)| dPX(x) < ∞ and EP|Y| = ∫_X ∫_Y |y| dP(y|x) dPX(x) < ∞ ,

where the first condition is noted in short as f ∈ L1(PX). The latter condition, which
is one of the assumptions made for consistency and robustness proofs of SVMs for
an unbounded output set Y, unfortunately excludes heavy-tailed distributions such as
many stable distributions, including Cauchy distributions, as well as many extreme
value distributions which occur in financial or actuarial problems.
To enlarge the applicability of SVMs to heavy-tailed distributions, it has been proposed to shift the loss L(x, y, t) downwards by the amount of L(x, y, 0) ∈ [0,∞), see e.g., Huber (1967) for an early use of this trick for M-estimators (without regularization term) in robust statistics. For brevity, we will call this the “L⋆-trick”. It was shown that many important results on the SVM fL,P,λ using the traditional non-negative loss function L also hold true for the SVM

fL⋆,P,λ := arg inf_{f∈H} RL⋆,P(f) + λ ‖f‖²_H ,

based on the shifted loss function. Here,

RL⋆,P(f) := EP L⋆(X, Y, f(X))

denotes the L⋆-risk of f. Moreover, it was shown that

fL⋆,P,λ = fL,P,λ if fL,P,λ exists,

which has the great advantage that no new algorithms are needed to compute fL⋆,D,λ because the empirical SVM fL,D,λ exists for all data sets D = ((x1, y1), . . . , (xn, yn)) ∈ (X × Y)n. It was also shown that the L⋆-risk RL⋆,P(f) is finite for all f ∈ L1(PX), no matter whether the moment condition EP|Y| < ∞ is fulfilled or violated. The advantage of fL⋆,P,λ over fL,P,λ is therefore that fL⋆,P,λ is still well-defined and useful for heavy-tailed conditional distributions of Y given x, for which the first absolute moment EP|Y| is infinite. In particular, these results show that even in the case of heavy-tailed distributions, the forecasts

fL⋆,D,λn(x) = fL,D,λn(x)

are consistent and robust with respect to the influence function and the maxbias, if a bounded, continuous kernel and a convex, Lipschitz continuous loss function are used.
In this respect, the combination of the Gaussian RBF kernel with the ε-insensitive loss
function or Huber’s loss function for regression purposes or with the pinball loss for
quantile regression yields SVMs with good consistency and robustness properties.
Let us now briefly summarize these results. First of all, there exists a strong relationship between the loss L and its shifted version L⋆ in terms of convexity and Lipschitz continuity. If L is Lipschitz continuous, L⋆ is also Lipschitz continuous, and both Lipschitz constants are equal, i.e., |L|₁ = |L⋆|₁, and if L is (strictly) convex, then L⋆ is (strictly) convex. This yields for a convex loss L that the mapping

f ↦ RL⋆,P(f) + λ ‖f‖²_H , f ∈ H ,

is a strictly convex function because it is the sum of the convex risk functional RL⋆,P and the strictly convex mapping f ↦ λ ‖f‖²_H.
Furthermore, for L a Lipschitz continuous loss, bounds for the risks RL⋆,P(f) and R^reg_{L⋆,P,λ}(f), for all f ∈ H, as well as useful upper bounds for ‖fL⋆,P,λ‖²_H have been obtained.
Let us now assume that the kernel k is additionally bounded. Then both the supremum norm ‖fL⋆,P,λ‖∞ and the risk RL⋆,P(fL⋆,P,λ) are finite. If the partial Fréchet- and Bouligand-derivatives² of L and L⋆ exist for (x, y) ∈ X × Y, then, for all t ∈ R,

∇_3^F L⋆(x, y, t) = ∇_3^F L(x, y, t) and ∇_3^B L⋆(x, y, t) = ∇_3^B L(x, y, t) . (7)
Furthermore, for L a Lipschitz continuous loss function, the mapping

RL⋆,P : L∞(PX) → R

is well-defined and continuous. As a consequence, the function f ↦ R^reg_{L⋆,P,λ}(f) is continuous, since both mappings f ↦ RL⋆,P(f) and f ↦ λ ‖f‖²_H are continuous. If on the other hand L is a (strictly) convex loss, then

RL⋆,P : H → [−∞,∞]

is (strictly) convex and

R^reg_{L⋆,P,λ} : H → [−∞,∞]

is strictly convex. Also the Lipschitz continuity of L⋆ can be related to the Lipschitz continuity of its risk. For L a Lipschitz continuous loss and P a probability measure on X × Y it has been shown that

|RL⋆,P(f) − RL⋆,P(g)| ≤ |L|₁ · ‖f − g‖_{L1(PX)} , f, g ∈ L∞(PX) .

The fact that for L a Lipschitz continuous loss we have RL⋆,P(f) ∉ {−∞, +∞} for all f ∈ L1(PX) as well as R^reg_{L⋆,P,λ}(f) > −∞ for all f ∈ L1(PX) ∩ H ensures that the optimization problem to determine fL⋆,P,λ is well-posed.
The following properties of support vector machines based on a shifted loss function were derived under our General Assumption.
(i) For all λ > 0 there exists an SVM fL⋆,P,λ and it is unique.
² See Appendix A.1.
(ii) If L is even Fréchet-differentiable, then the mapping
g : (X × Y)n → H, g(D) = fL,D,λ
is continuous. Therefore, the solution depends in a continuous manner on the input
data. The uniqueness and existence of the solution, together with the continuity
of this mapping ensures that the SVM is the solution of a well-posed problem in
Hadamard’s sense.
(iii) Assume that the RKHS H is dense in L1(µ) for all distributions µ on X . Then SVMs are consistent in the sense

RL⋆,P(fL⋆,D,λn) →_P R∗L⋆,P , n → ∞, (8)

i.e., the L⋆-risk of the SVM fL⋆,D,λn converges in probability to the Bayes risk, which is defined as the smallest possible risk over all measurable functions f : X → R. That the universal consistency in (8) is valid for SVMs is somewhat astonishing at first glance, because fL⋆,D,λn is evaluated by minimizing a regularized empirical risk over the RKHS H, whereas the Bayes risk is defined as the minimal non-regularized risk over the broader set of all measurable functions f : X → R. To make this universal consistency achievable, we made of course the denseness assumption on H.
(iv) If the null-sequence (λn)n∈N of regularization parameters fulfills λn^{2+δ} n → ∞ for some δ > 0, then the convergence in (8) even holds P∞-almost surely.
(v) In general, it is unclear whether this convergence of the risks also implies the convergence of fL⋆,D,λn to a minimizer f∗L⋆,P corresponding to the Bayes risk R∗L⋆,P. However, given some weak additional assumptions, this holds true for example for the pinball loss function.
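Point (v) can be made concrete in the simplest possible setting: over constant functions t, minimizing the empirical pinball risk recovers the empirical τ-quantile, even for Cauchy data. A small Python check (our own illustration, using a brute-force grid in place of a proper solver):

```python
import numpy as np

def pinball_risk(y, t, tau):
    """Empirical pinball risk of the constant function f = t."""
    r = y - t
    return np.mean(np.where(r < 0, (tau - 1.0) * r, tau * r))

rng = np.random.default_rng(1)
y = rng.standard_cauchy(5000)            # heavy tails are no obstacle here
tau = 0.9

# brute-force minimization over a grid of candidate constants
grid = np.linspace(np.quantile(y, 0.5), np.quantile(y, 0.99), 2001)
risks = [pinball_risk(y, t, tau) for t in grid]
t_hat = grid[int(np.argmin(risks))]

# the risk minimizer agrees with the empirical 90% quantile (up to grid resolution)
assert abs(t_hat - np.quantile(y, tau)) < 0.05
```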
Before we list some statistical robustness properties of SVMs, we first recall the definition of Hampel's influence function, which is a cornerstone of robust statistics. Define the function T : M1(X × Y) → H as T(P) := fL⋆,P,λ. For M1(Z) the set of distributions on some measurable space (Z, B(Z)) and H an RKHS, the influence function (Hampel, 1968, 1974) of T : M1(Z) → H at z ∈ Z for a distribution P ∈ M1(Z) is defined as

IF(z; T, P) = lim_{ε↓0} ( T((1 − ε)P + εδz) − T(P) ) / ε ,

if the limit exists. In this approach of robust statistics, a statistical method T(P) is called robust if its influence function (IF) is bounded.
(vi) The bias of the SVM increases at most linearly with the radius ε ∈ [0, 1] of a mixture contamination neighbourhood around P. More precisely, it holds

‖fL⋆,(1−ε)P+εQ,λ − fL⋆,P,λ‖∞ ≤ ‖k‖∞ ‖fL⋆,(1−ε)P+εQ,λ − fL⋆,P,λ‖_H ≤ c · ε ,

where c = c(L⋆, k) is some known finite constant and Q a distribution that lies in the mixture contamination neighbourhood.
(vii) The influence function IF(z; T, P) for SVMs based on a convex and Lipschitz continuous shifted loss function exists and is bounded if the first two partial Fréchet-derivatives of the loss function L are continuous and bounded. These results hold for all probability measures P on X × Y and for all z := (x, y) ∈ X × Y. It is easily verifiable that the assumptions on L are fulfilled for, e.g., the logistic loss function for regression in combination with the Gaussian RBF kernel.
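A finite-sample analogue of the influence function makes this contrast tangible for simple functionals: the difference quotient (T((1 − ε)P + εδz) − T(P))/ε stays bounded in z for a robust functional such as the median, but grows without bound for the mean. A Python sketch (our own illustration; appending mass-point copies approximates the contaminated mixture):

```python
import numpy as np

def if_quotient(stat, y, z, eps=0.01):
    """Difference quotient (T((1-eps)P_n + eps delta_z) - T(P_n)) / eps,
    a finite-sample stand-in for Hampel's influence function."""
    n = len(y)
    k = int(round(eps * n / (1.0 - eps)))   # add k copies of z to mimic the mixture
    y_mix = np.concatenate([y, np.full(k, z)])
    return (stat(y_mix) - stat(y)) / eps

rng = np.random.default_rng(2)
y = rng.normal(size=10_001)

big = if_quotient(np.mean, y, z=1e6)      # grows like z: unbounded IF
small = if_quotient(np.median, y, z=1e6)  # stays near 1/(2*phi(0)): bounded IF
```

Making z larger leaves `small` essentially unchanged while `big` grows proportionally, which is exactly the boundedness dichotomy the IF formalizes.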
Unfortunately, the previously mentioned conditions on the existence of partial Fréchet-derivatives of the loss function are not fulfilled for some losses that are often used in practice, such as the ε-insensitive loss or the pinball loss. In the next section
we will see how this gap can be filled.
3 Bouligand influence function for SVMs
Recently, the Bouligand influence function (BIF) was introduced as a new measure in robust statistics, see Christmann and Van Messem (2008). The Bouligand influence function is in particular useful to study robustness properties of statistical functionals which are defined as minimizers of non-Fréchet-differentiable objective functions. A special case are support vector machines based on the ε-insensitive loss or on the pinball loss. The Bouligand influence function of the mapping T : M1(X × Y) → H for a distribution P in the direction of a distribution Q ≠ P is the special Bouligand-derivative³ (if it exists) satisfying

lim_{ε↓0} ‖ T((1 − ε)P + εQ) − T(P) − ε BIF(Q; T, P) ‖_H / ε = 0 .

The BIF thus measures the impact of an infinitesimally small amount of contamination of the original distribution P in the direction of Q on the quantity of interest T(P). Therefore, a bounded BIF of the function T is desirable.
The only new theoretical result we will prove in this review paper is the following result about the robustness of SVMs. The theorem states that, given some conditions, support vector machines are robust in the sense of the Bouligand influence function, even for heavy-tailed distributions.
Theorem 2 (Bouligand influence function) Let X be a complete separable normed linear space⁴ and H be an RKHS of a bounded, continuous kernel k. Let L be a convex, Lipschitz continuous loss function with Lipschitz constant |L|₁ ∈ (0,∞). Let the partial Bouligand-derivatives ∇_3^B L(x, y, · ) and ∇_{3,3}^B L(x, y, · ) be measurable and bounded by

κ₁ := sup_{(x,y)} ‖∇_3^B L(x, y, · )‖∞ ∈ (0,∞) , κ₂ := sup_{(x,y)} ‖∇_{3,3}^B L(x, y, · )‖∞ < ∞ . (9)

Let P and Q ≠ P be probability measures on X × Y, δ₁ > 0, δ₂ > 0,

N_{δ₁}(fL⋆,P,λ) := {f ∈ H : ‖f − fL⋆,P,λ‖_H < δ₁} ,

and λ > κ₂ ‖k‖³∞ / 2. Define G : (−δ₂, δ₂) × N_{δ₁}(fL⋆,P,λ) → H,

G(ε, f) := 2λ f + E_{(1−ε)P+εQ} ∇_3^B L⋆(X, Y, f(X)) Φ(X) , (10)

and assume that ∇_2^B G(0, fL⋆,P,λ) is strong. Then the Bouligand influence function BIF(Q; T, P) of T(P) := fL⋆,P,λ exists, is bounded, and equals

S⁻¹( EP ∇_3^B L⋆(X, Y, fL⋆,P,λ(X)) Φ(X) ) − S⁻¹( EQ ∇_3^B L⋆(X, Y, fL⋆,P,λ(X)) Φ(X) ) , (11)

where S := ∇_2^B G(0, fL⋆,P,λ) : H → H is given by

S( · ) = 2λ id_H( · ) + EP ∇_{3,3}^B L⋆(X, Y, fL⋆,P,λ(X)) 〈Φ(X), · 〉_H Φ(X) .

³ See Appendix A.1.
⁴ E.g., X ⊂ Rd closed. By definition of the Bouligand-derivative, X has to be a normed linear space.
Note that the Bouligand influence function of the SVM only depends on Q via the
second term in (11). For the ε-insensitive loss we have (κ1, κ2) = (1, 0) and for the
pinball loss (κ1, κ2) = (max{1− τ, τ}, 0).
4 Numerical considerations
To illustrate our robustness result, we include two numerical examples. The first example treats simulated data based on a Cauchy distribution; the second (real-life) example is an insurance data set with extreme values. For both examples we used R (R Development Core Team, 2009) for the numerical calculations.
4.1 Numerical example for simulated data
In the first example we try to predict the function f(x) = 50 sin(x/20) cos(x/10) + x
with an SVM. For this purpose, n = 1000 data points xi from a uniform distribution
U(−100, 100) have been generated. The corresponding output values yi were generated
by yi = f(xi)+εi, where εi were pseudo-random numbers from a Cauchy distribution.
The SVM-regression was done by using the ε-insensitive loss function and the Gaussian
RBF kernel as defined in the introduction. The computation was executed via the R
function svm from the library e1071. The hyperparameters (λ, ε, γ) of the SVM have been determined by minimizing the L⋆-risk via a three-dimensional grid search over 17 × 12 × 17 = 3468 knots, where λ is the regularization parameter of the SVM, ε is the parameter of the ε-insensitive loss function, and γ is the parameter of the Gaussian RBF kernel. The grid search resulted in the choice (λ, ε, γ) = (2⁻⁹/n, 2⁻⁸, 2⁻⁴).
Figure 2 shows the fitted SVM. The upper sub-plot clearly shows that there are extreme values in the data set, which was intended because we simulated error terms from a Cauchy distribution, but the SVM nevertheless fits the pattern set by the majority of the data points quite well. Due to these extreme values, a relatively small
systematic bias of the SVM would not be visible. Therefore, we zoomed in on the
interesting part of the y-axis to get a better view of the plot, as shown in the lower
sub-plot of Figure 2. In this graph we see that there is almost no bias and that the
SVM neatly covers the true function. Hence, this numerical example confirms our
theoretical robustness result of a bounded Bouligand influence function: the SVM is a
good approximation of the true function even for heavy-tailed distributions with large
extreme values, if the sample size is large enough.
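The set-up of this simulation is easy to reproduce, and the heavy tails can be quantified without any SVM machinery: robust summaries of the realized Cauchy errors concentrate near their theoretical values, while the errors themselves contain huge outliers. A Python sketch of the data-generating process (our own seed; the fit reported above was of course done with R's svm from e1071):

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    """Target function of the simulation: f(x) = 50 sin(x/20) cos(x/10) + x."""
    return 50.0 * np.sin(x / 20.0) * np.cos(x / 10.0) + x

n = 1000
x = rng.uniform(-100.0, 100.0, size=n)
eps = rng.standard_cauchy(size=n)          # heavy-tailed errors, E|eps| = infinity
y = true_f(x) + eps

# robust vs non-robust views of the realized noise:
med = np.median(y - true_f(x))             # near 0 (the Cauchy median)
mad = np.median(np.abs(y - true_f(x)))     # near 1 (the Cauchy upper quartile)
```

With n = 1000 Cauchy draws, the largest |eps| is typically in the hundreds, yet `med` and `mad` stay close to 0 and 1; a robust fit through such data sees essentially the pattern of true_f, which is what the bounded BIF predicts for the SVM.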
4.2 Numerical example for large fire insurance claims in Denmark
For the second example we used the data set ‘danish’ from the R package evir. This set
consists of 2167 fire insurance claims over 1 million Danish Krone (DKK) during the pe-
riod from Thursday, January 3rd, 1980 until Monday, December 31st, 1990. The claims
are total figures, i.e., they include damage to buildings, furniture and personal property
as well as loss of profits. The data were supplied by Mette Rytgaard of Copenhagen Re.
Fig. 2 The upper sub-plot shows all data points, including all extreme values, as well as the true function and SVM. The difference between both functions is hardly visible. The lower sub-plot zooms in on the y-axis. It shows that the bias between the SVM and the true function is almost invisible in this example despite the extreme values.
These data form an irregular time series. The plot of the data shows that there really is
a time effect which we will use for our SVM purposes. We have done both classical least
squares regression as well as nonparametric conditional quantile regression by means of
SVMs. Time was the only explanatory variable. The SVM-regression was done by using
the pinball loss for different τ -values. We chose τ ∈ {0.50, 0.75, 0.90, 0.99, 0.995} since
we are interested in big claims. We used the Gaussian RBF kernel. The computations
were done with the function kqr from the package kernlab (Karatzoglou et al., 2004)
in R.
Figure 3 shows the fitted curves for the Danish data set. The least squares fit is
shown as a dotted line, the SVM-quantiles are drawn as solid curves. The upper sub-
plot shows the data with three remarkably large extreme claims, these being the claims of over 100 million DKK. However, the SVM-quantiles appear to follow the bulk of the points, and do not seem to be much attracted by these extremes. This is in good agreement with our theoretical result stating that SVMs are robust. Due to the extremes,
the conditional quantile curves fitted by SVMs and the LS-fit are hardly distinguishable
in the upper plot. Therefore we zoomed in on the y-axis to get a better view. On
the second sub-plot, we can clearly see all curves, but no longer the extreme data
points. The third sub-plot zooms even further, showing only the 90%-quantile and
lower quantiles, as well as the least squares regression line. We see that the LS fit lies
above the 50% SVM-quantile curve (and in this case often even above the 75% SVM-
quantile curve) because it is more attracted towards the extreme values and hence is
less robust.
The Danish data set is well-known in the literature on extreme value theory and is therefore used as a benchmark data set in the R package evir developed by Alexander McNeil and Alec Stephenson. To demonstrate that the residuals of an SVM for nonparametric median regression based on the pinball loss function with τ = 0.5 have a distribution close to an extreme value distribution, we fitted a generalized Pareto distribution (GPD) to the residuals by the maximum likelihood method; see Pickands (1975) and Hosking and Wallis (1989) for details. The computations were done with the function gpd from the evir package. Figure 4 shows that a GPD, which is well known to have heavy tails, actually offers a good fit for these residuals. The ML estimates (with standard errors) for the two parameters of the GPD are 0.498 (0.149) for the shape parameter and 7.824 (1.377) for the scale parameter.
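The same maximum likelihood fit can be sketched outside R; below is a SciPy version (a stand-in for evir's gpd, not the authors' code), applied to synthetic GPD data with parameters close to the reported estimates:

```python
import numpy as np
from scipy.stats import genpareto

# Simulate GPD exceedances with shape 0.5 and scale 8, values close to
# the ML estimates reported for the Danish residuals (0.498 and 7.824).
sample = genpareto.rvs(c=0.5, scale=8.0, size=5000,
                       random_state=np.random.default_rng(1))

# Maximum likelihood fit with the location fixed at 0, as is usual
# for exceedances over a threshold.
shape_hat, loc_hat, scale_hat = genpareto.fit(sample, floc=0.0)
print(f"shape ~ {shape_hat:.2f}, scale ~ {scale_hat:.2f}")
```

With 5000 observations the estimates land close to the true (0.5, 8.0); on real residuals one would of course fit the exceedances over a chosen threshold.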
5 Discussion
Support vector machines are successful statistical methods for classification and re-
gression purposes to analyze even large data sets with unknown, complex, and high-
dimensional dependency structures. Recently, consistency and robustness results for
SVMs were derived for unbounded output spaces, given that the moment condition
E_P|Y| &lt; ∞ is fulfilled. However, this assumption excludes heavy-tailed distributions
such as some stable distributions, including the Cauchy distribution, as well as many
extreme value distributions which occur in financial or actuarial problems.
By shifting the loss function, it is possible to enlarge the applicability of support
vector machines to situations where the output space Y is unbounded, e.g., Y = R or
Y = [0,∞), without the need for the above mentioned moment condition on the con-
ditional distribution of Y given x. Under rather weak conditions which do not depend
on the data or on the actual distribution P, and are thus easy to verify, support vector
machines based on shifted loss functions exist, are unique, depend in a continuous man-
ner on the data set, are risk consistent and have good statistical robustness properties.
These conditions are satisfied if the researcher uses a convex Lipschitz continuous loss function, a bounded continuous kernel with a dense reproducing kernel Hilbert space, and a null sequence of positive regularization parameters λn such that λn² n converges to infinity. Even for regression purposes with an unbounded output space, support vector machines are robust with respect to many statistical aspects, even for heavy-tailed distributions such as extreme value distributions or Cauchy distributions.
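As a simple numerical illustration of why the moment condition E_P|Y| &lt; ∞ matters (our own toy example, not taken from the paper): for Cauchy data the sample mean never stabilizes, while the sample median, i.e. the empirical 0.5-quantile targeted by the pinball loss, does.

```python
import numpy as np

rng = np.random.default_rng(42)

# Repeatedly estimate location on heavy-tailed Cauchy samples.
means, medians = [], []
for _ in range(200):
    y = rng.standard_cauchy(10_000)
    means.append(y.mean())        # no law of large numbers: E|Y| = inf
    medians.append(np.median(y))  # consistent for the true median 0

# The medians cluster tightly around 0; the means scatter wildly.
print("spread of medians:", float(np.std(medians)))
print("spread of means:  ", float(np.std(means)))
```

The mean of n Cauchy variables is again Cauchy distributed, so averaging does not help at all; quantile-based estimators avoid the moment condition entirely.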
Appendix: Mathematical facts and proofs
A.1 Some facts on Bouligand-derivatives
We recall some facts on Bouligand-derivatives and strong approximation of functions, because these notions are used to investigate robustness properties of SVMs for nonsmooth loss functions in Theorem 2. Let E1, E2, W, and Z be normed linear spaces, and let us consider
Fig. 3 All data points, including the extreme values, are visible in the upper sub-plot. The SVM-quantiles seem to follow the mass of the data points and are not attracted to the extremes. The lower sub-plots zoom in on different scales of the y-axis. Here both the LS regression (dotted line) and the SVM-quantiles are distinguishable. There seems to be almost no attraction towards the extreme values for the SVM-based quantile curves.
Fig. 4 Plots to check whether the residuals of the median regression based on SVMs with the pinball loss function have an approximate generalized Pareto distribution. Upper left: excess distribution; upper right: tail of the underlying distribution; lower left: scatterplot of the residuals; lower right: QQ-plot of the residuals.
neighbourhoods N(x0) of x0 in E1, N(y0) of y0 in E2, and N(w0) of w0 in W. Let F and G be functions from N(x0) × N(y0) to Z, h1 and h2 functions from N(w0) to Z, f a function from N(x0) to Z, and g a function from N(y0) to Z. A function f approximates F in x at (x0, y0), written as f ∼x F at (x0, y0), if

F(x, y0) − f(x) = o(x − x0) .

Similarly, g ∼y F at (x0, y0) if F(x0, y) − g(y) = o(y − y0). A function h1 strongly approximates h2 at w0, written as h1 ≈ h2 at w0, if for each ε > 0 there exists a neighbourhood N(w0) of w0 such that whenever w and w′ belong to N(w0),

‖( h1(w) − h2(w) ) − ( h1(w′) − h2(w′) )‖ ≤ ε ‖w − w′‖ .

A function f strongly approximates F in x at (x0, y0), written as f ≈x F at (x0, y0), if for each ε > 0 there exist neighbourhoods N(x0) of x0 and N(y0) of y0 such that whenever x and x′ belong to N(x0) and y belongs to N(y0) we have

‖( F(x, y) − f(x) ) − ( F(x′, y) − f(x′) )‖ ≤ ε ‖x − x′‖ .

Strong approximation amounts to requiring h1 − h2 to have a strong Fréchet-derivative of 0 at w0, though neither h1 nor h2 is assumed to be differentiable in any sense. A similar definition is made for strong approximation in y. We define strong approximation for functions of several groups of variables, for example G ≈(x,y) F at (x0, y0), by replacing W by E1 × E2 and making the obvious substitutions. Note that one has both f ≈x F and g ≈y F at (x0, y0) exactly if f(x) + g(y) ≈(x,y) F at (x0, y0).
Recall that a function f : E1 → Z is called positive homogeneous if

f(αx) = αf(x) for all α ≥ 0 and all x ∈ E1 .
Following Robinson (1987) we can now define the Bouligand-derivative. Given a function f from an open subset U of a normed linear space E1 into another normed linear space Z, we say that f is Bouligand-differentiable at a point x0 ∈ U if there exists a positive homogeneous function ∇B f(x0) : U → Z such that f(x0 + h) = f(x0) + ∇B f(x0)(h) + o(h), which can be rewritten as

lim_{h→0} ‖f(x0 + h) − f(x0) − ∇B f(x0)(h)‖Z / ‖h‖E1 = 0 .
Let F : E1 × E2 → Z, and suppose that F has a partial Bouligand-derivative⁵ ∇B1 F(x0, y0) with respect to x at (x0, y0). We say ∇B1 F(x0, y0) is strong if

F(x0, y0) + ∇B1 F(x0, y0)(x − x0) ≈x F at (x0, y0) .

If the Fréchet-derivative exists, then the Bouligand-derivative exists and both are equal, but the converse is in general not true.
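A standard illustration of the last remark (our addition, not part of the original text): the absolute value function is Bouligand- but not Fréchet-differentiable at the origin.

```latex
% E_1 = Z = \mathbb{R}, \quad f(x) = |x|, \quad x_0 = 0:
f(0 + h) = |h| = f(0) + \nabla^B f(0)(h) + o(h)
\quad\text{with}\quad \nabla^B f(0)(h) = |h| .
% The map h \mapsto |h| is positive homogeneous, so the Bouligand-derivative
% exists at 0; it is not linear in h, so no Fr\'echet-derivative exists there.
```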
A.2 Proof of Theorem 2
The proof of Theorem 2 is based upon the following implicit function theorem for Bouligand-derivatives, see Robinson (1991, Cor. 3.4). For a function f from a metric space (E1, dE1) to another metric space (E2, dE2), we define

δ(f, E1) = inf{ dE2( f(x1), f(x2) ) / dE1(x1, x2) | x1 ≠ x2; x1, x2 ∈ E1 } .
Theorem 3 Let E2 be a Banach space and E1 and Z be normed linear spaces. Let x0 and y0 be points of E1 and E2, respectively, and let N(x0) be a neighborhood of x0 and N(y0) be a neighborhood of y0. Suppose that G is a function from N(x0) × N(y0) to Z with G(x0, y0) = 0. In particular, for some φ and each y ∈ N(y0), G( · , y) is assumed to be Lipschitz continuous on N(x0) with modulus φ. Assume that G has partial Bouligand-derivatives with respect to x and y at (x0, y0), and that: (i) ∇B2 G(x0, y0)( · ) is strong. (ii) ∇B2 G(x0, y0)(y − y0) lies in a neighborhood of 0 ∈ Z for all y ∈ N(y0). (iii) δ(∇B2 G(x0, y0), N(y0) − y0) =: d0 > 0. Then for each ξ > d0⁻¹ φ there are neighborhoods U of x0 and V of y0, and a function f∗ : U → V satisfying (a) f∗(x0) = y0. (b) f∗ is Lipschitz continuous on N(x0) with modulus ξ. (c) For each x ∈ U, f∗(x) is the unique solution in V of G(x, y) = 0. (d) The function f∗ is Bouligand-differentiable at x0 with

∇B f∗(x0)(u) = ( ∇B2 G(x0, y0) )⁻¹ ( −∇B1 G(x0, y0)(u) ) .
The following consequence of the open mapping theorem, see Lax (2002, p. 170), is also needed for the proof of Theorem 2.

Theorem 4 Let E1 and E2 be Banach spaces and A : E1 → E2 be a bounded, linear, and bijective function. Then the inverse A⁻¹ : E2 → E1 is a bounded linear function.
Furthermore, we need the following well-known inequalities

‖f‖∞ ≤ ‖k‖∞ ‖f‖H  and  ‖Φ(x)‖∞ ≤ ‖k‖∞ ‖Φ(x)‖H ≤ ‖k‖²∞   (12)

for all f ∈ H and all x ∈ X. These follow from the reproducing property and the fact that ‖Φ(x)‖H = √k(x, x).
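The first inequality in (12) can be verified in one line from the reproducing property f(x) = ⟨f, Φ(x)⟩H and the Cauchy–Schwarz inequality, using the convention ‖k‖∞ := sup_{x∈X} √k(x, x) (a routine derivation added for convenience):

```latex
|f(x)| = |\langle f, \Phi(x) \rangle_H|
\le \|f\|_H \, \|\Phi(x)\|_H
= \|f\|_H \, \sqrt{k(x,x)}
\le \|k\|_\infty \, \|f\|_H
\qquad \text{for all } x \in \mathcal{X},
```

and taking the supremum over x gives ‖f‖∞ ≤ ‖k‖∞ ‖f‖H.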
Remark 5 In order to prove Theorem 2, we will show that, under the same assumptions, we have:

(i) For some χ and each f ∈ Nδ1(fL⋆,P,λ), G( · , f) is Lipschitz continuous on (−δ2, δ2) with Lipschitz constant χ.
(ii) G has partial Bouligand-derivatives with respect to ε and f at (0, fL⋆,P,λ).
(iii) ∇B2 G(0, fL⋆,P,λ)(h − fL⋆,P,λ) lies in a neighborhood of 0 ∈ H for all h ∈ Nδ1(fL⋆,P,λ).
(iv) d0 := inf{ ‖∇B2 G(0, fL⋆,P,λ)(h1) − ∇B2 G(0, fL⋆,P,λ)(h2)‖H / ‖h1 − h2‖H | h1, h2 ∈ Nδ1(fL⋆,P,λ) − fL⋆,P,λ; h1 ≠ h2 } > 0.
(v) For each ξ > d0⁻¹ χ there exist constants δ3, δ4 > 0, a neighborhood Nδ3(fL⋆,P,λ) := {f ∈ H; ‖f − fL⋆,P,λ‖H < δ3}, and a function f∗ : (−δ4, δ4) → Nδ3(fL⋆,P,λ) satisfying
(v.1) f∗(0) = fL⋆,P,λ.
(v.2) f∗ is Lipschitz continuous on (−δ4, δ4) with Lipschitz constant |f∗|1 = ξ.
(v.3) For each ε ∈ (−δ4, δ4), f∗(ε) is the unique solution of G(ε, f) = 0 in Nδ3(fL⋆,P,λ).
(v.4) ∇B f∗(0)(u) = ( ∇B2 G(0, fL⋆,P,λ) )⁻¹ ( −∇B1 G(0, fL⋆,P,λ)(u) ), u ∈ (−δ4, δ4).

The function f∗ is the same as the one in Theorem 3.

⁵ Partial Bouligand-derivatives of f are denoted by ∇B1 f, ∇B2 f, ∇B2,2 f := ∇B2(∇B2 f), etc.
The main idea of the proof of Theorem 2 is to consider the map G : R × H → H defined by (10). For ε < 0 the integration is w.r.t. a signed measure. The H-valued expectation that is used in the definition of G is well-defined for all ε ∈ (−δ2, δ2) and all f ∈ Nδ1(fL⋆,P,λ), since κ1 ∈ (0, ∞) by (9) and ‖Φ(x)‖∞ ≤ ‖k‖²∞ < ∞ by (12). Bouligand-derivatives fulfil a chain rule and Fréchet-differentiability implies Bouligand-differentiability. For ε ∈ [0, 1] we thus obtain

G(ε, f) = ∂R^reg_{L⋆,(1−ε)P+εQ,λ} / ∂H (f) = ∇B2 R^reg_{L⋆,(1−ε)P+εQ,λ}(f) .   (13)

The map f ↦ R^reg_{L⋆,(1−ε)P+εQ,λ}(f) is convex and continuous for all ε ∈ [0, 1], and therefore equation (13) shows that G(ε, f) = 0 if and only if f = fL⋆,(1−ε)P+εQ,λ for such ε. Hence

G(0, fL⋆,P,λ) = 0 .   (14)

By using the steps in Remark 5, we will show that Theorem 3 is applicable for G and that there exists a Bouligand-differentiable function ε ↦ fε defined on a small interval (−δ2, δ2) for some δ2 > 0 satisfying G(ε, fε) = 0 for all ε ∈ (−δ2, δ2). Due to the existence of this function, we obtain that BIF(Q; T, P) = ∇B fε(0). Now we can give the proof of our main result.
Proof of Theorem 2. In Section 2 we have seen that fL⋆,P,λ exists and is unique. The assumption that G(0, fL⋆,P,λ) = 0 is valid by (14). We will now prove the results of Remark 5, parts (i) to (v).

Remark 5 part (i). For a fixed f ∈ H, take ε1, ε2 ∈ (−δ2, δ2). Using ‖k‖∞ < ∞ and (14) we obtain

‖E_{(1−ε1)P+ε1Q} ∇B3 L⋆(X, Y, f(X))Φ(X) − E_{(1−ε2)P+ε2Q} ∇B3 L⋆(X, Y, f(X))Φ(X)‖
= ‖(ε1 − ε2) E_{Q−P} ∇B3 L⋆(X, Y, f(X))Φ(X)‖
≤ |ε1 − ε2| ∫ ‖∇B3 L⋆(x, y, f(x))Φ(x)‖ d|Q − P|(x, y)
≤ |ε1 − ε2| ∫ sup_{(x,y)∈(X,Y)} |∇B3 L⋆(x, y, f(x))| sup_{x∈X} ‖Φ(x)‖ d|Q − P|(x, y)
≤ |ε1 − ε2| ‖Φ(x)‖∞ sup_{(x,y)∈(X,Y)} ‖∇B3 L⋆(x, y, · )‖∞ ∫ d|Q − P|(x, y)
≤ 2 ‖k‖²∞ sup_{(x,y)∈(X,Y)} ‖∇B3 L⋆(x, y, · )‖∞ |ε1 − ε2|
= 2 ‖k‖²∞ κ1 |ε1 − ε2| < ∞ .
Remark 5 part (ii). We obtain

∇B1 G(ε, f) = ∇B1 ( E_{(1−ε)P+εQ} ∇B3 L⋆(X, Y, f(X))Φ(X) )
= ∇B1 ( EP ∇B3 L⋆(X, Y, f(X))Φ(X) + ε E_{Q−P} ∇B3 L⋆(X, Y, f(X))Φ(X) )
= E_{Q−P} ∇B3 L⋆(X, Y, f(X))Φ(X)
= EQ ∇B3 L⋆(X, Y, f(X))Φ(X) − EP ∇B3 L⋆(X, Y, f(X))Φ(X) .   (15)
These expectations exist due to (12) and (9). We also have

∇B2 G(0, fL⋆,P,λ)(h) + o(h)
= G(0, fL⋆,P,λ + h) − G(0, fL⋆,P,λ)
= 2λh + EP ∇B3 L⋆(X, Y, fL⋆,P,λ(X) + h(X))Φ(X) − EP ∇B3 L⋆(X, Y, fL⋆,P,λ(X))Φ(X)
= 2λh + EP ( ∇B3 L⋆(X, Y, fL⋆,P,λ(X) + h(X)) − ∇B3 L⋆(X, Y, fL⋆,P,λ(X)) ) Φ(X) .

Since the term ∇B3 L⋆(X, Y, fL⋆,P,λ(X) + h(X)) − ∇B3 L⋆(X, Y, fL⋆,P,λ(X)) is bounded due to (12), (9), and ‖k‖∞ < ∞, this expectation exists. Using ⟨Φ(x), · ⟩H ∈ H for all x ∈ X, we get

∇B2 G(0, fL⋆,P,λ)( · ) = 2λ idH( · ) + EP ∇B3,3 L⋆(X, Y, fL⋆,P,λ(X)) ⟨Φ(X), · ⟩H Φ(X) .   (16)
Note that EP ∇B3,3 L⋆(X, Y, f(X)) = ∇B3 EP ∇B3 L⋆(X, Y, f(X)), because by using the definition of the Bouligand-derivative, we obtain for the partial Bouligand-derivatives

( ∇B3 EP ∇B3 L⋆(X, Y, f(X)) − EP ∇B3,3 L⋆(X, Y, f(X)) )(h)
= EP ( ∇B3 L⋆(X, Y, f(X) + h(X)) − ∇B3 L⋆(X, Y, f(X)) ) − EP ∇B3,3 L⋆(X, Y, f(X))h(X) + o(h)
= EP ( ∇B3 L⋆(X, Y, f(X) + h(X)) − ∇B3 L⋆(X, Y, f(X)) − ∇B3,3 L⋆(X, Y, f(X))h(X) ) + o(h)
= o(h) ,  h ∈ H .
Remark 5 part (iii). Let Nδ1(fL⋆,P,λ) be a δ1-neighborhood of fL⋆,P,λ. Since H is an RKHS and therefore a vector space, we have for all h ∈ Nδ1(fL⋆,P,λ) that ‖fL⋆,P,λ − h − 0‖H ≤ δ1 and thus h − fL⋆,P,λ ∈ Nδ1(0) ⊂ H. Note that ∇B2 G(0, fL⋆,P,λ)( · ) as in (16) is a mapping from H to H. For ξ := h − fL⋆,P,λ we have ‖ξ‖H ≤ δ1, and the reproducing property yields

∇B2 G(0, fL⋆,P,λ)(ξ) = 2λξ + EP ∇B3,3 L⋆(X, Y, fL⋆,P,λ(X)) ξ(X) Φ(X) .

By using (12) and (9) we get

‖2λξ + EP ∇B3,3 L⋆(X, Y, fL⋆,P,λ(X)) ξ(X) Φ(X) − 0‖H
≤ 2λ ‖ξ‖H + ‖EP ∇B3,3 L⋆(X, Y, fL⋆,P,λ(X)) ξ(X) Φ(X)‖H
≤ 2λ ‖ξ‖H + sup_{(x,y)∈(X,Y)} ‖∇B3,3 L⋆(x, y, · )‖∞ ‖ξ‖∞ ‖Φ(x)‖∞
≤ 2λ ‖ξ‖H + κ2 ‖ξ‖H ‖k‖³∞
≤ ( 2λ + κ2 ‖k‖³∞ ) δ1 ,

and thus ∇B2 G(0, fL⋆,P,λ)(h − fL⋆,P,λ) is in a neighborhood of 0 ∈ H for all h ∈ Nδ1(fL⋆,P,λ).
Remark 5 part (iv). Due to (16) we have to prove that

d0 := inf_{f1 ≠ f2} ‖2λ(f1 − f2) + EP ∇B3,3 L⋆(X, Y, fL⋆,P,λ(X)) (f1 − f2)(X) Φ(X)‖H / ‖f1 − f2‖H > 0 .

For f1 ≠ f2, the assumption λ > ½ κ2 ‖k‖³∞, (12), and (9) yield

‖2λ(f1 − f2) + EP ∇B3,3 L⋆(X, Y, fL⋆,P,λ(X)) (f1 − f2)(X) Φ(X)‖H / ‖f1 − f2‖H
≥ ( ‖2λ(f1 − f2)‖H − ‖EP ∇B3,3 L⋆(X, Y, fL⋆,P,λ(X)) (f1 − f2)(X) Φ(X)‖H ) / ‖f1 − f2‖H
≥ 2λ − κ2 ‖k‖³∞ > 0 ,
which gives the assertion.

Remark 5 part (v). The assumptions of Robinson's implicit function theorem, see Theorem 3, are valid for G due to the results of Remark 5 parts (i) to (iv) and the assumption that ∇B2 G(0, fL⋆,P,λ) is strong. This gives part (v).

The result of Theorem 2 now follows from inserting (15) and (16) into Remark 5 part (v.4). Due to (9) and (7) it is clear that S is bounded. The linearity of S follows from its definition, and the inverse of S exists by Theorem 3. If necessary, we can restrict the range of S to S(H) to obtain a bijective function S∗ : H → S(H) with S∗(f) = S(f) for all f ∈ H. Theorem 4 then gives us that S⁻¹ is also bounded and linear. This yields the existence of a bounded BIF specified by (11). □
References

Christmann, A. and Steinwart, I. (2004). On robust properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, 1007–1034.
Christmann, A. and Steinwart, I. (2007). Consistency and robustness of kernel based regression in convex risk minimization. Bernoulli, 13, 799–819.
Christmann, A. and Steinwart, I. (2008). Consistency of kernel based quantile regression. Applied Stochastic Models in Business and Industry, 24, 171–183.
Christmann, A. and Van Messem, A. (2008). Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9, 915–936.
Christmann, A., Van Messem, A., and Steinwart, I. (2009). On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2, 311–327.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
Hampel, F. R. (1968). Contributions to the theory of robust estimation. Unpublished Ph.D. thesis, Dept. of Statistics, University of California, Berkeley.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383–393.
Hosking, J. and Wallis, J. (1989). Parameter and quantile estimation for the generalized Pareto distribution. Technometrics, 29, 339–349.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the 5th Berkeley Symposium, 1, 221–233.
Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20.
Koenker, R. (2005). Quantile Regression. Cambridge University Press, New York.
Lax, P. D. (2002). Functional Analysis. Wiley & Sons, New York.
Pickands, J. (1975). Statistical inference using extreme order statistics. Annals of Statistics, 3, 119–131.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Robinson, S. M. (1987). Local structure of feasible sets in nonlinear programming, Part III: Stability and sensitivity. Mathematical Programming Study, 30, 45–66.
Robinson, S. M. (1991). An implicit-function theorem for a class of nonsmooth functions. Mathematics of Operations Research, 16, 292–309.
Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, Massachusetts.
Steinwart, I. and Christmann, A. (2008a). How SVMs can estimate quantiles and the median. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, Massachusetts.
Steinwart, I. and Christmann, A. (2008b). Support Vector Machines. Springer, New York.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley & Sons, New York.