ESTIMATION BY FITTING GENERALIZED MOMENTS
JOHANNES LEITNER AND GRIGORY TEMNOV
Abstract. We propose a new estimation method by minimizing a centralized and normalized distance based on generalized (exponential) moments. An asymptotic sensitivity analysis shows that the proposed estimator belongs to the class of minimal contrast estimators. The approach can be used for model selection problems. As an example, we apply the estimator to generalized Pareto distributed operational risk data.
Keywords: Higher order minimum contrast estimator, goodness-of-fit, operational risk,
generalized Pareto distribution.
MSC: 62F10, 62F12, 62-07.
1. Introduction
In this paper, we consider a generalized moments goodness-of-fit type of approach. As
a special case, we propose to work with generalized (exponential) moments in addition to
using standard (higher order) moments, leading to a goodness-of-fit criterion of the Laplace
transform based on fitting points on a grid as well as fitting (higher order) derivatives at zero.
The main idea is simple: Instead of estimating a parameter by minimizing a distance measure,
we argue that one should rather estimate by minimizing the p-value of the observed distance,
since a smaller distance does not necessarily mean a better p-value. The latter being typically
Date: April 12, 2015.
Johannes Leitner: Research Unit for Financial and Actuarial Mathematics, Institute for Mathem. Economics, Vienna University of Technology, Wiedner Hauptstraße 8-10/105-1, A–1040 Vienna, Austria.
Grigory Temnov: School of Mathematical Sciences, University College Cork. Aras na Laoi, UCC Main
Campus, Cork, Co. Cork, Ireland.
Emails: [email protected], [email protected] .
Corresponding author is G.Temnov (phone: +353-0861659315).
This work was financially supported by the Christian Doppler Research Association (CDG). The authors
gratefully acknowledge a fruitful collaboration and continued support by Bank Austria through CDG.
computationally too expensive, we propose to minimize a centralized and normalized distance
based on generalized (exponential) moments.
The present paper has several objectives and consists of two parts that, to some extent, can be read independently. In the theoretical part we argue that the use of MLE and classical
moment fit estimators is not satisfactorily justified for small sample sizes and we propose a
new type of estimator that also turns out to be asymptotically efficient. In our opinion it is not enough to demonstrate asymptotic efficiency of an estimator: since sample sizes in practice are notoriously small, the performance of almost any estimator has to be checked case by case for realistic sample sizes (general results seem out of reach). In the applied part of the paper we show how to
use our approach in the case of generalized Pareto distributions and demonstrate its superior
performance for small sample sizes compared to MLE and classical moment fit estimators. As
performance measures we look at bias, variance and quantiles of estimators. Given a model,
our approach can be used for parametric estimation. Given a nested sequence of models, our
approach also contains a discrimination feature allowing, in addition to parametric estimation,
selection of the most parsimonious model that cannot be rejected at a given confidence level.
We start by discussing the connection of the Maximal Entropy and Maximum Likelihood Estimation principles and various information criteria in Section 2. Our line of argument leads to a
generalized moment goodness-of-fit approach, which is considered in Section 3. In Section 4 a
sensitivity analysis provides a bound on the asymptotic precision of the proposed estimator,
which turns out to be a minimum contrast estimator.
It is beyond the scope of the present paper to study a nested sequence of models or to apply
the approach in a high dimensional setting and carry out a full statistical analysis in these
situations. However, as an example and proof of concept, in Subsection 5.2 we apply our
approach to heavy-tailed (Generalized Pareto distributed) data from operational risk. The
topic of model selection is discussed in Subsection 5.3.
2. Maximal Entropy and Maximum Likelihood Estimation
Before proposing our generalized moment fit estimation approach in Section 3, let us start
by providing some motivation by looking at some popular alternative estimation methods.
One intensely studied estimation approach is the principle of maximum entropy (MAXENT), going back to work by Jaynes (1957) (or even Boltzmann and Gibbs) in statistical physics. See also Jaynes (2003) and e.g. Kapur and Kesavan (1992) and Kapur (1994) for
applications. Despite its wide applicability, MAXENT is still controversial. See e.g. Jaynes
(1979), Jaynes (1981), Jaynes (1982), Grendar and Grendar (2001) and Uffink (1995) for
discussions of MAXENT. Furthermore, in some cases it is known that there does not exist
a density in exponential form satisfying the desired moment constraints, see e.g. Wragg and
Dowson (1970), making it impossible to apply MAXENT.
It is a key assumption of MAXENT that (generalized) moments of the unknown distribution
are already known and the idea is to maximize entropy under these moment constraints. This
can realistically be assumed for many statistical mechanics models (as well as the validity
of asymptotic approximations if the number of particles is large enough), where e.g. the
temperature can be observed by an (arbitrarily often repeatable) experiment. In contrast,
for non-Bayesian statistical problems, where there is typically only one set of observed data
available, the moments are not known! It is tempting to use observed empirical (generalized) means for the constraints instead, so that one can apply MAXENT. However, it is not clear
why one should restrict the class of candidate models to just those models that have theoretical generalized means equal to the observed empirical ones (which are necessarily subject to statistical fluctuations). See also Uffink (1996) for the
problem of choosing constraints for MAXENT.
On the other hand, it is also well known that using empirical means as constraints for MAXENT corresponds to maximum likelihood estimation (MLE) w.r.t. a certain parameterization,
see e.g. Good (1963), Theorem 4 or Campbell (1970).
If one accepts the MLE principle, as a next step one has to decide on a parameterization
(or equivalently which (generalized) empirical moments to use as constraints for MAXENT).
Given a nested sequence of parameterizations, the maximum likelihood value will necessarily not decrease as the number of parameters increases, and typically introducing more parameters will strictly increase the likelihood. If for a finite probability space the generalized moments
eventually form a complete orthogonal system, then the distribution estimated by MLE or
MAXENT will just equal the empirical distribution. It has been observed by Akaike (1973) that the log-likelihood for an observed sample provides an estimator for the entropy. Unfortunately, this estimator is biased, and increasing the number of parameters increases the bias (see below).
The problem is that there seems to be no natural way to choose the number of parameters or the approximation order. This is similar to Fourier approximation approaches, where it is known that the order of approximation must not be chosen too high, see Crain (1973) and Kronmal and Tarter (1968).
In the next subsection we are going to consider the MAXENT and MLE approach for
exponential families in more detail.
2.1. Exponential Families. Let P be a Borel probability measure on Rn, n ≥ 1, and let G = (Gj)1≤j≤m : Rn → Rm, m ≥ 1, be a measurable function. Assume that the set A of all α ∈ Rm such that exp(−α·G) is integrable w.r.t. P, where \(\alpha\cdot G := \sum_{j=1}^m \alpha_j G_j\), has a non-empty interior. (Clearly A is always convex and contains the origin, but 0 ∈ ∂A is not excluded.) In order to exclude trivial cases, assume also that α·G is P-a.s. constant iff α = 0 (otherwise we could reduce the number of dimensions after a change of coordinate system).
For α ∈ A define a Borel probability measure Pα on Rn by
\[
\frac{dP_\alpha}{dP} = \exp\bigl(-Z(\alpha) - \alpha\cdot G\bigr), \tag{1}
\]
where \(Z(\alpha) := \log\bigl(\int \exp(-\alpha\cdot G)\,dP\bigr)\) is a normalizing constant. Denote the expectation operator w.r.t. Pα by Eα[·], so that E[·] = E0[·], and similarly for the (co-)variance.
Denote by \(L^\alpha := L^\alpha_G\) the Laplace transform of G w.r.t. Pα and set \(L := L_G := L^0_G\). We have \(1 = E\bigl[\frac{dP_\alpha}{dP}\bigr] = \exp(-Z(\alpha))\,E[\exp(-\alpha\cdot G)] = \exp(-Z(\alpha))\,L(\alpha)\), i.e. the normalizing function Z equals log L on A. Furthermore, note that for α, γ ∈ Rm with α, α + γ ∈ A we have
\[
L^\alpha(\gamma) = E_\alpha\bigl[\exp(-\gamma\cdot G)\bigr]
= E\bigl[\exp\bigl(-Z(\alpha) - (\alpha+\gamma)\cdot G\bigr)\bigr]
= \exp\bigl(-Z(\alpha)\bigr)L^0(\alpha+\gamma)
= \frac{L^0(\alpha+\gamma)}{L^0(\alpha)}.
\]
For all α ∈ A, we have
\[
\nabla Z(\alpha) = -\frac{\int G\exp(-\alpha\cdot G)\,dP}{\int \exp(-\alpha\cdot G)\,dP}
= -\int G\exp\bigl(-Z(\alpha) - \alpha\cdot G\bigr)\,dP = -E_\alpha[G], \tag{2}
\]
and
\[
I_Z(\alpha) := \Bigl(\frac{\partial^2 Z}{\partial\alpha_j\,\partial\alpha_k}(\alpha)\Bigr)_{1\le j,k\le m}
= \Bigl(\int G_j G_k \exp\bigl(-Z(\alpha) - \alpha\cdot G\bigr)\,dP - \frac{\partial Z}{\partial\alpha_j}(\alpha)\frac{\partial Z}{\partial\alpha_k}(\alpha)\Bigr)_{1\le j,k\le m}
= \bigl(\mathrm{Cov}_\alpha[G_j, G_k]\bigr)_{1\le j,k\le m},
\]
implying that Z is a convex function, since the covariance matrix of G is always positive semi-definite. Note also that IZ(α) equals the Fisher information matrix of Pα.
Let ω = (ωi)1≤i≤N be an i.i.d. sample drawn under the law Pα for some α ∈ A. Define \(\bar{G} = (\bar{G}_j)_{1\le j\le m} := \bigl(N^{-1}\sum_{i=1}^N G_j(\omega_i)\bigr)_{1\le j\le m}\), i.e. \(\bar{G}\) describes the empirical generalized moments corresponding to G.
Assume now that there exists a (unique) maximum likelihood estimator (MLE) \(\hat\alpha \in A\) for ω. Denoting the log-likelihood function by
\[
LL(\alpha) := LL(\alpha, \omega) := N^{-1}\log\Bigl(\prod_{i=1}^N \frac{dP_\alpha}{dP}(\omega_i)\Bigr) = -Z(\alpha) - \alpha\cdot\bar{G}, \tag{3}
\]
this is equivalent to the existence of a (unique) \(\hat\alpha \in A\) (A open) with
\[
-\nabla Z(\hat\alpha) = \bar{G}, \tag{4}
\]
i.e. \(\hat\alpha \in A\) is a MLE for ω iff empirical generalized moments equal implied theoretical ones:
\[
E_{\hat\alpha}[G] = \int G\,dP_{\hat\alpha} = \bar{G}. \tag{5}
\]
On the other hand, we find for the relative entropy of Pα w.r.t. P
\[
H(P_\alpha|P) := \int \frac{dP_\alpha}{dP}\,\log\Bigl(\frac{dP_\alpha}{dP}\Bigr)\,dP
= \int \bigl(-Z(\alpha) - \alpha\cdot G\bigr)\,dP_\alpha
= -Z(\alpha) - \alpha\cdot E_\alpha[G]. \tag{6}
\]
Clearly, maximizing H(Pα|P) under the (generalized) moment constraints \(E_\alpha[G] = \bar{G}\) leads to the same estimator(s) \(\hat\alpha\) as the MLE approach.
However, having estimated \(\hat\alpha\) by MAXENT or MLE, the theoretical covariance matrix of G is given by \(I_Z(\hat\alpha)\), and it is not guaranteed that \(I_Z(\hat\alpha)\) approximately equals the empirical covariance matrix calculated from ω. Since under regularity conditions (e.g. Carleman's condition) a distribution is uniquely characterized by its moments, one could naively think that the more moments of two distributions coincide, the more similar the distributions are; in this logic it is tempting to extend the model by including moment constraints on \(\tilde{G} = (G_iG_j)_{1\le i\le j\le m}\) and to apply MAXENT to \((G, \tilde{G})\). Provided that solutions to the minimal relative entropy problems exist, this procedure can be repeated as often as desired. Unfortunately, it is not guaranteed that this will always lead to improved estimators. First of all, there is no guarantee that the higher moments of G not yet included in \(\tilde{G}\) are fitted well. For example, in the case of contingency tables the distribution is determined by a finite number of moments; thus, fitting enough moments, the moment fit estimator eventually corresponds to the empirical distribution. While this might not be too bad an estimator for small (well-filled) tables (if the independence hypothesis has to be rejected), for multivariate non-binary tables the possibly exorbitantly large number of cells results, even for quite large sample sizes, in an extremely sparsely filled observed frequency table. In this situation some kind of smoothing of the empirical distribution (putting some probability mass on empty cells in the neighborhood of filled ones) becomes necessary.
A second reason is as follows: if ω is an i.i.d. sample drawn under the law Pα (for α known), then \(\hat{H}(\alpha) := -Z(\alpha) - \alpha\cdot\bar{G}\) is an unbiased estimator for H(Pα|P) (which is even known to possess good asymptotic properties). However, since we have estimated \(\hat\alpha\) from \(\bar{G}\), \(\hat{H}(\hat\alpha)\) is no longer guaranteed to be unbiased; even worse, the bias is known in special cases to increase with the model dimension m! (It then becomes necessary to provide a method for choosing the model dimension.) Adjusting for this bias and maximizing the bias-corrected
estimated entropy led to the introduction of several information criteria (IC). To mention
just a few, see Akaike (1973) for the famous AIC, Schwarz (1978) for BIC, Hurvich and Tsai
(1989) and Hurvich et al. (1990) for the corrected AIC, and Konishi and Kitagawa (1996)
for generalized ICs. Asymptotic formulas for the entropy estimation bias have also been
derived for finite probability spaces by Basharin (1959) (see also Clough (1964)). Konishi
and Kitagawa (1996) derive general expressions for the bias (under technical conditions)
for continuous distributions. The intuition provided in Smith (1987) might be helpful for understanding the AIC. See also Claeskens and Hjort (2008).
A serious drawback of IC approaches is that the derived biases vary strongly depending on which assumptions have been made about the underlying distribution in the model class. Only in some cases can the ICs be justified in certain asymptotic Bayesian settings, see e.g. Schwarz (1978) and Akaike (1978). In any case, the above IC methods (as well as the MLE approach) seem to be justified only in an asymptotic sense and/or under quite strong additional assumptions.
The question remains what to do in a high-dimensional setting for a medium sample size, i.e. one that is not asymptotically large relative to the dimension of the model.
In fact there is a third interpretation of the MAXENT/MLE approach, which leads to our Laplace transform fitting approach: estimating α under the generalized moment constraints \(E_\alpha[G] = \bar{G}\) means that the Laplace transform \(L^\alpha_G\) of G w.r.t. Pα approximates the empirical Laplace transform \(\bar{L}_G(\gamma) := N^{-1}\sum_{i=1}^N \exp\bigl(-\gamma\cdot G(\omega_i)\bigr)\) near γ = 0, since \(\nabla\bar{L}_G(0) = -\bar{G} = -E_{\hat\alpha}[G] = \nabla L^{\hat\alpha}_G(0)\). In this light, it makes sense to give up the perfect first order fit of the Laplace transform at the origin and to try instead to estimate α by approximating \(\bar{L}_G\) with \(L^\alpha_G\) using a generalized moment goodness-of-fit approach. In particular, if the tail behavior of the estimated distribution is of concern (as e.g. in many risk management applications), one might ask whether an estimator that does not enforce a perfect fit of the mean of G, but instead minimizes some goodness-of-fit criterion for the Laplace transform, might lead to improved results (w.r.t. some performance measure, e.g. bias, variance, quantiles, etc., see Pfanzagl (1994)).
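The identity \(\nabla\bar{L}_G(0) = -\bar{G}\) is easy to verify numerically in the scalar case. A small sketch (our own illustration; the sample values and the step size are arbitrary choices):

```python
import math

def empirical_laplace(sample, gamma):
    """L_bar_G(gamma) = N^-1 * sum_i exp(-gamma * G(omega_i)), scalar case G(x) = x."""
    n = len(sample)
    return sum(math.exp(-gamma * x) for x in sample) / n

def empirical_laplace_slope_at_zero(sample, h=1e-6):
    """Central difference for d L_bar_G / d gamma at 0; should equal -mean(sample)."""
    return (empirical_laplace(sample, h) - empirical_laplace(sample, -h)) / (2 * h)

sample = [0.2, 1.1, 0.7, 2.3, 0.5]
slope = empirical_laplace_slope_at_zero(sample)
mean = sum(sample) / len(sample)
# The gradient of the empirical Laplace transform at 0 equals minus the sample
# mean, which is exactly the first order constraint enforced by MLE/MAXENT.
assert abs(slope + mean) < 1e-4
```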
A naive approach to this method could be to minimize a (weighted) sum (or integral) over
the square of the (relative) differences of theoretical and empirical Laplace transform. For
more sophisticated approaches, leading to good asymptotic properties of the estimators, see
e.g. Hansen (1982) or Knight and Satchell (1997). Being interested in the non-asymptotic case, the problem here is that the distribution of the quantity we use to measure the goodness of fit will in general depend on the (yet to be estimated) parameter α, rendering the comparison of deviations for different values of α difficult. Ideally, one would like to compare the p-values of the deviations for different α. This approach can be justified as follows:
Given the observations G, in fact we have also observed all (mixed) polynomials (and functions) in the components of G. This allows us to construct an increasing nested sequence of models by including ever higher moments. (Here we face a problem of choice regarding the polynomials used and the order in which new polynomials are added.) We can expect that the most parsimonious models are not capable of explaining the observed data at reasonable confidence levels, allowing us to reject them. Proceeding in our nested sequence to models with more parameters, we will eventually reach a rich enough model class which cannot be rejected anymore. For this model class, parametric estimation is done by minimizing the p-value of the deviation. Parameter choices with high p-values will again be rejected. Those parameters that have the lowest p-value in this model class are the hardest to reject as the rejection level is lowered, leaving us eventually with the parameters corresponding to the minimal p-value (if unique). Furthermore, if the p-value increases steeply around its absolute minimum, the area of non-rejected parameters will be small, letting us hope for a high accuracy of the resulting estimator. Note that model size selection, parametric estimation and back-testing are then carried out in one go! Admittedly, our arguments are heuristic; however, they are not based on asymptotic arguments (even though it turns out that the estimator is asymptotically efficient).
For MLE it could happen that the MLE estimator has to be rejected if back-tested by considering the p-value of the observed likelihood, even though there might exist another parameter choice (with sub-optimal likelihood) whose lower p-value of the observed likelihood no longer allows rejection. Comparing likelihoods (or deviations) for different parameter choices means, in a sense, comparing apples with pears! Without knowing at which sample size the asymptotic justifications for MLE or standard moment fit estimators start being valid, we do not understand the meaning of these estimators.
However, optimization of a p-value that has to be calculated by Monte Carlo simulation (if the distribution is not explicitly known, as is typically the case) is computationally extremely expensive.
In the following we are going to consider a deviation measure Xα that is comparable for different values of α, in the sense that Xα has vanishing expectation and variance 1 w.r.t. Pα for all α ∈ A.
3. Generalized Moment Fit
Let Ω := (Ω, F, P) be a probability space and let A ⊆ Rm, 1 ≤ m < ∞, be a set containing the origin and uniquely parameterizing a family of probability measures on F such that Pα ∼ P for all α ∈ A and P0 = P. Denote the expectation operator w.r.t. Pα by Eα, and similarly for the (co-)variances. For simplicity of notation, we do not explicitly distinguish between Pα on Ω and the corresponding probability measure on the N-fold product space ΩN, N ≥ 1.
Set [K] := {1, . . . , K} for 1 ≤ K < ∞. For a random variable g on Ω, the random variable \(\bar{g}\) on ΩN is defined as \(\bar{g}(\omega) := N^{-1}\sum_{i\in[N]} g(\omega_i)\) for all (i.i.d. sequences) ω = (ωi)i∈[N] ∈ ΩN.
Let g = (gk)k∈[K] be a sequence of random variables on Ω, square integrable w.r.t. Pα for all α ∈ A, and set \(\mu^\alpha := (\mu^\alpha_k)_{k\in[K]} := (E_\alpha[g_k])_{k\in[K]}\). For the sake of simplicity we assume the components of g to be linearly independent (as elements of L2(P)).
Given an i.i.d. sample ω = (ωi)i∈[N] ∈ ΩN, drawn under the law of Pα0 for some unknown α0 ∈ A, a naive way to estimate α0 could be to minimize the weighted sum \(D^\alpha(\omega) := \sum_{k\in[K]} w_k\bigl(\bar{g}_k(\omega) - \mu^\alpha_k\bigr)^2\) and use an optimal α as an estimator for α0.
There are several problems with this approach:
(1) If we can achieve a perfect fit, i.e. if there exists an α ∈ A such that \(\bar{g}_k(\omega) = \mu^\alpha_k\) for all k ∈ [K], then the p-value of \(D^\alpha(\omega) = 0\) w.r.t. Pα vanishes, a typically extremely unlikely outcome, rather indicating serious over-fitting than a good fit. Only in an asymptotic sense does this choice seem to be justified. For small sample sizes it is not clear why an α ∈ A with perfect fit should be a good estimator, especially if the model size is not known, since arbitrarily many moments can be fitted with enough parameters.
(2) If we cannot achieve a perfect fit, it is not clear how to compare \(D^\alpha(\omega)\) for different α ∈ A, since the p-value of \(D^\alpha(\omega)\) w.r.t. Pα might be higher than the p-value of \(D^{\alpha'}(\omega)\) w.r.t. \(P_{\alpha'}\) even though \(D^{\alpha'}(\omega) < D^\alpha(\omega)\) holds.
(3) How should one choose the weights? Is there a good way to do this?
To sum up, except for the asymptotic case we find the line of reasoning leading to standard moment fit approaches (or MLE) difficult to follow. If, on top of the parametric estimation problem, a model (size) selection problem has to be handled, both approaches are not satisfactory. Artificially splitting up model selection, parametric estimation and back-testing into three separate parts rather hides the true difficulty of statistical reasoning. Except for the cases where there is no model selection problem (e.g. because there are strong arguments for a certain model class derived from laws of physics), we always have to face at least a variable selection problem. In particular, for econometric applications based on high-dimensional (balance sheet) data, variable and model selection is essential.
We are going to propose a way to deal with these problems, at least to some extent. More generally, we consider the following quadratic goodness-of-fit measure Dα for α ∈ A:
\[
D^\alpha := \|\bar{g} - \mu^\alpha\|^2_\Gamma := (\bar{g} - \mu^\alpha)^t\,\Gamma\,(\bar{g} - \mu^\alpha), \tag{7}
\]
where Γ is a symmetric positive definite matrix, so that Dα ≥ 0 holds for all α ∈ A on ΩN, and Dα = 0 iff P0-a.s. \(\bar{g} = \mu^\alpha\). (It is possible to let Γ depend on α as well; however, in order to keep formulas simple, Γ is chosen constant here. For example, Γ could be chosen in such a way that w.r.t. Pα the asymptotic distribution of Dα does not depend on α ∈ A.)
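In the simplest case Γ is diagonal and (7) reduces to a weighted sum of squared deviations between empirical and theoretical moments. A minimal sketch of the quadratic form (our own illustration; the moment values and weights are arbitrary):

```python
def quadratic_fit_distance(g_bar, mu_alpha, gamma):
    """D^alpha = (g_bar - mu^alpha)^t Gamma (g_bar - mu^alpha) for a KxK matrix Gamma."""
    k = len(g_bar)
    diff = [g_bar[i] - mu_alpha[i] for i in range(k)]
    return sum(diff[i] * gamma[i][j] * diff[j] for i in range(k) for j in range(k))

# Two generalized moments, identity weighting (illustrative numbers):
g_bar = [1.2, 2.9]         # empirical moments
mu_alpha = [1.0, 3.0]      # theoretical moments under P_alpha
gamma = [[1.0, 0.0], [0.0, 1.0]]

d = quadratic_fit_distance(g_bar, mu_alpha, gamma)
assert abs(d - 0.05) < 1e-9   # 0.2**2 + (-0.1)**2
```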
By including (appropriately transformed and scaled) higher order moments in g, which
are dominated by the extreme observations, we can hope that this approach even works for
extreme value distributions.
We also assume throughout the paper that Dα is not deterministic and square integrable, i.e. Varα[Dα] > 0 for all α ∈ A.
In order to make Dα better comparable for different α ∈ A, we propose to base estimation on minimization of the centralized and normalized statistic
\[
X^\alpha := \frac{D^\alpha - E_\alpha[D^\alpha]}{\sqrt{\mathrm{Var}_\alpha[D^\alpha]}}. \tag{8}
\]
For α ∈ A denote by Fα the cdf of Xα w.r.t. Pα and define its pseudo-inverse by \(F^\leftarrow_\alpha(u) := \inf\{t \in \mathbb{R} \mid F_\alpha(t) \ge u\}\) for u ∈ (0, 1). Note that Eα[Xα] = 0 and Varα[Xα] = 1 for all α ∈ A, and in this sense comparing Xα for different values of α ∈ A now means at least comparing different types of apples.
We still have the problem that Xα(ω) and its corresponding p-value w.r.t. Pα are not guaranteed to be in a co-monotone relationship; however, the situation has at least not got worse, since we now have the following general result, which is an immediate consequence of Chebyshev's inequality:

Lemma 3.1. Let Xα, α ∈ A, be a family of random variables such that for all α ∈ A, Xα has vanishing expectation and variance equal to 1 w.r.t. Pα. For all p ∈ (0, 1), the upper p-quantiles of Xα w.r.t. Pα, α ∈ A, are then uniformly bounded from above by \(\sqrt{p^{-1}}\); i.e., for all α ∈ A and p ∈ (0, 1) we have \(F^\leftarrow_\alpha(1-p) \le \sqrt{p^{-1}}\).
Proof. The assertion clearly holds if \(F^\leftarrow_\alpha(1-p) \le 0\). Assume \(F^\leftarrow_\alpha(1-p) > 0\). By the definition of \(F^\leftarrow_\alpha\) we have \(F_\alpha\bigl(F^\leftarrow_\alpha(1-p) - \varepsilon\bigr) < 1-p\) for all ε > 0, or equivalently \(p < 1 - F_\alpha\bigl(F^\leftarrow_\alpha(1-p) - \varepsilon\bigr) = P_\alpha\bigl(X^\alpha > F^\leftarrow_\alpha(1-p) - \varepsilon\bigr)\) for all ε > 0. Hence, by Chebyshev's inequality, \(p \le P_\alpha\bigl(X^\alpha \ge F^\leftarrow_\alpha(1-p)\bigr) \le P_\alpha\bigl(|X^\alpha| \ge F^\leftarrow_\alpha(1-p)\bigr) \le \bigl(F^\leftarrow_\alpha(1-p)\bigr)^{-2}\), since Xα has vanishing expectation and variance 1 w.r.t. Pα. □
Remark 3.1. Similarly, the one-sided version of Chebyshev's inequality gives the stronger bound \(F^\leftarrow_\alpha(1-p) \le \sqrt{p^{-1} - 1}\).
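Lemma 3.1 and Remark 3.1 can be checked numerically for any standardized variable; e.g. for a centered exponential variable the empirical upper p-quantile stays well below both bounds. A quick sketch (our own check, not from the paper):

```python
import math
import random

random.seed(1)

# Standardized variable: X = E - 1 with E ~ Exp(1), so E[X] = 0, Var[X] = 1.
n = 200_000
xs = sorted(random.expovariate(1.0) - 1.0 for _ in range(n))

p = 0.04
upper_quantile = xs[int((1 - p) * n)]        # empirical (1-p)-quantile

# Lemma 3.1: F^<-(1-p) <= sqrt(1/p);  Remark 3.1: F^<-(1-p) <= sqrt(1/p - 1).
assert upper_quantile <= math.sqrt(1.0 / p)        # bound 5.0
assert upper_quantile <= math.sqrt(1.0 / p - 1.0)  # bound ~4.9
```

For this distribution the true quantile is about 2.2, so the Chebyshev bounds are loose, as the text below anticipates.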
The situation has in fact improved: if we are willing to accept an estimator α only if the p-value of Xα(ω) w.r.t. Pα is smaller than or equal to, say, p0 = 0.04, then we can exclude all α ∈ A with \(X^\alpha(\omega) > \sqrt{p_0^{-1}} = 5\) from our search. (Parameters α ∈ A with \(X^\alpha(\omega) < -\sqrt{p_0^{-1}}\) indicate over-fitting, and it is in this case recommended to use a more parsimonious model class. For real-world data, we expect the true mechanism by which the observed data were generated to be more complicated than our model, and our aim is only to find parameters for a parsimonious model that cannot be rejected at a reasonable confidence level. Just as over-dispersion is more often observed than under-dispersion, we therefore do not expect over-fitting to occur easily.)
This fact does not help us with the numerical minimization itself, which we still have to carry out using a gradient search or an exhaustive grid search algorithm; rather, it helps us with the problem of variable selection, or with determining the maximal order up to which (mixed) moments of variables should be included in the model, see Subsection 5.3. The set \(\{\alpha \in A \mid X^\alpha(\omega) \le \sqrt{p_0^{-1}}\}\) can also be interpreted as a confidence region at level p0. However, since this region is derived from Chebyshev's inequality without using any further properties of the distribution of g, we do not expect it to be a particularly small confidence region; it is well possible that much better confidence regions could be found using specific properties of the distribution class under consideration. Note that it can also happen that the p-value corresponding to an α ∈ A with \(X^\alpha(\omega) < \sqrt{p_0^{-1}}\) is still a lot worse than p0. Therefore, we are rather interested in the point estimator minimizing Xα and its asymptotic properties. Note, however, that as for MLE and classical moment fit estimation, we cannot escape the problem of local minima.
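The exclusion rule above translates directly into a grid search: evaluate Xα over a parameter grid and discard every point exceeding √(1/p0) before any refinement. A schematic sketch (our own; `x_statistic` stands for a user-supplied routine computing Xα(ω) from the data, and the toy statistic below is purely illustrative):

```python
import math

def screen_grid(grid, x_statistic, p0=0.04):
    """Keep only grid points alpha with X^alpha(omega) <= sqrt(1/p0);
    by Chebyshev's inequality, every discarded point is rejectable at level p0.
    Returns the survivor with the smallest X^alpha (or None if all are rejected)."""
    bound = math.sqrt(1.0 / p0)
    kept = [(a, x_statistic(a)) for a in grid]
    kept = [(a, x) for a, x in kept if x <= bound]
    return min(kept, key=lambda ax: ax[1]) if kept else None

# Toy statistic with a minimum at alpha = 2 (illustrative only):
toy_x = lambda a: (a - 2.0) ** 2 - 1.0
best = screen_grid([i * 0.5 for i in range(11)], toy_x)
assert best is not None and abs(best[0] - 2.0) < 1e-12
```

In practice the surviving point would serve as the starting value for a subsequent gradient search, which, as noted above, may still be caught by local minima.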
4. Sensitivity
In this section we are going to study the precision of the estimator given by the optimal
α ∈ A minimizing Xα. In order to do this we calculate the (perturbed) expectations and
variances of Dα.
For α, α̃ ∈ A, 2 ≤ r ≤ 4 and k = (k1, . . . , kr) ∈ [K]^r define
\[
C_k(\alpha, \tilde\alpha) := E_{\tilde\alpha}\Bigl[\prod_{s=1}^r \bigl(g_{k_s} - \mu^\alpha_{k_s}\bigr)\Bigr],
\quad\text{resp.}\quad
\bar{C}_k(\alpha, \tilde\alpha) := E_{\tilde\alpha}\Bigl[\prod_{s=1}^r \bigl(\bar{g}_{k_s} - \mu^\alpha_{k_s}\bigr)\Bigr], \tag{9}
\]
and \(C^\alpha_k := C_k(\alpha, \alpha)\), resp. \(\bar{C}^\alpha_k := \bar{C}_k(\alpha, \alpha)\), so that \((C^\alpha_k)_{k\in[K]^2}\), resp. \((\bar{C}^\alpha_k)_{k\in[K]^2}\), equals the covariance matrix of g, resp. \(\bar{g}\), w.r.t. Pα. As we are going to see, in order to calculate \(E_{\tilde\alpha}[D^\alpha]\) and \(\mathrm{Var}_{\tilde\alpha}[D^\alpha]\), it suffices to determine \(C_k(\alpha, \tilde\alpha)\), resp. \(\bar{C}_k(\alpha, \tilde\alpha)\), for k ∈ [K]^r, 2 ≤ r ≤ 4.
For k ∈ [K]² we find
\[
\bar{C}_k(\alpha, \tilde\alpha)
= E_{\tilde\alpha}\Bigl[\prod_{s=1}^2 \bigl((\bar{g}_{k_s} - \mu^{\tilde\alpha}_{k_s}) + (\mu^{\tilde\alpha}_{k_s} - \mu^{\alpha}_{k_s})\bigr)\Bigr]
= E_{\tilde\alpha}\Bigl[\prod_{s=1}^2 \bigl(\bar{g}_{k_s} - \mu^{\tilde\alpha}_{k_s}\bigr)\Bigr] + \prod_{s=1}^2 \bigl(\mu^{\tilde\alpha}_{k_s} - \mu^{\alpha}_{k_s}\bigr)
= \bar{C}^{\tilde\alpha}_k + \prod_{s=1}^2 \bigl(\mu^{\tilde\alpha}_{k_s} - \mu^{\alpha}_{k_s}\bigr),
\]
hence we find for the perturbed (i.e. w.r.t. \(P_{\tilde\alpha}\)) expectation of Dα
\[
E_{\tilde\alpha}[D^\alpha]
= \sum_{k\in[K]^2} \bar{C}_k(\alpha, \tilde\alpha)\,\Gamma_k
= \sum_{k\in[K]^2} \Bigl(\bar{C}^{\tilde\alpha}_k + \prod_{s=1}^2 \bigl(\mu^{\tilde\alpha}_{k_s} - \mu^{\alpha}_{k_s}\bigr)\Bigr)\Gamma_k
= \sum_{k\in[K]^2} \bar{C}^{\tilde\alpha}_k \Gamma_k + \|\mu^{\tilde\alpha} - \mu^\alpha\|^2_\Gamma
= E_{\tilde\alpha}[D^{\tilde\alpha}] + \|\mu^{\tilde\alpha} - \mu^\alpha\|^2_\Gamma, \tag{10}
\]
where \(\|\mu^{\tilde\alpha} - \mu^\alpha\|^2_\Gamma := (\mu^{\tilde\alpha} - \mu^\alpha)^t\,\Gamma\,(\mu^{\tilde\alpha} - \mu^\alpha)\).
In order to calculate the perturbed variance of Dα w.r.t. \(P_{\tilde\alpha}\), we start by calculating \(\bar{C}_k(\alpha, \tilde\alpha)\) for k ∈ [K]⁴:
\[
\begin{aligned}
\bar{C}_k(\alpha, \tilde\alpha)
&= E_{\tilde\alpha}\Bigl[\prod_{s=1}^4 \bigl(\bar{g}_{k_s} - \mu^\alpha_{k_s}\bigr)\Bigr]
= E_{\tilde\alpha}\Bigl[\prod_{s=1}^4 \bigl((\bar{g}_{k_s} - \mu^{\tilde\alpha}_{k_s}) + (\mu^{\tilde\alpha}_{k_s} - \mu^\alpha_{k_s})\bigr)\Bigr] \\
&= \bar{C}^{\tilde\alpha}_k + \prod_{s=1}^4 \bigl(\mu^{\tilde\alpha}_{k_s} - \mu^\alpha_{k_s}\bigr) \\
&\quad + (\mu^{\tilde\alpha}_{k_4} - \mu^\alpha_{k_4})\bar{C}^{\tilde\alpha}_{(k_1,k_2,k_3)}
      + (\mu^{\tilde\alpha}_{k_3} - \mu^\alpha_{k_3})\bar{C}^{\tilde\alpha}_{(k_1,k_2,k_4)} \\
&\quad + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})\bar{C}^{\tilde\alpha}_{(k_1,k_3,k_4)}
      + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})\bar{C}^{\tilde\alpha}_{(k_2,k_3,k_4)} \\
&\quad + (\mu^{\tilde\alpha}_{k_3} - \mu^\alpha_{k_3})(\mu^{\tilde\alpha}_{k_4} - \mu^\alpha_{k_4})\bar{C}^{\tilde\alpha}_{(k_1,k_2)}
      + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{k_4} - \mu^\alpha_{k_4})\bar{C}^{\tilde\alpha}_{(k_1,k_3)} \\
&\quad + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})(\mu^{\tilde\alpha}_{k_4} - \mu^\alpha_{k_4})\bar{C}^{\tilde\alpha}_{(k_2,k_3)}
      + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{k_3} - \mu^\alpha_{k_3})\bar{C}^{\tilde\alpha}_{(k_1,k_4)} \\
&\quad + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})(\mu^{\tilde\alpha}_{k_3} - \mu^\alpha_{k_3})\bar{C}^{\tilde\alpha}_{(k_2,k_4)}
      + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})(\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})\bar{C}^{\tilde\alpha}_{(k_3,k_4)},
\end{aligned}
\]
thus
\[
\begin{aligned}
&\bar{C}_k(\alpha, \tilde\alpha) - \bar{C}_{(k_1,k_2)}(\alpha, \tilde\alpha)\,\bar{C}_{(k_3,k_4)}(\alpha, \tilde\alpha) \\
&= \bar{C}^{\tilde\alpha}_k - \bar{C}^{\tilde\alpha}_{(k_1,k_2)}\bar{C}^{\tilde\alpha}_{(k_3,k_4)}
  + (\mu^{\tilde\alpha}_{k_4} - \mu^\alpha_{k_4})\bar{C}^{\tilde\alpha}_{(k_1,k_2,k_3)}
  + (\mu^{\tilde\alpha}_{k_3} - \mu^\alpha_{k_3})\bar{C}^{\tilde\alpha}_{(k_1,k_2,k_4)} \\
&\quad + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})\bar{C}^{\tilde\alpha}_{(k_1,k_3,k_4)}
      + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})\bar{C}^{\tilde\alpha}_{(k_2,k_3,k_4)} \\
&\quad + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{k_4} - \mu^\alpha_{k_4})\bar{C}^{\tilde\alpha}_{(k_1,k_3)}
      + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})(\mu^{\tilde\alpha}_{k_4} - \mu^\alpha_{k_4})\bar{C}^{\tilde\alpha}_{(k_2,k_3)} \\
&\quad + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{k_3} - \mu^\alpha_{k_3})\bar{C}^{\tilde\alpha}_{(k_1,k_4)}
      + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})(\mu^{\tilde\alpha}_{k_3} - \mu^\alpha_{k_3})\bar{C}^{\tilde\alpha}_{(k_2,k_4)}.
\end{aligned}
\]
For the perturbed variance of Dα we find, by the symmetry of Γ,
\[
\begin{aligned}
\mathrm{Var}_{\tilde\alpha}[D^\alpha] - \mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}]
&= E_{\tilde\alpha}\bigl[(D^\alpha)^2\bigr] - \bigl(E_{\tilde\alpha}[D^\alpha]\bigr)^2 - \mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}] \\
&= \sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\bigl(\bar{C}_{(k,l)}(\alpha, \tilde\alpha) - \bar{C}_k(\alpha, \tilde\alpha)\bar{C}_l(\alpha, \tilde\alpha)\bigr) - \mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}] \\
&= \sum_{k=(k_1,k_2),\,l=(l_1,l_2)\in[K]^2} \Gamma_k\Gamma_l \Bigl(
  (\mu^{\tilde\alpha}_{l_2} - \mu^\alpha_{l_2})\bar{C}^{\tilde\alpha}_{(k_1,k_2,l_1)}
  + (\mu^{\tilde\alpha}_{l_1} - \mu^\alpha_{l_1})\bar{C}^{\tilde\alpha}_{(k_1,k_2,l_2)} \\
&\qquad + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})\bar{C}^{\tilde\alpha}_{(k_1,l_1,l_2)}
        + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})\bar{C}^{\tilde\alpha}_{(k_2,l_1,l_2)} \\
&\qquad + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{l_2} - \mu^\alpha_{l_2})\bar{C}^{\tilde\alpha}_{(k_1,l_1)}
        + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})(\mu^{\tilde\alpha}_{l_2} - \mu^\alpha_{l_2})\bar{C}^{\tilde\alpha}_{(k_2,l_1)} \\
&\qquad + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{l_1} - \mu^\alpha_{l_1})\bar{C}^{\tilde\alpha}_{(k_1,l_2)}
        + (\mu^{\tilde\alpha}_{k_1} - \mu^\alpha_{k_1})(\mu^{\tilde\alpha}_{l_1} - \mu^\alpha_{l_1})\bar{C}^{\tilde\alpha}_{(k_2,l_2)} \Bigr) \\
&= 4\sum_{k,l\in[K]^2} \Gamma_k\Gamma_l
  \Bigl((\mu^{\tilde\alpha}_{l_2} - \mu^\alpha_{l_2})\bar{C}^{\tilde\alpha}_{(k_1,k_2,l_1)}
  + (\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{l_2} - \mu^\alpha_{l_2})\bar{C}^{\tilde\alpha}_{(k_1,l_1)}\Bigr).
\end{aligned}
\]
In order to find a more explicit expression for Varα[Dα], we calculate \(\bar{C}^\alpha_k\) for k ∈ [K]^r, 2 ≤ r ≤ 4. Using that ω is an i.i.d. sequence w.r.t. Pα, we find for k ∈ [K]⁴
\[
\begin{aligned}
\bar{C}^\alpha_k
&= \int_{\Omega^N} \prod_{s=1}^4 \Bigl(N^{-1}\sum_{i=1}^N g_{k_s}(\omega_i) - \mu^\alpha_{k_s}\Bigr)\,dP_\alpha(\omega)
= N^{-4}\sum_{(i_1,\dots,i_4)\in[N]^4} \int_{\Omega^N} \prod_{s=1}^4 \bigl(g_{k_s}(\omega_{i_s}) - \mu^\alpha_{k_s}\bigr)\,dP_\alpha(\omega) \\
&= \frac{1 - N^{-1}}{N^2}\bigl(C^\alpha_{(k_1,k_2)}C^\alpha_{(k_3,k_4)} + C^\alpha_{(k_1,k_3)}C^\alpha_{(k_2,k_4)} + C^\alpha_{(k_1,k_4)}C^\alpha_{(k_2,k_3)}\bigr) + N^{-3}C^\alpha_k,
\end{aligned} \tag{11}
\]
resp. for k ∈ [K]³
\[
\bar{C}^\alpha_k
= \int_{\Omega^N} \prod_{s=1}^3 \Bigl(N^{-1}\sum_{i=1}^N g_{k_s}(\omega_i) - \mu^\alpha_{k_s}\Bigr)\,dP_\alpha(\omega)
= N^{-3}\sum_{(i_1,i_2,i_3)\in[N]^3} \int_{\Omega^N} \prod_{s=1}^3 \bigl(g_{k_s}(\omega_{i_s}) - \mu^\alpha_{k_s}\bigr)\,dP_\alpha(\omega)
= N^{-2}C^\alpha_k, \tag{12}
\]
and for k ∈ [K]²
\[
\bar{C}^\alpha_k
= \int_{\Omega^N} \prod_{s=1}^2 \Bigl(N^{-1}\sum_{i=1}^N g_{k_s}(\omega_i) - \mu^\alpha_{k_s}\Bigr)\,dP_\alpha(\omega)
= N^{-2}\sum_{(i_1,i_2)\in[N]^2} \int_{\Omega^N} \prod_{s=1}^2 \bigl(g_{k_s}(\omega_{i_s}) - \mu^\alpha_{k_s}\bigr)\,dP_\alpha(\omega) \tag{13}
= N^{-1}C^\alpha_k. \tag{14}
\]
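The two-index case (13)-(14) is the familiar fact that the covariance of sample means equals N⁻¹ times the single-observation covariance, which is easy to verify by simulation (our own sketch; the choice g1(x) = x, g2(x) = x² with uniform observations is purely illustrative):

```python
import random

random.seed(2)

def sample_mean_cov(n_obs, n_rep):
    """Monte Carlo estimate of Cov[g1_bar, g2_bar] with g1(x)=x, g2(x)=x**2, x ~ U(0,1)."""
    m1 = m2 = m12 = 0.0
    for _ in range(n_rep):
        xs = [random.random() for _ in range(n_obs)]
        g1 = sum(xs) / n_obs
        g2 = sum(x * x for x in xs) / n_obs
        m1 += g1; m2 += g2; m12 += g1 * g2
    m1 /= n_rep; m2 /= n_rep; m12 /= n_rep
    return m12 - m1 * m2

n = 20
c_bar = sample_mean_cov(n, 50_000)
# Single-observation covariance Cov[X, X^2] for X ~ U(0,1):
# E[X^3] - E[X] E[X^2] = 1/4 - 1/2 * 1/3 = 1/12.  Formula (14): C_bar = C / N.
assert abs(c_bar - (1.0 / 12.0) / n) < 2e-4
```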
Summing up, we find
\[
E_{\tilde\alpha}[D^\alpha]
= E_{\tilde\alpha}[D^{\tilde\alpha}] + \|\mu^{\tilde\alpha} - \mu^\alpha\|^2_\Gamma
= N^{-1}\sum_{k\in[K]^2} C^{\tilde\alpha}_k \Gamma_k + \|\mu^{\tilde\alpha} - \mu^\alpha\|^2_\Gamma, \tag{15}
\]
thus
\[
E_{\tilde\alpha}[D^\alpha] - E_\alpha[D^\alpha]
= N^{-1}\sum_{k\in[K]^2} \bigl(C^{\tilde\alpha}_k - C^\alpha_k\bigr)\Gamma_k + \|\mu^{\tilde\alpha} - \mu^\alpha\|^2_\Gamma.
\]
Furthermore, using (11), we find
\[
\begin{aligned}
\mathrm{Var}_\alpha[D^\alpha]
&= \sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\bigl(\bar{C}^\alpha_{(k,l)} - \bar{C}^\alpha_k\bar{C}^\alpha_l\bigr) \\
&= \bigl(N^{-2} - N^{-3}\bigr)\sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\bigl(C^\alpha_{(k_1,l_1)}C^\alpha_{(k_2,l_2)} + C^\alpha_{(k_1,l_2)}C^\alpha_{(k_2,l_1)}\bigr)
 + N^{-3}\sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\bigl(C^\alpha_{(k,l)} - C^\alpha_k C^\alpha_l\bigr) && (16) \\
&= 2\bigl(N^{-2} - N^{-3}\bigr)\sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\,C^\alpha_{(k_1,l_1)}C^\alpha_{(k_2,l_2)}
 + N^{-3}\sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\bigl(C^\alpha_{(k,l)} - C^\alpha_k C^\alpha_l\bigr) && (17) \\
&= 2N^{-2}\sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\,C^\alpha_{(k_1,l_1)}C^\alpha_{(k_2,l_2)}
 + N^{-3}\sum_{k,l\in[K]^2} \Gamma_k\Gamma_l\bigl(C^\alpha_{(k,l)} - C^\alpha_k C^\alpha_l - 2C^\alpha_{(k_1,l_1)}C^\alpha_{(k_2,l_2)}\bigr), && (18)
\end{aligned}
\]
and
\[
\mathrm{Var}_{\tilde\alpha}[D^\alpha] - \mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}]
= 4N^{-2}\sum_{k,l\in[K]^2}\Gamma_k\Gamma_l
\Bigl((\mu^{\tilde\alpha}_{l_2} - \mu^\alpha_{l_2})C^{\tilde\alpha}_{(k_1,k_2,l_1)}
 + N(\mu^{\tilde\alpha}_{k_2} - \mu^\alpha_{k_2})(\mu^{\tilde\alpha}_{l_2} - \mu^\alpha_{l_2})C^{\tilde\alpha}_{(k_1,l_1)}\Bigr),
\]
so that
\[
\frac{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}{\mathrm{Var}_\alpha[D^\alpha]}
= \frac{\mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}]}{\mathrm{Var}_\alpha[D^\alpha]}
\Bigl(1 + \frac{\mathrm{Var}_{\tilde\alpha}[D^\alpha] - \mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}]}{\mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}]}\Bigr)
= \Bigl(1 + \frac{\mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}] - \mathrm{Var}_\alpha[D^\alpha]}{\mathrm{Var}_\alpha[D^\alpha]}\Bigr)
\Bigl(1 + \frac{\mathrm{Var}_{\tilde\alpha}[D^\alpha] - \mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}]}{\mathrm{Var}_{\tilde\alpha}[D^{\tilde\alpha}]}\Bigr).
\]
It is important to note that therefore \(\frac{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}{\mathrm{Var}_\alpha[D^\alpha]} = 1 + O(\sqrt{N^{-1}})\) if \(\|\tilde\alpha - \alpha\|\) is of order \(O(\sqrt{N^{-1}})\). Note also that, setting \(x := x(\alpha, \tilde\alpha, N) := E_{\tilde\alpha}[X^\alpha]\), we find
\[
x = \frac{E_{\tilde\alpha}[D^\alpha] - E_\alpha[D^\alpha]}{\sqrt{\mathrm{Var}_\alpha[D^\alpha]}}
= \frac{N\|\mu^{\tilde\alpha} - \mu^\alpha\|^2_\Gamma + \sum_{k\in[K]^2}\bigl(C^{\tilde\alpha}_k - C^\alpha_k\bigr)\Gamma_k}
{\sqrt{2\sum_{k,l\in[K]^2}\Gamma_k\Gamma_l\,C^\alpha_{(k_1,l_1)}C^\alpha_{(k_2,l_2)}}} + O\bigl(\sqrt{N^{-1}}\bigr). \tag{19}
\]
Proposition 4.1. For x̃ < x we have
\[
P_{\tilde\alpha}\bigl(X^\alpha \le \tilde{x}\bigr)
\le \frac{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}{\mathrm{Var}_\alpha[D^\alpha]}\,(x - \tilde{x})^{-2}
=: q(\tilde{x}, \alpha, \tilde\alpha, N) = q. \tag{20}
\]
Proof. For x̃ ∈ R and α ∈ A we have
\[
P_{\tilde\alpha}\bigl(X^\alpha \le \tilde{x}\bigr)
= P_{\tilde\alpha}\Bigl(\frac{X^\alpha - E_{\tilde\alpha}[X^\alpha]}{\sqrt{\mathrm{Var}_{\tilde\alpha}[X^\alpha]}}
  \le \frac{\tilde{x} - E_{\tilde\alpha}[X^\alpha]}{\sqrt{\mathrm{Var}_{\tilde\alpha}[X^\alpha]}}\Bigr)
= P_{\tilde\alpha}\Bigl(\frac{D^\alpha - E_{\tilde\alpha}[D^\alpha]}{\sqrt{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}}
  \le (\tilde{x} - x)\sqrt{\frac{\mathrm{Var}_\alpha[D^\alpha]}{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}}\Bigr).
\]
For x̃ < x, again by Chebyshev's inequality, we have
\[
P_{\tilde\alpha}\bigl(X^\alpha \le \tilde{x}\bigr)
\le P_{\tilde\alpha}\Bigl(\Bigl|\frac{D^\alpha - E_{\tilde\alpha}[D^\alpha]}{\sqrt{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}}\Bigr|
  \ge (x - \tilde{x})\sqrt{\frac{\mathrm{Var}_\alpha[D^\alpha]}{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}}\Bigr)
\le \frac{\mathrm{Var}_{\tilde\alpha}[D^\alpha]}{\mathrm{Var}_\alpha[D^\alpha]}\,(x - \tilde{x})^{-2}. \qquad\square
\]
Note that Proposition 4.1 implies that for all α, α0 ∈ A with \(\|\mu^{\alpha_0} - \mu^\alpha\|_\Gamma > 0\) and for any fixed \(\tilde{x} \in \mathbb{R}\), we have \(\lim_{N\to\infty} P_{\alpha_0}(X^\alpha \le \tilde{x}) = 0\), since in this case (see equation (19)) \(x(\alpha, \alpha_0, N) \to \infty\) as N → ∞. Hence Xα diverges to +∞ in law w.r.t. Pα0 for all α ∈ A such that \(\|\mu^{\alpha_0} - \mu^\alpha\|_\Gamma > 0\). Intuitively speaking, this means (for a large enough sample size) that Xα increases steeply around α0, resulting in a small area of non-rejectable parameter choices α ∈ A and indicating a high precision of our estimator \(\hat\alpha\).
Note that for all α0 ∈ A, \(X^{\alpha_0}\) converges in law w.r.t. Pα0, by the multivariate central limit theorem, to a finite-valued (centralized and normalized χ²-type of) probability distribution. Assuming for α, α0 ∈ A that \(\|\mu^{\alpha_0} - \mu^\alpha\|_\Gamma = 0\) iff α = α0, this means that estimation by minimizing Xα is in fact a minimum contrast estimation with the contrast function \(g(\cdot, \alpha_0) := K\,\mathbf{1}_{A\setminus\{\alpha_0\}}\) defined on A for any K > 0, since \(\liminf_{N\to\infty}(X^\alpha - X^{\alpha_0}) \ge g(\alpha, \alpha_0)\) w.r.t. Pα0 for all α, α0 ∈ A. See e.g. Guyon (1995) or Anh et al. (2007) for minimum contrast estimators. However, due to Chebyshev's inequality, we can in addition reject models at a given confidence level, depending on the minimal Xα.
The extreme type of contrast function leads us to expect that the discriminative power of our estimator will be strong for sufficiently large N. Intuitively speaking, Proposition 4.1 allows us to asymptotically determine the precision of the estimator given as the optimal \(\hat\alpha \in A\) minimizing Xα. Consider the following situation:
Let Uα ⊆ A be a neighborhood of α such that on Uα the first order approximation µα =
µα +∇µα · (α− α) is valid up to a precision of order O(N−1). We then have ‖µα − µα‖2Γ =
‖α− α‖Γα + O(N−2) for Γα := (∇µα)tΓ∇µα. Assume that Γα is again strictly positive with
smallest eigenvalue λ > 0. It follows on Uα that N‖µα − µα‖2Γ ≥ Nλ‖α− α‖2 + O(N−1) and
if ‖α− α‖ is of magnitude
\[
d := \sqrt{\frac{a+1}{\lambda}}\left(\frac{\operatorname{Var}_{\hat\alpha}[D_{\hat\alpha}]}{p_0}\right)^{1/4}\sqrt{N^{-1}}, \tag{21}
\]
for $a > 0$, we find approximately $\bar x \ge (a+1)\sqrt{p_0^{-1}} + O(\sqrt{N^{-1}})$.
If $X_{\hat\alpha} := X_{\hat\alpha}(\omega) \le \sqrt{p_0^{-1}}$ is minimal at $\hat\alpha \in A$, then for large enough $N$ the probability $q$ of observing $X_{\bar\alpha} \le X_{\hat\alpha}$, if the true law was in fact $P_{\bar\alpha}$ for a $\bar\alpha \in A$ such that $\|\bar\alpha - \hat\alpha\|$ is of magnitude $d$, is approximately bounded by
\[
q(X_{\hat\alpha}, \hat\alpha, \bar\alpha, N) = \frac{\operatorname{Var}_{\bar\alpha}[D_{\hat\alpha}]}{\operatorname{Var}_{\hat\alpha}[D_{\hat\alpha}]}\bigl(\bar x - X_{\hat\alpha}\bigr)^{-2}
= \bigl(\bar x - X_{\hat\alpha}\bigr)^{-2} + O\bigl(\sqrt{N^{-1}}\bigr)
\le \Bigl((a+1)\sqrt{p_0^{-1}} - \sqrt{p_0^{-1}}\Bigr)^{-2} + O\bigl(\sqrt{N^{-1}}\bigr)
= p_0\, a^{-2} + O\bigl(\sqrt{N^{-1}}\bigr).
\]
Conversely, if for fixed $a > 0$ a precision of $d$ is desired for the estimation of the optimal $\alpha$, a worst-case sample size $N$ can be approximately calculated from equation (21).
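For a quick back-of-the-envelope version of this calculation, equation (21) can be rearranged for $N$. The following sketch uses purely hypothetical values for $a$, $\lambda$, $\operatorname{Var}_{\hat\alpha}[D_{\hat\alpha}]$ and $p_0$:

```python
import math

def worst_case_sample_size(d, a, lam, var_D, p0):
    """Rearrange equation (21), d = sqrt((a+1)/lam) * (var_D/p0)**0.25 * sqrt(1/N),
    to obtain the sample size N needed for a desired precision d."""
    return (a + 1.0) / lam * math.sqrt(var_D / p0) / d**2

# Illustrative (hypothetical) numbers: precision d = 0.1, a = 1,
# smallest eigenvalue lam = 1, Var[D] = 0.04, p0 chosen so sqrt(1/p0) = 5.
p0 = 1.0 / 25.0
N = worst_case_sample_size(d=0.1, a=1.0, lam=1.0, var_D=0.04, p0=p0)
print(round(N, 1))  # -> 200.0
```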
In order to estimate the optimal $\alpha \in A$, we suggest starting by minimizing $D_{\bar\alpha}$ over $A$. In a second step, $X_{\bar\alpha}$ can then be minimized by a gradient search algorithm. Alternatively, for large $N$, we approximately have
\[
X_{\bar\alpha} = \frac{D_{\bar\alpha}}{\sqrt{\operatorname{Var}_{\bar\alpha}[D_{\bar\alpha}]}} = \frac{N D_{\bar\alpha}}{\sqrt{2\sum_{k,l\in[K]^2}\Gamma_k\Gamma_l\, C^{\bar\alpha}_{(k_1,l_1)}C^{\bar\alpha}_{(k_2,l_2)}}}, \tag{22}
\]
and the right-hand side of the last expression can be minimized.
If the optimal $\alpha$ is expected to be near some $\alpha_0 \in A$, then $\Gamma$ can be chosen such that $\Gamma_{\alpha_0}$ equals the identity matrix, so that the minimal eigenvalue $\lambda$ will be approximately $1$ as well.
There are many possible choices for the generalized moments $g$. E.g., one could include all moments of $G$ up to some order, leading to an improved fit of the Laplace transform near the origin. Alternatively (or in addition), one can as well include exponential moments of the type $\exp(-\gamma \cdot G)$ in $g$ for $\gamma \in \mathbb{R}^m \setminus \{0\}$. For computational efficiency it is crucial that all the corresponding generalized moments $\mu^\alpha$ are known or can be computed for all $\alpha \in A$. E.g., if $L_G$ is known, then exponential moments can be calculated, since $L^\alpha_G = L_G(\alpha + \cdot)/L_G(\alpha)$.
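The relation $L^\alpha_G = L_G(\alpha + \cdot)/L_G(\alpha)$ can be checked numerically. A minimal sketch, assuming $G \sim \mathrm{Exp}(1)$ (so that $L_G(s) = 1/(1+s)$) and approximating the exponentially tilted expectation by weighted Monte Carlo:

```python
import math
import random

random.seed(0)

def L_G(s):
    # Laplace transform of G ~ Exp(1): E[exp(-s G)] = 1/(1+s)
    return 1.0 / (1.0 + s)

alpha, s = 0.7, 1.3
# closed form from the relation L_G^alpha(s) = L_G(alpha + s) / L_G(alpha)
closed = L_G(alpha + s) / L_G(alpha)

# Monte Carlo check: E[exp(-s G) exp(-alpha G)] / E[exp(-alpha G)]
g = [random.expovariate(1.0) for _ in range(200_000)]
num = sum(math.exp(-(alpha + s) * x) for x in g) / len(g)
den = sum(math.exp(-alpha * x) for x in g) / len(g)
print(abs(num / den - closed) < 0.01)  # -> True
```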
In a non-parametric approach one could choose the components of g to be orthogonal at
α0 ∈ A, e.g. by orthogonalizing a general g or by choosing orthogonal polynomials as the
components of g.
5. Applications
In this section we apply our approach to two statistical problems.
5.1. Fitting one-dimensional data. In order to illustrate the proposed method, we apply
the generalized moment fit to simulated artificial data having realistic parameters related to
the practice of (operational) risk management.
Our main aims are to compare the generalized moments fit with the maximum likelihood
estimator in application to various data sets and to show that the use of these two methods
simultaneously can increase the confidence in the resulting fit and make the modelling more
reliable.
Recall the common setting: one has a sample $\omega = (\omega_i)_{i\in[N]} \in \Omega^N$, drawn under the law $P_{\alpha_0}$, and wishes to estimate the parameter vector $\alpha_0$. Assume that we deal with the one-dimensional case and our sample is a sequence of i.i.d. random variables $X_1, \dots, X_N$, sampled from a distribution $F_\alpha$ for some $\alpha \in A$.
Recall the quadratic goodness-of-fit measure $D_\alpha$ defined by (7) and take the matrix $\Gamma$ simply as the identity matrix, so that the expression $D_\alpha$ reduces to the sum
\[
D_\alpha := \sum_{k=1}^{K} D^\alpha_k := \sum_{k=1}^{K} \bigl(g_k - \mu^\alpha_k\bigr)^2. \tag{23}
\]
Minimizing $D_\alpha$ would correspond to the "naive" way of estimating $\alpha$. Instead, we minimize the centralized and normalized statistic $X_\alpha$, defined as in (8).
In order to use the expressions for $\mathbf{E}_\alpha[D_\alpha]$ and $\operatorname{Var}_\alpha[D_\alpha]$ obtained in Section 4, let us first rewrite the definitions of $C_k(\alpha, \bar\alpha)$ and $\bar C_k(\alpha, \bar\alpha)$ given by (9), which become simpler in our case:
\[
C^\alpha_{k,k} = \mathbf{E}_\alpha\bigl[(g_k - \mu^\alpha_k)^2\bigr], \qquad\text{resp.}\qquad \bar C^\alpha_{k,k} = \operatorname{Var}_\alpha[g_k], \tag{24}
\]
where, as throughout this section, $\operatorname{Var}_\alpha[g_k]$ denotes the single-observation variance of the $k$-th generalized moment. Thus, for $k = 1, \dots, K$,
\[
C^\alpha_{k,k} = \mathbf{E}_\alpha\bigl[(g_k - \mu^\alpha_k)^2\bigr] = N^{-1}\operatorname{Var}_\alpha[g_k] = N^{-1}\bar C^\alpha_{k,k}, \tag{25}
\]
and so the expression for the numerator in (8) reduces to
\[
\mathbf{E}_\alpha[D_\alpha] = N^{-1}\sum_{k=1}^{K} \operatorname{Var}_\alpha[g_k], \tag{26}
\]
which is a special case of (15).
Consider now a simple choice of the moments $g_k$ for the fitting. In particular, let us base the estimation of the parameter vector $\alpha$ on the Laplace transform of the distribution $F_\alpha$, taken on a chosen grid within some fixed interval. That is, $\mu^\alpha_k := \int \exp(-\tau_k x)\,dF_\alpha(x) = \mathbf{E}_\alpha[\exp(-\tau_k X_1)] = \mathbf{E}_\alpha[g_k]$ and $g_k := N^{-1}\sum_{i=1}^{N} \exp(-\tau_k X_i)$, where $\tau_k = \tau_0 + (k-1)\,T/(K-1)$, $k = 1, \dots, K$. The parameters $\tau_0 \ge 0$ and $T > 0$ define the interval on the support of the Laplace transform on which the estimation is performed.
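In code, the grid and the empirical generalized moments read, for instance, as follows (a minimal sketch; the function names are ours):

```python
import math

def tau_grid(tau0, T, K):
    # uniform grid tau_k = tau0 + (k-1) * T/(K-1), k = 1..K
    return [tau0 + k * T / (K - 1) for k in range(K)]

def empirical_laplace_moments(sample, taus):
    # g_k = N^{-1} sum_i exp(-tau_k * X_i)
    N = len(sample)
    return [sum(math.exp(-t * x) for x in sample) / N for t in taus]

taus = tau_grid(0.5, 2.5, 6)
print(taus[0], taus[-1])  # -> 0.5 3.0
g = empirical_laplace_moments([0.2, 1.1, 3.5], taus)
print(all(0.0 < gk < 1.0 for gk in g))  # -> True (Laplace moments of positive data lie in (0, 1))
```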
Remark 5.1. Though in our setting it is the uniform grid on the interval $[\tau_0, \tau_0 + T]$ that is chosen for the estimation, the argument below is easily adapted to more complex grid structures.
In the following argument, we denote for the sake of simplicity $\mu_{[\tau_k]} \equiv \mu^\alpha_k$. Note that $\mu_{[\tau_l + \tau_k]} = \mathbf{E}_\alpha[\exp(-(\tau_l + \tau_k)X_1)]$.

Recalling (26), note that $\operatorname{Var}_\alpha[g_k] = \mu_{[2\tau_k]} - \mu^2_{[\tau_k]}$, and that in the specified case the statistic $X_\alpha$ can be written as
\[
X_\alpha := \frac{\sum_k \bigl[D^\alpha_k - \mathbf{E}_\alpha[D^\alpha_k]\bigr]}{\sqrt{\operatorname{Var}_\alpha[D_\alpha]}}
= \frac{\sum_k \bigl[N\bigl(g_k - \mu_{[\tau_k]}\bigr)^2 - \operatorname{Var}_\alpha(g_k)\bigr]}{N\sqrt{\operatorname{Var}_\alpha[D_\alpha]}}. \tag{27}
\]
The calculation of the quantity $\operatorname{Var}_\alpha[D_\alpha]$, forming the denominator of (27), in this case reduces to the computation of a double sum containing the cross-products as well as the squares of the quantities $D^\alpha_k$:
\[
\operatorname{Var}_\alpha[D_\alpha] = \sum_{k=1}^{K}\sum_{l=1}^{K} \operatorname{Cov}\bigl(D^\alpha_k, D^\alpha_l\bigr)
= \sum_{k=1}^{K}\sum_{l=1}^{K} \Bigl(\mathbf{E}_\alpha\Bigl[\bigl(g_k - \mu_{[\tau_k]}\bigr)^2\bigl(g_l - \mu_{[\tau_l]}\bigr)^2\Bigr] - \mathbf{E}_\alpha[D^\alpha_k]\,\mathbf{E}_\alpha[D^\alpha_l]\Bigr).
\]
The latter expression can be simplified significantly. Recall formula (17), consider the special choice of the matrix $\Gamma$ as the identity matrix, and check that in our case it reduces to
\begin{align*}
\operatorname{Var}_\alpha[D_\alpha] &= 2N^{-2}\sum_{k,l=1,\dots,K} \bigl[\operatorname{Cov}_\alpha(g_k, g_l)\bigr]^2 \\
&\quad + N^{-3}\sum_{k,l=1,\dots,K} \Bigl(\mathbf{E}_\alpha\Bigl[\bigl(g_k - \mu_{[\tau_k]}\bigr)^2\bigl(g_l - \mu_{[\tau_l]}\bigr)^2\Bigr] - 2\bigl[\operatorname{Cov}_\alpha(g_k, g_l)\bigr]^2 - \operatorname{Var}_\alpha(g_k)\operatorname{Var}_\alpha(g_l)\Bigr) \\
&= 2N^{-2}\sum_{k,l=1,\dots,K} \bigl(\mu_{[\tau_k+\tau_l]} - \mu_{[\tau_l]}\mu_{[\tau_k]}\bigr)^2 \\
&\quad + N^{-3}\sum_{k,l=1,\dots,K} \Bigl(\mu_{[2(\tau_k+\tau_l)]} - 2\mu_{[\tau_k+\tau_l]}^2 - \mu_{[2\tau_k]}\mu_{[2\tau_l]} + 2\mu_{[2\tau_l]}\mu_{[\tau_k]}^2 + 2\mu_{[2\tau_k]}\mu_{[\tau_l]}^2 \\
&\qquad\qquad - 2\mu_{[\tau_l+2\tau_k]}\mu_{[\tau_l]} - 2\mu_{[\tau_k+2\tau_l]}\mu_{[\tau_k]} + 8\mu_{[\tau_k+\tau_l]}\mu_{[\tau_k]}\mu_{[\tau_l]} - 6\mu_{[\tau_k]}^2\mu_{[\tau_l]}^2\Bigr). \tag{28}
\end{align*}
Here $\operatorname{Cov}_\alpha(g_k, g_l)$ and the fourth-moment term again refer to single-observation quantities.
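To make the construction concrete, here is a minimal sketch (not the authors' implementation) that evaluates $X_\alpha$ from (27), with the denominator computed from the closed form (28). The theoretical moment function `mu` and the sample are illustrative assumptions, shown for a unit-rate exponential model where $\mu_{[\tau]} = 1/(1+\tau)$:

```python
import math
import random

def X_stat(sample, taus, mu):
    """Centralized and normalized statistic X_alpha of (27); the denominator
    Var_alpha[D_alpha] is computed from the closed form (28).
    `mu` is the theoretical moment function tau -> mu_[tau] under alpha."""
    N = len(sample)
    g = [sum(math.exp(-t * x) for x in sample) / N for t in taus]
    # numerator of (27): sum_k [ N (g_k - mu_[tau_k])^2 - Var_alpha(g_k) ]
    var1 = {t: mu(2 * t) - mu(t) ** 2 for t in taus}   # single-observation variances
    num = sum(N * (gk - mu(t)) ** 2 - var1[t] for gk, t in zip(g, taus))
    # Var_alpha[D_alpha] from (28)
    var_D = 0.0
    for tk in taus:
        for tl in taus:
            cov = mu(tk + tl) - mu(tk) * mu(tl)
            fourth = (mu(2 * (tk + tl)) - 2 * mu(tk + tl) ** 2
                      - mu(2 * tk) * mu(2 * tl)
                      + 2 * mu(2 * tl) * mu(tk) ** 2 + 2 * mu(2 * tk) * mu(tl) ** 2
                      - 2 * mu(tl + 2 * tk) * mu(tl) - 2 * mu(tk + 2 * tl) * mu(tk)
                      + 8 * mu(tk + tl) * mu(tk) * mu(tl)
                      - 6 * mu(tk) ** 2 * mu(tl) ** 2)
            var_D += 2 * N**-2 * cov**2 + N**-3 * fourth
    return num / (N * math.sqrt(var_D))

random.seed(1)
sample = [random.expovariate(1.0) for _ in range(2000)]
x = X_stat(sample, [0.5, 1.0, 1.5, 2.0], lambda t: 1.0 / (1.0 + t))
print(abs(x) < 5.0)  # X is approximately standardized under the true model
```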
5.2. Estimation of the Generalized Pareto Distribution. Let us turn now to the implementation of this method in the case when the c.d.f. $F_\alpha$ belongs to a specified family of distribution functions. As we are interested in applications to risk management, we refer to the so-called Extreme Value Theory.
The Extreme Value Theory (EVT) originates from the works of Fisher and Tippett, who investigated limiting distributions of maxima and minima of a random sample. The fundamental theorem of EVT states that the limit distribution of (properly shifted and normalized) maxima of a random population can be represented by a family of distributions of the form
\[
H(x) = \exp\Bigl\{-\Bigl[1 + \xi\Bigl(\frac{x}{\sigma}\Bigr)\Bigr]^{-1/\xi}\Bigr\}, \qquad \text{where } 1 + \xi\Bigl(\frac{x}{\sigma}\Bigr) > 0,
\]
which is known as the Generalized Extreme Value (GEV) family.
Using a straightforward argument, one can pass from the GEV to the Generalized Pareto distribution (GPD) as a limit distribution for the scaled exceedances over a sufficiently high threshold:
\[
G(x) = \begin{cases}
1 - \bigl[1 + \xi\bigl(\frac{x}{\sigma}\bigr)\bigr]^{-1/\xi}, & \xi \neq 0,\ \sigma > 0,\\[2pt]
1 - \exp(-x/\sigma), & \xi = 0,
\end{cases} \tag{29}
\]
where the exponential distribution appears as the limiting case for $\xi = 0$. When the shape parameter satisfies $\xi \ge 1$, the first moment does not exist.
The GPD is widely used in applications, including actuarial practice, where rare and severe events appear naturally; thus the tails of distributions are of particular interest, and fitting the data that exceed a high threshold is an appropriate methodology. For GPD estimation approaches see e.g. Castillo and Hadi (1997), Choulakian and Stephens (2001) and Luceño (2006).
The common methodology for fitting the GPD in most practical applications is maximum likelihood estimation. To estimate the shape parameter $\xi$, critical for heavy tails, methods such as the Hill estimator are often used. Dealing with actuarial data, one is usually restricted to limited databases embracing several years or even only months of historical losses. Thus the reliability of the common estimation methods may become questionable. A thorough discussion of statistical estimation using MLE, the Hill estimator and other techniques on coarse actuarial data can be found in Embrechts et al. (1997).
We are going to work with a subclass of the GPD in a slightly different parametrization than (29), with c.d.f. and p.d.f. respectively
\[
F_{a,\sigma}(x) = 1 - \Bigl(1 + \frac{x}{a\sigma}\Bigr)^{-a}, \qquad
f_{a,\sigma}(x) = \frac{1}{\sigma}\Bigl(1 + \frac{x}{a\sigma}\Bigr)^{-a-1}, \qquad a, \sigma > 0,\ x > 0. \tag{30}
\]
Working with logarithmically transformed losses simplifies the setup, as the exponential distribution comes into play:
\[
F_{a,\sigma}(x) = 1 - \exp\bigl(-a\bigl(\ln(x + a\sigma) - \ln(a\sigma)\bigr)\bigr). \tag{31}
\]
Note that, nevertheless, the GPD does not belong to an exponential family of distributions: after introducing the notation $y := \ln(x + a\sigma)$, we still have the parameters $a$ and $\sigma$ included in the new argument.
Remark 5.2. We remark on the range of the shape parameter. In the original definition of the GPD, the range of values of the shape parameter $\xi$ can be naturally divided into $\xi < 0$, $\xi = 0$ and $\xi > 0$, such that $\xi > 0$ can be considered the heavy-tailed case and $\xi < 0$ corresponds to (in general) lighter tails than the exponential distribution. In our application, we are particularly interested in the heavy-tailed region of parameters. That is why we restrict ourselves to the range $\xi > 0$; moreover, the re-parametrization $a = 1/\xi$ allows us to concentrate on the heavy-tailed side of the shape parameter, corresponding to small $a$. We exclude the cases $a = 0$ and $\sigma = 0$, the border of the possible parameter range, as these cases are not reasonable in applications. At the same time, the case $\xi = 0$, corresponding to $a \to \infty$, can be formally included in our model.
One may use this notation to obtain the quantities connected with the statistic $X_\alpha$. Specifically, we have the following expressions for the first and second exponential moments, as well as for the variance:
\[
\mu_{[\tau_k]} = \frac{a\,(a\sigma)^{-\tau_k}}{a + \tau_k}, \qquad
\mu_{[2\tau_k]} = \frac{a\,(a\sigma)^{-2\tau_k}}{a + 2\tau_k}, \qquad
\operatorname{Var}_\alpha[g_k] = \frac{a\,(a\sigma)^{-2\tau_k}\,\tau_k^2}{(a + 2\tau_k)(a + \tau_k)^2}.
\]
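These closed-form moments can be cross-checked by simulation. The sketch below samples from (30) by inversion ($x = a\sigma((1-u)^{-1/a} - 1)$, obtained by inverting the c.d.f.) and compares the empirical exponential moment of $Y = \ln(X + a\sigma)$ with the formula for $\mu_{[\tau]}$:

```python
import random

def gpd_sample(a, sigma, n, rng):
    # inverse of (30): F(x) = 1 - (1 + x/(a*sigma))**(-a)
    return [a * sigma * ((1.0 - rng.random()) ** (-1.0 / a) - 1.0) for _ in range(n)]

def mu_tau(a, sigma, tau):
    # closed-form exponential moment of Y = ln(X + a*sigma):
    # mu_[tau] = a * (a*sigma)**(-tau) / (a + tau)
    return a * (a * sigma) ** (-tau) / (a + tau)

rng = random.Random(42)
a, sigma = 2.0, 3.0
xs = gpd_sample(a, sigma, 200_000, rng)
tau = 0.8
empirical = sum((x + a * sigma) ** (-tau) for x in xs) / len(xs)
print(abs(empirical - mu_tau(a, sigma, tau)) < 0.005)  # -> True
```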
Expressions for the mixed moments are also easy to write down. Using (27) and (28), the statistic $X_\alpha$ can be calculated, and the task of its minimization can be solved in this particular case.

The minimization for the moment fit can then be performed by a regular grid search algorithm. Certainly, this is just the simplest version of the optimization scheme, suitable for the one-dimensional case. In higher dimensions, searching for the minimum on a grid should be replaced by a suitable gradient search algorithm or other relevant minimization techniques.
In our application, minimization of the statistic $X_\alpha$ in question was performed using a numerical S-Plus procedure based on a quasi-Newton method using the double dogleg step with the BFGS secant update to the Hessian. In order to apply the minimization procedure, one needs to choose the domain of the parameter search. In the present algorithm, we used the method based on p-value estimation outlined in the discussion part of Section 3. Specifically, we select a preliminary (wider) domain, estimate the range of p-values associated with the statistic $X_\alpha$ using the Chebyshev inequality as formulated in Proposition 4.1, and exclude from consideration the values with $X_\alpha > \sqrt{p_0^{-1}}$ (where $p_0$ is pre-chosen). We select $p_0$ such that $\sqrt{p_0^{-1}} = 5$.
Anticipating the application results, we show the plot of the statistic $-X_\alpha$ for a pseudo-random sample simulated from the GPD in Figure 1. We note that the statistic as a function of the parameters is (visually) convex in a certain region, and that the range of parameters whose p-values exclude them from the vicinity of the minimum is not hard to select (although the minimization can be a somewhat time-consuming numerical problem due to the slow variation of $-X_\alpha$ w.r.t. the scale parameter).
Figure 1. Plot of the statistic $-X_\alpha$ for a pseudo-random sample generated from the GPD with parameters $(1, 3)$. (Axes: scale $1$–$5$, shape $1$–$4$, with $-X_\alpha$ on the vertical axis.)
Applying this estimation to samples simulated from the generalized Pareto distribution, one can trace the change of statistical properties of the obtained estimates with the change of the sample size. If we apply maximum likelihood estimation to the same samples, a direct comparison of the two estimates becomes possible.

The following table indicates the values of the observed mean and standard deviation of the estimates $(a, \sigma)$. The samples of selected sizes $N$ were simulated from the GPD with parameters $(a = 2, \sigma = 3)$; for each fixed sample size, 1000 repetitions were made.
One observes that the precision of the generalized moment fit tends to be better than that of the MLE for smaller sample sizes. However, for relatively large samples the MLE still demonstrates a comparable performance.

            GMF (based on Laplace moments)       MLE
  N         mean             stand.d.            mean             stand.d.
  100       (2.21, 2.96)     (0.91, 0.52)        (2.47, 3.068)    (1.51, 0.56)
  200       (2.08, 2.98)     (0.49, 0.37)        (2.16, 3.03)     (0.65, 0.38)
  500       (2.02, 2.999)    (0.287, 0.235)      (2.05, 3.014)    (0.298, 0.235)
  1500      (2.015, 3.006)   (0.161, 0.136)      (2.023, 3.01)    (0.161, 0.135)
  5000      (2.006, 3.004)   (0.086, 0.072)      (2.006, 3.002)   (0.086, 0.072)

Table 1. Comparison of the generalized moment fit based on the Laplace transform with the MLE.

In order to improve the generalized moment fit, we could extend the model further by including the first and higher moments into the estimation, besides the part based on the Laplace transform.
Specifically, assume again that we have a sample $X_i$ from the GPD (30). Then, besides fitting the Laplace moments $\mathbf{E}_\alpha[\exp(-\tau_k Y^\alpha_1)]$ of the transformed variables $Y^\alpha_i = \ln(X_i + a\sigma)$, the first moment $\mu^{(1)}_\alpha = \mathbf{E}_\alpha[Y^\alpha]$ can be added to the estimation. That is, one could consider the combined statistic $\tilde D_\alpha$ simply as the sum of the $D_\alpha$ used before and a term $D^{(1)}_\alpha$ based on the first moment:
\[
\tilde D_\alpha = N\sum_{k=1}^{K}\bigl(g_k - \mu_{[\tau_k]}\bigr)^2 + \bigl(g^{(1)} - \mu^{(1)}_\alpha\bigr)^2.
\]
Then the centralized and normalized statistic $X_\alpha$ to be optimized should also be updated (recall (27) and plug in $\tilde D_\alpha$ instead of $D_\alpha$). Naturally, the model could be extended further by including higher moments.
However, the contributions of the Laplace moments and the standard moments may turn out to be rather unbalanced. For significantly skewed distributions, the second moment can already exceed the first moment by orders of magnitude, and the first moment can be significantly larger than the sum of the Laplace moments over the chosen grid. Thus the magnitudes of the corresponding statistics $D_\alpha$, $D^{(1)}_\alpha$, etc. can also be of different orders, which might lead to a bias of the estimate based on the combined fit of this type.
This can be seen from Figure 2, which illustrates the quality of the fit in the Laplace moments for the combined fit in the form described above (with the second moment $\mu^{(2)}_\alpha = \mathbf{E}_\alpha[(Y^\alpha)^2]$ included), and also for the combined fit with normalized components, to be described below. One observes a significant error in the Laplace-moment fit of the described type, which is obviously caused by a better fit of the first and/or second moments. The corresponding estimate obtained by the described fit is significantly biased: $(2.74, 2.92)$, with the true parameters being equal to $(2.0, 3.0)$.
Figure 2. Plots indicating the difference between the observed and the analytically calculated magnitudes of the Laplace transform on the selected grid (horizontal axis: selected support for the Laplace-moment fit; vertical axis: magnitude of the Laplace transform), corresponding to the estimates obtained by the generalized moment fit (Laplace fit combined with the first and the second moments) in two different setups ($X_\alpha$ and $\tilde X_\alpha$). The curves show the observed and theoretical Laplace moments for the combined fit and for the combined and normalized fit.
To overcome this difficulty, one can normalize each of the statistics $D_\alpha$, $D^{(1)}_\alpha, \dots$ separately before summing them up. An easy way to demonstrate the construction of such an estimate is to consider the fit based on the Laplace moments and the first moment, recalling the representation (7) of the statistic $D_\alpha$ via a positive symmetric matrix $\Gamma$. We intend to construct a new statistic $\tilde D_\alpha$ based on the normalized $D_\alpha$ and the normalized $D^{(1)}_\alpha$. Note that the arguments of Section 4 allow $\Gamma$ to depend on $\alpha$. We assumed $\Gamma$ to be the identity matrix previously in this section; let it now be a diagonal matrix with the first $K$ diagonal elements taking the values $1/\sqrt{\operatorname{Var}[D^\alpha_k]}$ and the $(K+1)$-th diagonal element taking the value $1/\sqrt{\operatorname{Var}[D^{(1)}_\alpha]}$. Then one can use formula (7) to obtain $\tilde D_\alpha$, and the statistic $\tilde X_\alpha$ can then be calculated.
It is also possible to construct $\tilde D_\alpha$ in a simpler way, based on the variance of the summed statistic $D_\alpha$: the first $K$ diagonal elements then take the same value $1/\sqrt{\operatorname{Var}[D_\alpha]}$. Numerical modelling shows that this particular construction of the combined moment fit leads to quite satisfactory results. This is illustrated by Figure 2 (see the part of the plot concerning the combined and normalized fit). The corresponding estimate obtained by the described fit with normalized components is $(2.04, 2.98)$ (with the true parameters equal to $(2.0, 3.0)$).
The performance of the fit is also illustrated by the series of numerical examples summarized in Table 2. Thus, the generalized moment fit of the specified form proves to be at least as efficient as the MLE also for large sample sizes up to $10^4$. At the same time, for smaller sample sizes, the moment fit is significantly better. This can also be seen from the graphical illustration: Figure 3 depicts the comparison between the generalized moment fit and the MLE.
We complete the numerical example for simulated data with a brief comparison with the classical moment fit, i.e. the one that does not include the centralization and normalization of the optimized statistic.

The method of moments is used in many applications of statistics, including the analysis of heavy-tailed data, see e.g. Panjer (2006). If the theoretical moments have simple analytical representations, the method can be reduced to solving a system of equations with respect to the distribution parameters. In other cases, numerical optimization is needed. In our setting with Laplace moments, the classical way of fitting would correspond to the minimization of the statistic $D_\alpha$, without passing to the centralized and normalized statistic $X_\alpha$.
            GMF (norm. Laplace + norm. first moment)     MLE
  N         mean              stand.d.          mean              stand.d.
  50        (2.53, 3.04)      (1.70, 0.72)      (2.91, 3.16)      (2.64, 0.79)
  100       (2.18, 3.04)      (0.75, 0.53)      (2.53, 3.08)      (1.50, 0.54)
  200       (2.13, 3.009)     (0.49, 0.37)      (2.166, 3.02)     (0.65, 0.37)
  500       (2.07, 3.03)      (0.29, 0.23)      (2.07, 3.03)      (0.295, 0.235)
  1500      (2.015, 3.006)    (0.155, 0.133)    (2.016, 3.006)    (0.16, 0.135)
  3000      (2.007, 2.998)    (0.11, 0.095)     (2.009, 3.003)    (0.114, 0.096)
  5000      (2.002, 2.997)    (0.085, 0.074)    (2.003, 3.002)    (0.088, 0.073)
  10^4      (2.0013, 2.9991)  (0.059, 0.052)    (2.001, 2.999)    (0.059, 0.052)

Table 2. Comparison of the combined GMF (based on the Laplace transform and the first moment with separately normalized components) with the MLE. The samples of selected sizes N were simulated from the GPD with parameters (a = 2, σ = 3); for each fixed sample size, 5000 repetitions were made.
            GMF (Laplace moments)         Classical MF
  N         mean            stand.d.        mean             stand.d.
  100       (2.17, 3.05)    (0.96, 0.54)    (2.51, 3.08)¹    (1.64, 0.6)
  200       (2.14, 3.035)   (0.5, 0.38)     (2.15, 3.05)²    (0.53, 0.385)
  500       (2.07, 3.03)    (0.31, 0.24)    (2.06, 3.04)     (0.30, 0.24)

Table 3. Comparison of the combined GMF (based on the Laplace transform) with the classical moment fit. The samples of selected sizes N were simulated from the GPD with parameters (a = 2, σ = 3); for each fixed sample size, 2000 repetitions were made.
The observed bias in the classical moment estimates in the case of small sample sizes (see the results for sample size 100 in Table 3) is possibly a reflection of the effect related to the incorrect comparison of the deviation for different values of $\alpha$, as discussed at the end of Section 3. Besides, the minimization procedure often fails to find the minimum of the statistic in the case of the classical moment fit for small sample sizes (none of the implemented numerical minimization methods, such as the simplex method or the gradient calculation method, works in this case), making it impossible to consider the classical moment fit a reliable method.

Figure 3. The performance of the generalized moment fit in comparison with the MLE for a sample size of 100. Each panel plots the estimated shape parameter (true value 2.0) against the estimated scale parameter (true value 3.0). The comparison between the GMF and the MLE was based on the same simulated samples; a series of 2000 repetitions was made. The upper plots include all the outcomes of the fitting, while the lower plots depict only the points within the domain (1.5, 5.5; 0, 20).
We conclude this paragraph with an additional illustration of the difference between the two estimates, based on different choices of parameters. We choose a domain in the shape–scale ($a$–$\sigma$) parameter space such that the points we select in this domain embrace the realistic range of GPD parameters relevant in actuarial practice. Results of the estimates are indicated in Figure 4.
Figure 4. Estimation of the parameters from simulated data with the MLE and the generalized moment fit (axes: shape vs. scale). For three of the five choices of the parameters indicated on the plot, samples of 75 data points were used; for two points (the two right-most points on the plot) the sample size was 50. Crosses indicate standard deviations, which were estimated by 500 independent simulations at each point.
¹ For the classical moment fit, the results are given after the exclusion of the extremely outlying points, corresponding to the cases when the minimum of $X_\alpha$ was found on the border of the search domain. For the sample size 100, 1.2% of the simulated samples lead to such cases.
² As in the case of sample size 100, extremely outlying points still appear. For the sample size 200, approximately 0.15% of the simulated samples produce cases when the minimum of $X_\alpha$ corresponds to the border of the selected parameter domain.
Applications to real data, considering p-values. As already mentioned, our choice of the distribution for the numerical example was motivated by the special role of the GPD in extreme value theory and its applications (actuarial science in particular). Naturally, as a continuation of the numerical example, we indicate some results of fitting selected samples from historical loss data related to operational risk. The data we used were provided by a particular financial institution; since confidentiality does not allow us to present real values, the absolute values of all the data were scaled by some factors before fitting.
As operational risk data are classified by so-called Business Lines (BLs), we selected some of them for a comparative analysis of the corresponding fits. The BLs of our choice correspond to different segments of the business of the financial institution, and the characteristics of the severity distributions differ accordingly. For each of the selected BLs, we take the exceedances of losses above the threshold $th = 2$ and fit them with the GPD using the MLE and the generalized moment fit based on Laplace transforms.
  BL      Sample    GMF (Laplace moments)          MLE
          size      estimate        X_α (opt.)     estimate         X_α
  BL 1    585       (1.07, 2.75)    −0.56          (1.05, 2.56)     −0.71
  BL 3    170       (1.01, 5.76)    −0.68          (1.015, 5.55)    −0.69
  BL 8    143       (1.00, 4.99)    −0.46          (0.96, 4.82)     −0.70

Table 4. Comparison of the combined GMF (based on the Laplace transform) with the MLE for the operational loss data corresponding to particular BLs of a financial institution.
Thus, for the considered real data, the estimates obtained with the two methods are close; moreover, the values of the statistic $X_\alpha$ corresponding to the parameters obtained via the MLE are still negative, as well as the ones corresponding to the optimal generalized moment fit.
5.3. Model Selection. Modern data collection and storage technology often has the effect that the dimension of the available data is quite high, while it is rather the sample size that remains limited. E.g. balance sheets contain a lot of relevant information about a company, while the number of companies for which balance sheet data are available is often quite small compared to the number of variables provided. On the one hand, we have to cope with high-dimensional data; on the other hand, we cannot expect to have samples large enough for the asymptotic (optimality, efficiency, consistency) properties of standard estimators to approximately hold. One way to overcome this problem are variable selection methods that allow one to reduce the dimension of the data, such that, relative to the reduced variable set, the sample size becomes large enough to justify the use of a particular estimator.
Let us describe how model selection can be done. Assume that we have a (possibly eventually constant) nested sequence $0 = A_0 \subseteq A_1 \subseteq \cdots \subseteq \bigcup_{i\ge 0} A_i = A$, describing increasingly complex models for the true law $P$ of the observed sample $\omega \in \Omega^N$. Our task is to find the smallest model class $i \ge 0$ such that there exists an $\bar\alpha \in A_i$ with corresponding p-value w.r.t. $P_{\bar\alpha}$ of observing $X_{\bar\alpha}$ less than some given confidence level $p_0 := t_0^{-2} \in (0, 1)$. If $\inf_{\bar\alpha\in A_j} X_{\bar\alpha} > t_0$ holds for $j \ge 0$, then we have at least to choose $i > j$. On the other hand, $\inf_{\bar\alpha\in A_j} X_{\bar\alpha} < -t_0$ for some $j \ge 0$ indicates over-fitting for small $p_0$. (Of course, if $X_0 < -t_0$, we are in trouble, and something might be wrong with the data.) But typically, for real-world data, we expect $X_{\bar\alpha} > 0$ for $\bar\alpha \in A_i$ for small $i \ge 0$, and only if the full model $A$ allows us to approximate the empirical law of the observed sample (e.g. if $g$ consists of a complete system of polynomials, orthogonal w.r.t. $P$) will $\inf_{\bar\alpha\in A_j} X_{\bar\alpha}$ eventually become negative as $j$ increases.
The rationale of a goodness-of-fit estimation is to determine, for a $j \ge 0$, an $\alpha \in A_j$ such that $\inf_{\bar\alpha\in A_j} F_{\bar\alpha}(X_{\bar\alpha}) = F_\alpha(X_\alpha)$ holds, i.e. to find the $\alpha \in A_j$ which minimizes the p-value of $X_\alpha$ w.r.t. $P_\alpha$, and to find the smallest $j$ such that this p-value becomes acceptable. As already mentioned, doing this is typically computationally too expensive, and we propose to minimize $X_{\bar\alpha}$ over $A_j$ instead. We cannot expect the $\hat\alpha \in A_j$ with $X_{\hat\alpha} = \inf_{\bar\alpha\in A_j} X_{\bar\alpha}$ to equal $\alpha$ exactly. (This is actually similar to the situation for maximum likelihood estimation.) However, Proposition 4.1 tells us that, for large enough $N$, it becomes increasingly unlikely that $\hat\alpha$ is far away from $\alpha$. Furthermore, for $X_{\hat\alpha} \le t_0$ there is no guarantee that the p-value of $X_{\hat\alpha}$ w.r.t. $P_{\hat\alpha}$ is acceptable; we only know that it is not acceptable for $X_{\hat\alpha} > t_0$. (The calculation of the p-value of $X_{\hat\alpha}$ w.r.t. $P_{\hat\alpha}$ can be done by importance-sampling Monte Carlo techniques, using the weight $\frac{dP_{\hat\alpha}}{dP}$, provided sampling under the measure $P$ can be done efficiently.) In any case, $\hat\alpha$ can serve as a starting point for a (e.g. stochastic) search for the optimal $\alpha$ in the region $\{\bar\alpha \in A_j \mid X_{\bar\alpha} \le t_0\}$.
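The selection rule just described can be sketched schematically; the minimized statistics per model class below are hypothetical placeholders:

```python
def select_model(min_X_by_class, t0):
    """Return the smallest model-class index j whose minimized X falls below t0,
    flagging over-fitting when the minimized statistic drops below -t0."""
    for j, x in enumerate(min_X_by_class):
        if x <= t0:
            return j, (x < -t0)   # (selected class, over-fitting flag)
    return None, False            # no class acceptable at this level

# hypothetical minimized statistics for nested classes A_0 ⊆ A_1 ⊆ A_2 ⊆ A_3
mins = [37.2, 12.4, 3.1, -0.4]
print(select_model(mins, t0=5.0))  # -> (2, False)
```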
6. Conclusion and outlook
In the present paper, we approached the classical problem of point estimation (and, jointly, model selection). We presented a new insight into this problem by considering a special type of optimized statistic based on generalized moments. We have shown that the introduction of the centralized and normalized statistic into the procedure of the moment fit allows one to achieve useful asymptotic properties of the resulting estimate.
Application of the proposed method to statistical data shows that its properties are particularly useful when dealing with relatively small sample sizes. Numerical examples based on the generalized Pareto distribution show that the generalized moment fit outperforms maximum likelihood estimation in terms of both estimation precision and stability.
Further steps in the investigation of the considered method will include applications to high-dimensional data, e.g. data related to multivariate financial processes. Our numerical results indicate that the generalized moment fit can be particularly useful in application to heavy-tailed data, which is also a motivation for future research.
References
Akaike, H. (1973): Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov, B.N. and F. Csaki,
(Eds.), 2nd International Symposium on Information Theory, pp. 267–281, reprinted in: Kotz and Johnson (1992),
pp. 610–624.
Akaike, H. (1978): A Bayesian Analysis of the Minimum AIC Procedure. Ann. Inst. Statist. Math. Tokyo 30A, 9–14.
Anh, V.V., N.N. Leonenko and L.M. Sakhno (2007): Minimum contrast estimation of random processes based on information
of second and third orders. Journal of statistical planning and inference 137, 1302–1331.
Basharin, G.P. (1959): On a Statistical Estimate for the Entropy of a Sequence of Independent Random Variables. Theory of
Probability and its Applications 4, 333–336. Translated from: Teoriya Veroyatnostei i ee Primeneniya 4, p. 361, 1959.
Campbell, L.L. (1970): Equivalence of Gauss’s Principle and Minimum Discrimination Information Estimation of Probabilities.
The Annals of Mathematical Statistics 41(3), 1011–1015.
Castillo, E. and A.S. Hadi (1997): Fitting the Generalized Pareto Distribution to Data. Journal of the American Statistical
Association 92(440), 1609–1620.
Choulakian, V. and M.A. Stephens (2001): Goodness-of-fit tests for the Generalized Pareto Distribution. Technometrics 43(4),
478–484.
Claeskens, G. and N.L. Hjort (2008): Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic
Mathematics, Vol. 26. Cambridge: University Press.
Clough, D.J. (1964): Application of the Principle of Maximizing Entropy in the formulation of Hypotheses. Canadian
Operational Research Society (CORS) Journal 2(2), 53–70.
Crain, B.R. (1973): A Note on Density Estimation Using Orthogonal Expansions. Journal of the American Statistical
Association 68, 964–965.
Cramér, H. (1946): Mathematical Methods of Statistics. Princeton: University Press.
deLeeuw, J. (1992): Introduction to Akaike (1973) Information Theory and an Extension of the Maximum Likelihood Principle.
In: Kotz and Johnson (1992), pp. 600–609.
Embrechts, P., C. Klüppelberg and T. Mikosch (1997): Modelling extremal events. Applications of Mathematics, Volume 33.
Berlin: Springer.
Good, I.J. (1963): Maximum Entropy for Hypothesis Formulation, especially for multidimensional contingency tables. Annals
of Mathematical Statistics 34(3), 911–934.
Grendar, M. (jr.) and M. Grendar (2001): Maximum Entropy: Clearing up Mysteries. New York: Marcel Dekker.
Guyon, X. (1995): Random fields on a network. Berlin: Springer.
Hansen, L.P. (1982): Large sample properties of generalized method of moments estimators. Econometrica 50(4), 1029–1054.
Hurvich, C.M. and C.-L. Tsai (1989): Regression and time series model selection in small samples. Biometrika 76(2), 297–307.
Hurvich, C.M., R. Shumway and C.-L. Tsai (1990): Improved estimators of Kullback-Leibler information for autoregressive
model selection in small samples. Biometrika 77(4), 709–719.
Jaynes, E.T. (1957): Information Theory and Statistical Mechanics. The Physical Review 106(4), 620–630.
Jaynes, E.T. (1979): Where do we stand on Maximum Entropy? In: Levine, R.D. and M. Tribus (Eds.), The Maximum
Entropy Formalism, pp. 210–314, (reprinted in Jaynes (1981)). Cambridge: MIT Press.
Jaynes, E.T. (1981): Papers on Probability, Statistics and Statistical Physics. Rosenkrantz, R. (Ed.), Dordrecht: Reidel.
Jaynes, E.T. (1982): On the Rationale of Maximum-Entropy Methods. Proceedings of the IEEE 70(9), 939–952.
Jaynes, E.T. (2003): Probability Theory: The Logic of Science. Cambridge: University Press.
Johnson, N.L. and N. Balakrishnan (1997): Advances in the theory and practice of statistics: a volume in honor of Samuel
Kotz. Wiley series in probability and statistics. New York: Wiley.
Kapur, J.N. and H.K. Kesavan (1992): Entropy Optimization Principles with Applications. San Diego: Academic Press.
Kapur, J.N. (1994): Measures of Information and their Applications. New York: Wiley.
Knight, J.L. and S.E. Satchell (1997): The cumulant generating function estimation method. Econometric Theory 13, 170–184.
Konishi, S. and G. Kitagawa (1996): Generalised information criteria in model selection. Biometrika 83(4), 875–890.
Kotz, S. and N.L. Johnson, (Eds.) (1992): Breakthroughs in Statistics, Vol. I. New York: Springer.
Kronmal, R. and M. Tarter (1968): The Estimation of Probability Densities and Cumulatives by Fourier Series Methods.
Journal of the American Statistical Association 63, 925–952.
Kullback, S. (1968): Information Theory and Statistics. New York: Dover Publications.
Luceño, A. (2006): Fitting the generalized Pareto distribution to data using maximum goodness-of-fit estimators. Computational Statistics & Data Analysis 51, 904–917.
Panjer, H.H. (2006): Operational Risk – Modeling Analytics. Hoboken, New Jersey: Wiley.
Pfanzagl, J. (1994): Parametric Statistical Theory. Berlin: de Gruyter.
Schwarz, G. (1978): Estimating the Dimension of a Model. The Annals of Statistics 6(2), 461–464.
Smith, P.L. (1987): On Akaike’s Model-Fit Criterion. IEEE Transactions on Automatic Control AC-32(12), 1100–1101.
Uffink, J. (1995): Can the maximum entropy principle be explained as a consistency requirement? Studies in History and
Philosophy of Modern Physics 26B(3), 223–261.
Uffink, J. (1996): The Constraint Rule of the Maximum Entropy Principle. Studies in History and Philosophy of Modern Physics 27B(1), 47–79.
Wragg, A. and D.C. Dowson (1970): Fitting Continuous Probability Density Functions over [0,∞) using Information Theory
Ideas. IEEE Transactions on Information Theory IT-16, 226–230.