
EXTREME VALUES, INVARIANCE AND SELECTION PROBABILITIES

Lars-Göran Mattsson∗ and Jörgen W. Weibull

December 31, 2010.

Abstract. Consider a finite collection of random variables, each representing some relevant property of an alternative or component, such as its utility, cost, material strength or reliability. Suppose that there is a process that selects exactly one of these alternatives, an alternative with an extreme (maximal or minimal) value. Among practitioners, it is widely believed that in order to analyze such situations, one has to resort to particular parametric forms of the underlying probability distributions. We here show that this is not needed. Indeed, parametric forms impose unnecessary theoretical and empirical constraints. Moreover, in the case of statistical independence, we provide necessary and sufficient conditions for all random variables, representing the relevant property of the alternatives, to have the same conditional distribution when selected. The distribution associated with an alternative, conditional upon being selected, is thus invariant across alternatives.

Keywords: Extreme values, discrete choice, selection probabilities, random utility, invariance, reliability.

1. Introduction

Consider a finite collection of random variables, each representing some relevant property of an alternative, such as the utility or cost associated with a choice alternative to a consumer or producer, or the material strength or reliability of components of a machine or structure. Suppose that there is a process that selects exactly one of these alternatives/components, and that this process is based either on maximization (of utility) or minimization (of cost or strength). In economics and transportation research, the situation can be that of a consumer, firm, or commuter, who either strives to maximize his or her utility or profit, or to minimize some cost. In materials science, the situation can be that of a machine or structure that consists of several components, where each component may have its particular characteristics, and where the machine or structure fails as soon as one component fails. We call the probability for a given alternative/component to be selected its selection probability.

∗The authors thank Professor P.O. Lindberg for helpful comments on an earlier draft.

This paper is focused on the task of analyzing a class of such situations in which it is desirable to have a mathematical model of the situation at hand that provides selection probabilities that (D1) can easily be expressed in terms of observable attributes of the alternatives/components and permit closed-form maximum-likelihood estimation of underlying parameters, (D2) can be derived from an explicit maximization or minimization procedure that can be anchored in existing theory (microeconomics, random utility theory, reliability theory), and (D3) are analytically well-behaved for predictions of the effects of changes in relevant attributes of the alternatives/components, or in the number of alternatives/components, effects both on selection probabilities and on the achieved (maximum or minimum) value distributions (of utilities, costs, or reliability).

Among practitioners it appears to be widely believed that in order to achieve these desiderata, one has to resort to particular parametric forms of the underlying probability distributions, notably the Gumbel (or doubly exponential) distribution in the case of maximization and the Weibull distribution in the case of minimization.¹ In recent years other parametric forms have also been suggested and in some cases used, see e.g. Li (2010). We here place this literature within a unified framework and establish general results. In particular, we show that parametric representations are not needed. Indeed, they impose unnecessary constraints, both theoretically and empirically. Moreover, in the case of statistical independence, we provide necessary and sufficient conditions for an invariance property: that the conditional random variables associated with the different alternatives all have the same probability distribution, conditional upon being selected.

To the best of our knowledge, the works closest to ours are Li (2010) and Fosgerau, McFadden and Bierlaire (2010). Li (2010) develops an approach similar to ours and establishes some of our results for the special case of minimization and statistical independence. Fosgerau et al. (2010) assume additivity and obtain a characterization of selection probabilities in terms of partial derivatives of a value function, in the same spirit as demand is related to partial derivatives of indirect utility functions in economics (Roy's identity). By contrast, we do not assume additivity and instead obtain a representation in terms of partial derivatives of an aggregation function, which is a value function only in special cases. Moreover, Fosgerau et al. do not analyze the invariance property, for which we, as mentioned above, obtain a characterization.

¹In economics, McFadden (1974) provides a theoretical underpinning for the use of the Gumbel distribution in the case of utility maximization, and Castillo et al. (2008) and Fosgerau and Bierlaire (2009) provide theoretical underpinnings for the use of the Weibull distribution in the case of cost minimization.

The presentation is organized as follows. The modelling framework is developed in Section 2. The main result for maximization in the general case of statistical dependence, Theorem 1, as well as the invariance result in the case of statistical independence, Theorem 2, are given and proved in Section 3, along with their counterparts for minimization, Theorems 3 and 4. Section 4 treats the special case of statistical independence among the alternatives and shows how established results for well-known parametric families of probability distributions (Fréchet, Gumbel, Pareto, Weibull) follow from these theorems. Section 5 considers how the present approach can be applied to situations with statistical dependence among the alternatives, while Section 6 briefly discusses maximum-likelihood estimation issues. Section 7 concludes the discussion, and the Appendix at the end provides some technicalities.

2. The model

Let $\mathbb{N}$ be the positive integers, let $\mathbb{R}$ be the reals and $\mathbb{R}_+$ the positive reals. Let $\mathcal{P} = (\Omega, \mathcal{S}, \mu)$ be a probability space, where $\Omega$ is a sample space, $\mathcal{S}$ a sigma-algebra on $\Omega$ and $\mu$ a probability measure on $\mathcal{S}$. For any $n \in \mathbb{N}$, let $X_1, \dots, X_n$ be random variables in $\mathcal{P} = (\Omega, \mathcal{S}, \mu)$, where each $X_i : \Omega \to T$, for some open interval $T \subset \mathbb{R}$, is $\mathcal{S}$-measurable. Let $\mathbf{X} : \Omega \to T^n$ be defined by $\mathbf{X}(\omega) = (X_1(\omega), \dots, X_n(\omega))$. The random variable $X_i$ represents some relevant characteristic of alternative $i \in I = \{1, \dots, n\}$. Given the random vector $\mathbf{X}$, let $\tilde{X} : \Omega \to T$ be the random variable defined by

$$\tilde{X}(\omega) = \max_{i \in I} X_i(\omega) \quad \forall \omega \in \Omega.$$

By the selection probability for any alternative $i \in I$, under maximization, we simply mean

$$P_i = \Pr\left[i \in \arg\max_{j \in I} X_j\right] = \Pr\left[X_i = \tilde{X}\right].$$

For any $i \in I$, let

$$\Omega_i = \left\{\omega \in \Omega \mid X_i(\omega) = \tilde{X}(\omega)\right\}$$

and define the conditional random variable $\tilde{X}_i : \Omega_i \to T$ by $\tilde{X}_i(\omega) = X_i(\omega)$ $\forall \omega \in \Omega_i$. Granted $\mu(\Omega_i) > 0$, this defines $\tilde{X}_i$ as a random variable in the probability space $\mathcal{P}_i = (\Omega_i, \mathcal{S}_i, \mu_i)$, where $\mathcal{S}_i \subset \mathcal{S}$ is the restriction of the sigma-algebra $\mathcal{S}$ to $\Omega_i \subset \Omega$, and $\mu_i$ is the restriction of the probability measure $\mu$ to $\Omega_i$, obtained by Bayes' law: $\mu_i(A) = \mu(A \cap \Omega_i)/\mu(\Omega_i)$ $\forall A \in \mathcal{S}_i$.

Let $\mathcal{F}_n$ be the class of cumulative probability distribution functions (c.d.f:s) $F : T^n \to [0, 1]$, where $F$ is absolutely continuous with everywhere positive density $f = F'$. Hence, $F$ is strictly increasing on its domain $T^n$.² A function $G : \mathbb{R}_+^n \to \mathbb{R}_+$ is homogeneous of degree $k$ if $G(\lambda y) = \lambda^k G(y)$ for all $y \in \mathbb{R}_+^n$ and scalars $\lambda > 0$.³ Let $\mathcal{G}_n \subset \mathcal{F}_n$ be those c.d.f:s $F \in \mathcal{F}_n$ for which there exist a c.d.f. $\Phi \in \mathcal{F}_1$, a positive parameter vector $\alpha = (\alpha_1, \dots, \alpha_n)$, and a continuously differentiable and linearly homogeneous (homogeneous of degree 1) function $G : \mathbb{R}_+^n \to \mathbb{R}_+$ such that

$$F(x) = e^{-G[-\alpha_1 \ln \Phi(x_1),\, \dots,\, -\alpha_n \ln \Phi(x_n)]} \quad \forall x \in T^n. \tag{1}$$

Necessary and sufficient conditions for such a function $F$ to be a c.d.f. can be derived from a slight generalization of a result in Smith (1984). Let $G'_{i_1 \dots i_k}$, for $k \le n$, denote the $k$'th strictly mixed partial derivative of $G$ with respect to the arguments $y_{i_1}, \dots, y_{i_k}$:

Proposition 1. Let $G : \mathbb{R}_+^n \to \mathbb{R}_+$ be continuously differentiable and linearly homogeneous, $\Phi \in \mathcal{F}_1$ and $\alpha \in \mathbb{R}_+^n$. The function $F : T^n \to [0, 1]$ defined in equation (1) belongs to $\mathcal{G}_n$ if and only if (i) and (ii) hold, where

(i) $\lim_{t \to \infty} G(t e_i) = +\infty$ for each $i \in I$,

(ii) $(-1)^{k-1} G'_{i_1 \dots i_k} \ge 0$ for any $i_1 < \dots < i_k \in I$ such that $2 \le k \le n$.

(See Appendix for a proof.)

We will call $\Phi$ the seed c.d.f., $\alpha$ the vector of shape parameters and $G$ the aggregation function. In typical applications, the shape parameters are assumed to be functions, according to some parametric form (usually linear), of observable attributes of the alternatives at hand, while the aggregation function is assigned a parametric form intended to represent potential statistical dependence between the random variables $X_1, \dots, X_n$ (see Section 5).

We finally recall that a function $\Phi \in \mathcal{F}_1$ defined on an open interval $T \subset \mathbb{R}$ is a cumulative probability distribution function if and only if it is non-decreasing and right-continuous with $\lim_{t \to \inf T} \Phi(t) = 0$ and $\lim_{t \to \sup T} \Phi(t) = 1$. By definition, $\Phi$ has everywhere positive density and is thus strictly increasing on its domain. Obviously, if $\Phi$ is a c.d.f. then so is $\Phi^{\beta}$, for any positive scalar $\beta$.⁴ Moreover, let $e_i$ denote the $i$'th unit $n$-vector (all components zero except for the $i$'th, which is equal to 1). Then the $i$'th marginal distribution is $\lim_{x_j \to \sup T \ \forall j \ne i} F(x_1, \dots, x_n) = \Phi(x_i)^{\alpha_i G(e_i)}$. Defining $\hat{G}(y) = G(\gamma_1 y_1, \dots, \gamma_n y_n)$, with $\gamma_i = 1/G(e_i)$, one has $\hat{G}(e_i) = 1$ $\forall i \in I$. Hence, the rescaled c.d.f. $\hat{F}$, defined by $\hat{F}(x) = e^{-\hat{G}[-\alpha_1 \ln \Phi(x_1), \dots, -\alpha_n \ln \Phi(x_n)]}$ $\forall x \in T^n$, belongs to $\mathcal{G}_n$ and its $i$'th marginal distribution is $\Phi^{\alpha_i}$, for each $i$.

²More exactly, if $x_i \le y_i$ for all $i \in I$ and $x_i < y_i$ for some $i \in I$, then $F(x) < F(y)$.

³If $G : \mathbb{R}_+^n \to \mathbb{R}_+$ is homogeneous of degree $k \in \mathbb{N}$ and continuously differentiable, then its first-order partial derivatives $G'_i$ are all homogeneous of degree $k - 1$.

⁴Here $\Phi^{\beta} : T \to [0, 1]$ is defined by $\Phi^{\beta}(t) = [\Phi(t)]^{\beta}$ $\forall t \in T$.

3. The main results

We are now in a position to state and prove our main result for the maximization case:

Theorem 1. If $\mathbf{X} : \Omega \to T^n$ has a c.d.f. $F \in \mathcal{G}_n$ for some $\Phi$, $\alpha$ and $G$, then

$$\Pr\left[i \in \arg\max_j X_j\right] = \frac{\alpha_i \, G'_i(\alpha)}{G(\alpha)} \quad \forall i \in I \tag{2}$$

and all random variables $\tilde{X}, \tilde{X}_1, \dots, \tilde{X}_n$ have the c.d.f. $F^*$ given by

$$F^*(t) = [\Phi(t)]^{G(\alpha)} \quad \forall t \in T. \tag{3}$$

Proof:⁵ We first derive the c.d.f. $F^*$ for $\tilde{X}$. For any $t \in T$:

$$F^*(t) = \Pr\left[\tilde{X} \le t\right] = F(t, \dots, t) = e^{-G[-\alpha_1 \ln \Phi(t), \dots, -\alpha_n \ln \Phi(t)]} = e^{G(\alpha) \ln \Phi(t)} = [\Phi(t)]^{G(\alpha)},$$

where the fourth equality uses the linear homogeneity of $G$.

Let $\varphi$ and $f^*$ denote the probability density functions of $\Phi$ and $F^*$. Let $P_i = \Pr[i \in \arg\max_j X_j]$. Using the fact that $F^* = \Phi^{G(\alpha)}$ is the c.d.f. of $\tilde{X}$ with density $f^* = G(\alpha)\,\varphi\,\Phi^{G(\alpha)-1}$, we have, for any $t \in T$:

$$\Pr\left[\tilde{X}_i \le t\right] = \Pr\left[X_i \le t \mid X_i \ge X_j \ \forall j \in I\right] = \frac{\Pr\left[X_i \le t \wedge X_i \ge X_j \ \forall j \in I\right]}{\Pr\left[X_i \ge X_j \ \forall j \in I\right]}$$

$$= \frac{1}{P_i} \int_{x \in T:\, x \le t} \Pr\left[X_i \in (x, x + dx) \wedge X_j \le x \ \forall j \in I\right]$$

$$= \frac{1}{P_i} \int_{x \in T:\, x \le t} F'_i(x, \dots, x)\, dx$$

$$= \frac{1}{P_i} \int_{x \in T:\, x \le t} \alpha_i \frac{\varphi(x)}{\Phi(x)} \cdot G'_i\left[-\alpha_1 \ln \Phi(x), \dots, -\alpha_n \ln \Phi(x)\right] \cdot e^{-G[-\alpha_1 \ln \Phi(x), \dots, -\alpha_n \ln \Phi(x)]}\, dx$$

$$= \frac{\alpha_i \, G'_i(\alpha)}{P_i \, G(\alpha)} \int_{x \in T:\, x \le t} G(\alpha)\, \varphi(x)\, [\Phi(x)]^{G(\alpha)-1}\, dx$$

$$= \frac{\alpha_i \, G'_i(\alpha)}{P_i \, G(\alpha)} \int_{x \in T:\, x \le t} f^*(x)\, dx = \frac{\alpha_i \, G'_i(\alpha)}{P_i \, G(\alpha)} \cdot F^*(t).$$

The sixth equality uses the homogeneity of degree one of $G$ and of degree zero of $G'_i$. For $t \to \sup T$, $\Pr[\tilde{X}_i \le t] = F^*(t) = 1$. Hence,

$$P_i = \frac{\alpha_i \, G'_i(\alpha)}{G(\alpha)}.$$

Thus, for any $t \in T$: $\Pr[\tilde{X}_i \le t] = F^*(t)$. Q.E.D.

⁵We thank P.O. Lindberg for suggestions that shortened our proof.

While selection probabilities in general need not sum to one, they actually do sum to one in our setting, with an absolutely continuous c.d.f., since then the probability of a tie is zero. We also note that the selection probabilities do not depend on the choice of seed function $\Phi$; they only depend on the aggregation function $G$ and the shape-parameter vector $\alpha$. By contrast, the distribution function $F^*$ for the extreme (maximum) value does depend on the seed function.

We will refer to the fact that all conditional random variables $\tilde{X}_1, \dots, \tilde{X}_n$ have the same probability distribution as the invariance property. An outside observer who, for some given alternative $i \in I$, registers the realizations of $X_i$ whenever that alternative is selected in a repeated experiment, will find the same empirical conditional distribution as an outside observer who does the same for any other alternative $j$. The observed empirical conditional distribution is invariant across the set of alternatives.

In general, if $X_1, \dots, X_n$ have c.d.f:s $F_i \in \mathcal{F}_1$, then

$$F^*(t) = \sum_{i=1}^{n} P_i \, \tilde{F}_i^*(t), \tag{4}$$

where $P_i$ is the selection probability for each alternative $i$. To see this, note that

$$F^*(t) = \Pr\left[\tilde{X} \le t\right] = \sum_i \Pr\left[X_i \le t \wedge X_i \ge X_j \ \forall j\right] = \sum_i \Pr\left[X_i \ge X_j \ \forall j\right] \cdot \Pr\left[X_i \le t \mid X_i \ge X_j \ \forall j\right] = \sum_i P_i \, \tilde{F}_i^*(t),$$

where $\tilde{F}_i^*$ is the c.d.f. of $\tilde{X}_i$. In the special case of i.i.d. random variables, this observation immediately implies that each $\tilde{X}_i$ has the same probability distribution as $\tilde{X}$, since all selection probabilities and conditional distributions are then the same. The above theorem provides more general conditions for this invariance property to hold.

That invariance holds for independent Gumbel distributed random variables (see Section 4.2) was shown already by Strauss (1979) and further elaborated by Robertson and Strauss (1981). That the invariance property holds for the expected values of independent Gumbel distributed random variables was rediscovered by Anas and Feng (1988). Lindberg et al. (1995) provided a complete proof of the claims in Strauss (1979) and Robertson and Strauss (1981) that the invariance property in fact characterizes additive random utility models based on a slight generalization of multivariate extreme value distributions. More recently, Train and Wilson (2008) re-derived the c.d.f:s for the conditional random variables $\tilde{X}_i$, and de Palma and Kilani (2007) proved that the invariance property characterizes such additive random utility models when the random variables are independent Gumbel.

3.1. Statistical independence. In many applications, it is natural to assume statistical independence, a situation to which our next main result applies. Let thus $X_1, \dots, X_n$ be statistically independent random variables with marginal c.d.f:s $F_i(t) = [\Phi(t)]^{\alpha_i}$. The c.d.f. of $\mathbf{X} = (X_1, \dots, X_n)$ can then be expressed in product form as

$$F(x) = \prod_{i=1}^{n} F_i(x_i) = e^{\alpha_1 \ln \Phi(x_1) + \dots + \alpha_n \ln \Phi(x_n)}.$$

In other words, the aggregation function $G$ is now simply summation:

$$G(y) = y_1 + \dots + y_n \quad \forall y \in \mathbb{R}_+^n. \tag{5}$$

From Theorem 1 we immediately obtain

$$\Pr\left[i \in \arg\max_j X_j\right] = \frac{\alpha_i}{\alpha_1 + \dots + \alpha_n} \quad \forall i \in I \tag{6}$$

and all random variables $\tilde{X}, \tilde{X}_1, \dots, \tilde{X}_n$ have the same c.d.f. $F^*$, given by

$$F^*(t) = [\Phi(t)]^{\alpha_1 + \dots + \alpha_n} \quad \forall t \in T. \tag{7}$$
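A minimal simulation sketch of equations (6) and (7), under assumptions of our own choosing: a standard logistic seed $\Phi$ and shape vector $\alpha = (1, 2, 3)$. Draws from $\Phi^{\alpha_i}$ are obtained by inverse transform, $X_i = \Phi^{-1}(U^{1/\alpha_i})$ with $U$ uniform on $(0,1)$. The estimated selection frequencies should approach $\alpha_i/\sum_j \alpha_j$, and the winning value should have the same empirical distribution regardless of which alternative won (the invariance property).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])
R = 200_000  # number of replications

# Seed c.d.f.: standard logistic, Phi(t) = 1/(1+exp(-t)), so Phi^{-1}(p) = log(p/(1-p)).
# X_i ~ Phi^{alpha_i} is drawn by inverse transform: X_i = Phi^{-1}(U^(1/alpha_i)).
U = rng.uniform(size=(R, alpha.size))
P = U ** (1.0 / alpha)                 # Phi(X_i)
X = np.log(P / (1.0 - P))              # invert the logistic seed c.d.f.

winner = X.argmax(axis=1)              # selected alternative in each replication
x_max = X.max(axis=1)                  # achieved maximum value

print("selection frequencies:", np.bincount(winner) / R)
print("theory (eq. 6):       ", alpha / alpha.sum())

# Invariance: the winning value has the same distribution whichever alternative won.
for i in range(alpha.size):
    print(f"mean of max | winner = {i}:", x_max[winner == i].mean().round(3))
```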

It turns out that, in this context of statistical independence, the invariance property in fact characterizes the subclass $\mathcal{G}_n$ of probability distributions defined in Section 2. More exactly, if the random variables $X_1, \dots, X_n$ are statistically independent, then they have the invariance property if and only if each random variable $X_i$ has a c.d.f. of the form $\Phi^{\alpha_i}$ for some common seed c.d.f. $\Phi$ and positive shape parameters $\alpha_i$. This result can be obtained by an application of Lemma 3.2 in Resnick and Roy (1990b) and it is, as mentioned above, a kind of generalization of two results for additive random-utility models, due to Lindberg, Eriksson and Mattsson (1995) and de Palma and Kilani (2007), respectively.

Theorem 2. Let $X_1, \dots, X_n$ be statistically independent with c.d.f:s $F_i \in \mathcal{F}_1$ $\forall i \in I$. All conditional random variables $\tilde{X}_i$ have the same probability distribution if and only if there is some $\Phi \in \mathcal{F}_1$ and $\alpha \in \mathbb{R}_+^n$ such that $F_i = \Phi^{\alpha_i}$ $\forall i \in I$.

Proof: The "if" claim follows directly from Theorem 1.

In order to prove the "only if" claim, first consider the case $n = 2$. Suppose, thus, that $X_1$ and $X_2$ are statistically independent with c.d.f:s $F_1, F_2 \in \mathcal{F}_1$, and suppose that $\tilde{X}_1$ and $\tilde{X}_2$ have the same c.d.f. In force of equation (4) we then have $\tilde{F}_1^* = \tilde{F}_2^* = F^* = F_1 F_2$. For any $t \in T$:

$$\Pr\left[X_1 \le X_2 \le t\right] = \Pr\left[X_1 \le X_2\right] \cdot \Pr\left[X_2 \le t \mid X_1 \le X_2\right] = \Pr\left[X_1 \le X_2\right] \cdot \tilde{F}_2^*(t) = P_2 \cdot F_1(t)\,F_2(t),$$

where $P_2 = \Pr[X_1 \le X_2] \in (0, 1)$. Hence, by Lemma 3.2 in Resnick and Roy (1990b): $F_1 = F_2^{(1-P_2)/P_2}$ (see Appendix for a proof of this result adapted to our setting).

For an arbitrary $n > 2$, suppose that $X_1, \dots, X_n$ are statistically independent with c.d.f:s $F_i \in \mathcal{F}_1$, and assume that $\tilde{X}_1, \dots, \tilde{X}_n$ all have the same c.d.f. Then $\tilde{F}_1^* = \dots = \tilde{F}_n^* = F^* = \prod_{i=1}^{n} F_i$ (again by (4)). Consider an arbitrary $i \in \{1, \dots, n\}$. We clearly have $\tilde{X} = \max\{X_i, Y_i\}$, where $Y_i = \max_{j \in I, j \ne i} X_j$, and the c.d.f. of $Y_i$ is $H_i = \prod_{j \in I, j \ne i} F_j$. Since the subset $\Omega_i^1 \subset \Omega$ where $X_i \ge Y_i$ is the same as the subset $\Omega_i \subset \Omega$ where $X_i \ge X_j$ for all $j \in I$, we conclude that the conditional random variable $\hat{X}_i$, defined on $\Omega_i^1$ by $\hat{X}_i(\omega) = X_i(\omega)$ $\forall \omega \in \Omega_i^1$, is identical with the conditional random variable $\tilde{X}_i$, and hence the c.d.f. of $\hat{X}_i$ is also $\tilde{F}_i^* = F^*$. Let the conditional random variable $\hat{Y}_i$ be defined on $\Omega \setminus \Omega_i^1$ by $\hat{Y}_i(\omega) = Y_i(\omega)$, and let $\hat{H}_i^*$ denote its c.d.f. Applying equation (4) again, we obtain, with $P_i = \Pr(X_i \ge Y_i)$, the equation $F^* = (1 - P_i)\hat{H}_i^* + P_i \tilde{F}_i^*$, and hence $\hat{H}_i^* = F^*$. From the above result for $n = 2$ we thus obtain

$$F_i^{(1-P_i)/P_i} = H_i = \prod_{j \in I, j \ne i} F_j.$$

Multiplying both sides by $F_i$: $F_i^{1/P_i} = \prod_{j \in I} F_j = F^*$, or $F_i = (F^*)^{P_i}$. Since this holds for any $i \in \{1, \dots, n\}$, the claim has been established, with $\Phi = F^* \in \mathcal{F}_1$ and $\alpha_i = P_i > 0$. Q.E.D.

In this case of statistical independence, equation (4) provides an intuition for the invariance property. Consider two statistically independent random variables, $X_1 \sim \Phi$ and $X_2 \sim \Phi^{\alpha}$ for $\alpha = 2$, and let $\tilde{X}_1$ and $\tilde{X}_2$ be the associated conditional random variables. By (7), $X_2$ has the same probability distribution as $\max\{X_{21}, X_{22}\}$, where $X_{21}$ and $X_{22}$ are i.i.d. $\Phi$. These two new random variables can be taken to be statistically independent of $X_1$. Thus $\tilde{X}$ has the same probability distribution as the maximum, $\tilde{Z}$, of three i.i.d. $\Phi$ random variables, $X_1$, $X_{21}$ and $X_{22}$. By equation (4), and the fact that the three constituent random variables are i.i.d., the corresponding conditional random variables, $\tilde{Z}_1$, $\tilde{Z}_{21}$ and $\tilde{Z}_{22}$, must all have the same distribution as $\tilde{Z}$. Moreover, $\tilde{X}_1$ has the same distribution as $\tilde{Z}_1$, so $\tilde{X}_1 \sim \tilde{Z}$. Again by equation (4), $\tilde{X}_2 \sim \tilde{Z}$. In other words, when the shape parameters $\alpha_i$ are integers, we can replace each random variable $X_i$ by $\alpha_i$ i.i.d. random variables, and the invariance property follows from equations (7) and (4). The same argument applies when the shape parameters are rational numbers: let $m \in \mathbb{N}$ be their least common denominator, so that $\alpha_i = k_i/m$ where each $k_i$ is an integer. Then each random variable $X_i$ can be thought of as the maximum of $k_i$ i.i.d. random variables, etc.

The invariance obtained in Theorem 2 does not necessarily hold if the hypothesis is relaxed. This is illustrated in the following example.

Example 1. Let $X_1 \sim F_1 = N(0, 1)$ and $X_2 \sim F_2 = N(\mu, 1)$ be statistically independent. For $\mu \ne 0$, the hypothesis of Theorem 2 is violated: $F_2 \ne F_1^{\alpha}$ for all $\alpha > 0$. Let $\tilde{X} = \max\{X_1, X_2\}$ and define the conditional random variable $\tilde{X}_i$, for $i = 1, 2$, as above. The densities of $\tilde{X}$ (thick curve) and $\tilde{X}_1$ (dashed) are shown in the diagram below, for $\mu = 1$. (The thin curves are the densities of $X_1$ and $X_2$.)

Figure 1: Example of failure of the invariance property.
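A small simulation sketch of Example 1 (our own illustration, not from the paper): with $X_1 \sim N(0,1)$ and $X_2 \sim N(1,1)$ independent, the conditional distributions of $\tilde{X}_1$ and $\tilde{X}_2$ differ, e.g. their means do not coincide, in contrast with the invariance property of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 500_000
x1 = rng.normal(0.0, 1.0, R)
x2 = rng.normal(1.0, 1.0, R)   # mu = 1 violates the hypothesis of Theorem 2

win1 = x1 >= x2
print("P_1 =", win1.mean())                       # well below 1/2, since alternative 2 is shifted up
print("E[X1 | 1 selected] =", x1[win1].mean())    # conditional winner distributions...
print("E[X2 | 2 selected] =", x2[~win1].mean())   # ...differ: invariance fails here
```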

The invariance principle may be illustrated by means of a simple example.⁶ Let, thus, $X_1$ and $X_2$ be statistically independent, with c.d.f:s $\Phi^{\alpha_1}$ and $\Phi^{\alpha_2}$, where $\alpha_i = g(c_i)$ for some strictly decreasing function $g$. The two random variables represent the utility of two alternatives to consumers in a population, and $c_i$ is some aspect of alternative $i$ that contributes to its disutility for all consumers (say "price" or "travel time"). Each consumer $\omega \in \Omega$ selects the alternative with the highest utility for him or her, and thus achieves utility $\tilde{X}(\omega) = \max\{X_1(\omega), X_2(\omega)\}$. The population average utility is thus $\mathrm{E}[\tilde{X}]$. According to the invariance principle, the utility distribution for those consumers who choose alternative 1 is identical with that for those who choose alternative 2. (In particular: $\mathrm{E}[\tilde{X}_1] = \mathrm{E}[\tilde{X}_2] = \mathrm{E}[\tilde{X}]$.)

Suppose now that alternative 1 is improved, by way of a reduction of $c_1$. What will happen? Some consumers may abandon alternative 2 and now turn to the improved alternative 1 instead. However, according to the invariance principle, the two new conditional utility distributions, for those who now choose alternative 1 and 2, respectively, will still be identical. Moreover, the average utility in the population has gone up. So have those who still choose alternative 2 gained from the improvement of alternative 1? Clearly not. The explanation is that fewer consumers now choose alternative 2, and these are the consumers who appreciate alternative 2 the most, those who still prefer it over alternative 1 even after the improvement of alternative 1. By contrast, those who initially were only a little better off with alternative 2 have now switched to alternative 1. Hence, although the average utility for those who still choose alternative 2 has gone up, the utility to each such consumer is unchanged.

⁶This example is discussed more thoroughly in Anas and Feng (1988), Resnick and Roy (1990a), and Lindberg et al. (1995).

3.2. Minimization. In many situations, it is minimization, rather than maximization, that underlies the selection process. Let, as before, $\mathcal{P} = (\Omega, \mathcal{S}, \mu)$ be a probability space and suppose that $X_1, \dots, X_n$ are random variables on this space, where each $X_i : \Omega \to T$ for some open interval $T \subset \mathbb{R}$, and define $\check{X} : \Omega \to T$ by $\check{X}(\omega) = \min_{i \in I} X_i(\omega)$. To define the corresponding conditional random variables $\check{X}_1, \dots, \check{X}_n$: for any $i$, let

$$\Omega_i = \left\{\omega \in \Omega \mid X_i(\omega) = \check{X}(\omega)\right\}$$

and define the conditional random variable $\check{X}_i : \Omega_i \to T$ by $\check{X}_i(\omega) = X_i(\omega)$ $\forall \omega \in \Omega_i$. Granted $\mu(\Omega_i) > 0$, this defines $\check{X}_i$ as a random variable in the probability space $\mathcal{P}_i = (\Omega_i, \mathcal{S}_i, \mu_i)$, where $\mathcal{S}_i$ and $\mu_i$ are defined as before.

Let $\mathcal{H}_n$ be the class of survival functions $\bar{F} : T^n \to [0, 1]$, where $1 - \bar{F}$ is some c.d.f. in $\mathcal{F}_n$, and for which there exists a seed survival function $\Psi = 1 - \Phi$, for some $\Phi \in \mathcal{F}_1$, a positive shape-parameter vector $\alpha = (\alpha_1, \dots, \alpha_n)$ and a continuously differentiable and linearly homogeneous aggregation function $G : \mathbb{R}_+^n \to \mathbb{R}_+$ such that

$$\bar{F}(x) = e^{-G[-\alpha_1 \ln \Psi(x_1),\, \dots,\, -\alpha_n \ln \Psi(x_n)]} \quad \forall x \in T^n.$$

Equipped with this notational machinery, we obtain our main result for the case of minimization:

Theorem 3. If $\mathbf{X} : \Omega \to T^n$ has a survival function $\bar{F} \in \mathcal{H}_n$ for some $\Psi$, $\alpha$ and $G$, then

$$\Pr\left[i \in \arg\min_j X_j\right] = \frac{\alpha_i \, G'_i(\alpha)}{G(\alpha)} \quad \forall i \in I \tag{8}$$

and all random variables $\check{X}, \check{X}_1, \dots, \check{X}_n$ have the survival function $\bar{F}^*$ given by

$$\bar{F}^*(t) = [\Psi(t)]^{G(\alpha)} \quad \forall t \in T. \tag{9}$$

Proof: We first derive the survival function $\bar{F}^*$ for $\check{X}$. For any $t \in T$:

$$\bar{F}^*(t) = \Pr\left[\check{X} \ge t\right] = \bar{F}(t, \dots, t) = e^{-G[-\alpha_1 \ln \Psi(t), \dots, -\alpha_n \ln \Psi(t)]} = e^{G(\alpha) \ln \Psi(t)} = [\Psi(t)]^{G(\alpha)},$$

where the fourth equality uses the linear homogeneity of $G$.

Let $\psi$ and $\bar{f}^*$ denote the probability density functions of the survival functions $\Psi$ and $\bar{F}^*$, i.e. $\psi = -\Psi'$ and $\bar{f}^* = -(\bar{F}^*)'$, respectively, and $\bar{F}'_i$ and $G'_i$ the partial derivatives with respect to the $i$'th argument of $\bar{F}$ and $G$, respectively, and define $P_i = \Pr[i \in \arg\min_j X_j]$. Using the fact that $\bar{F}^* = \Psi^{G(\alpha)}$ is the survival function of $\check{X}$, and hence its density is $\bar{f}^* = -(\bar{F}^*)' = G(\alpha)\,\psi\,\Psi^{G(\alpha)-1}$, we have, for any $t \in T$:

$$\Pr\left[\check{X}_i \ge t\right] = \Pr\left[X_i \ge t \mid X_i \le X_j \ \forall j \in I\right] = \frac{\Pr\left[X_i \ge t \wedge X_i \le X_j \ \forall j \in I\right]}{\Pr\left[X_i \le X_j \ \forall j \in I\right]}$$

$$= -\frac{1}{P_i} \int_{x \in T:\, x \ge t} \bar{F}'_i(x, \dots, x)\, dx$$

$$= \frac{1}{P_i} \int_{x \in T:\, x \ge t} e^{-G[-\alpha_1 \ln \Psi(x), \dots, -\alpha_n \ln \Psi(x)]} \cdot G'_i\left[-\alpha_1 \ln \Psi(x), \dots, -\alpha_n \ln \Psi(x)\right] \cdot \alpha_i \frac{\psi(x)}{\Psi(x)}\, dx$$

$$= \frac{\alpha_i \, G'_i(\alpha)}{P_i \, G(\alpha)} \int_{x \in T:\, x \ge t} G(\alpha)\, \psi(x)\, [\Psi(x)]^{G(\alpha)-1}\, dx$$

$$= \frac{\alpha_i \, G'_i(\alpha)}{P_i \, G(\alpha)} \int_{x \in T:\, x \ge t} \bar{f}^*(x)\, dx = \frac{\alpha_i \, G'_i(\alpha)}{P_i \, G(\alpha)} \cdot \bar{F}^*(t).$$

The fifth equality uses the homogeneity of degree one of $G$ and of degree zero of $G'_i$. For $t \to \inf T$, $\Pr[\check{X}_i \ge t] = \bar{F}^*(t) = 1$, and hence

$$P_i = \frac{\alpha_i \, G'_i(\alpha)}{G(\alpha)}.$$

Thus, for any $t \in T$: $\Pr[\check{X}_i \ge t] = \bar{F}^*(t)$. Q.E.D.

Just as in the maximization case, the selection probabilities, here defined by $P_i = \Pr[i \in \arg\min_{j \in I} X_j]$, are independent of which seed function $\Psi$ is used, while the survival function $\bar{F}^*$ for the minimum, and for all conditional minima, does depend on the seed function.

3.3. Statistical independence. We next turn to minimization in the independence case. Let thus $X_1, \dots, X_n$ be statistically independent random variables with marginal survival functions $\bar{F}_i(t) = [\Psi(t)]^{\alpha_i}$. The survival function for $\mathbf{X} = (X_1, \dots, X_n)$ can then be expressed in product form as

$$\bar{F}(x) = \prod_{i=1}^{n} \bar{F}_i(x_i) = e^{\alpha_1 \ln \Psi(x_1) + \dots + \alpha_n \ln \Psi(x_n)}.$$

The aggregation function $G$ is thus simply summation, see equation (5). From Theorem 3 we immediately obtain

$$\Pr\left[i \in \arg\min_j X_j\right] = \frac{\alpha_i}{\alpha_1 + \dots + \alpha_n} \quad \forall i \in I \tag{10}$$

and all random variables $\check{X}, \check{X}_1, \dots, \check{X}_n$ have the same survival function $\bar{F}^*$, given by

$$\bar{F}^*(t) = [\Psi(t)]^{\alpha_1 + \dots + \alpha_n} \quad \forall t \in T. \tag{11}$$

Just as in the case of maximization, the invariance property characterizes the subclass $\mathcal{H}_n$ of survival functions of the above form. More exactly, if the random variables $X_1, \dots, X_n$ are statistically independent, then they have the invariance property if and only if each random variable $X_i$ has a survival function of the form $\Psi^{\alpha_i}$ for some common seed survival function $\Psi$ and positive shape parameters $\alpha_i$. This result can be obtained along the same lines as Theorem 2. We also need the counterpart to equation (4) for the minimization case. In general, if $X_1, \dots, X_n$ have survival functions $\bar{F}_i$ such that $1 - \bar{F}_i \in \mathcal{F}_1$, then

$$\bar{F}^*(t) = \sum_{i=1}^{n} P_i \, \bar{F}_i^{\,*}(t), \tag{12}$$

where $P_i$ is the selection probability for each alternative $i$:

$$\bar{F}^*(t) = \Pr\left[\check{X} \ge t\right] = \sum_i \Pr\left[X_i \ge t \wedge X_i \le X_j \ \forall j\right] = \sum_i \Pr\left[X_i \le X_j \ \forall j\right] \cdot \Pr\left[X_i \ge t \mid X_i \le X_j \ \forall j\right] = \sum_i P_i \, \bar{F}_i^{\,*}(t),$$

where $\bar{F}_i^{\,*}$ is the survival function of $\check{X}_i$.

Theorem 4. Let $X_1, \dots, X_n$ be statistically independent with survival functions $\bar{F}_i$ such that $F_i = 1 - \bar{F}_i \in \mathcal{F}_1$ $\forall i \in I$. All conditional random variables $\check{X}_i$ have the same probability distribution if and only if there is some survival function $\Psi$ with $1 - \Psi \in \mathcal{F}_1$ and $\alpha \in \mathbb{R}_+^n$ such that $\bar{F}_i = \Psi^{\alpha_i}$ $\forall i \in I$.

Proof: The "if" claim follows directly from Theorem 3.

In order to prove the "only if" claim, first consider the case $n = 2$. Suppose, thus, that $X_1$ and $X_2$ are statistically independent with survival functions $\bar{F}_1, \bar{F}_2$ such that $1 - \bar{F}_i \in \mathcal{F}_1$, and suppose that $\check{X}_1$ and $\check{X}_2$ have the same survival function. In force of equation (12) we then have $\bar{F}_1^{\,*} = \bar{F}_2^{\,*} = \bar{F}^* = \bar{F}_1 \bar{F}_2$. For any $t \in T$:

$$\Pr\left[X_1 \ge X_2 \ge t\right] = \Pr\left[X_1 \ge X_2\right] \cdot \Pr\left[X_2 \ge t \mid X_1 \ge X_2\right] = \Pr\left[X_1 \ge X_2\right] \cdot \bar{F}_2^{\,*}(t) = P_2 \cdot \bar{F}_1(t)\,\bar{F}_2(t),$$

where $P_2 = \Pr[X_2 \le X_1] \in (0, 1)$. Lemma 3.2 in Resnick and Roy (1990b) is formulated for the maximization case. It is readily checked that a corresponding result holds for minimization: $\bar{F}_1 = \bar{F}_2^{(1-P_2)/P_2}$.

For an arbitrary $n > 2$, suppose that $X_1, \dots, X_n$ are statistically independent with survival functions $\bar{F}_i$ such that $1 - \bar{F}_i \in \mathcal{F}_1$, and assume that $\check{X}_1, \dots, \check{X}_n$ all have the same survival function. Then $\bar{F}_1^{\,*} = \dots = \bar{F}_n^{\,*} = \bar{F}^* = \prod_{i=1}^{n} \bar{F}_i$ (again by (12)). Consider an arbitrary $i \in \{1, \dots, n\}$. We clearly have $\check{X} = \min\{X_i, Y_i\}$, where $Y_i = \min_{j \in I, j \ne i} X_j$, and the survival function of $Y_i$ is $\bar{H}_i = \prod_{j \in I, j \ne i} \bar{F}_j$. Since the subset $\Omega_i^1 \subset \Omega$ where $X_i \le Y_i$ is the same as the subset $\Omega_i \subset \Omega$ where $X_i \le X_j$ for all $j \in I$, we conclude that the conditional random variable $\hat{X}_i$, defined on $\Omega_i^1$ by $\hat{X}_i(\omega) = X_i(\omega)$ $\forall \omega \in \Omega_i^1$, is identical with the conditional random variable $\check{X}_i$, and hence the survival function of $\hat{X}_i$ is also $\bar{F}_i^{\,*} = \bar{F}^*$. Let the conditional random variable $\hat{Y}_i$ be defined on $\Omega \setminus \Omega_i^1$ by $\hat{Y}_i(\omega) = Y_i(\omega)$, and let $\bar{H}_i^{\,*}$ denote its survival function. Applying again equation (12), we obtain, with $P_i = \Pr(X_i \le Y_i)$, the equation $\bar{F}^* = (1 - P_i)\bar{H}_i^{\,*} + P_i \bar{F}_i^{\,*}$, and hence $\bar{H}_i^{\,*} = \bar{F}^*$. From the above result for $n = 2$ we thus obtain

$$\bar{F}_i^{(1-P_i)/P_i} = \bar{H}_i = \prod_{j \in I, j \ne i} \bar{F}_j.$$

Multiplying both sides by $\bar{F}_i$: $\bar{F}_i^{1/P_i} = \prod_{j \in I} \bar{F}_j = \bar{F}^*$, or $\bar{F}_i = (\bar{F}^*)^{P_i}$. Since this holds for any $i \in \{1, \dots, n\}$, the claim has been established, with $\Psi = \bar{F}^*$, where $1 - \bar{F}^* \in \mathcal{F}_1$, and $\alpha_i = P_i > 0$. Q.E.D.

4. Parametric distributions

Earlier established results for the Fréchet, Gumbel, Pareto and Weibull distributions can now be obtained as special cases. We recall that the gamma function $\Gamma$ has domain $\mathbb{R}_+$ and is defined by

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt,$$

and, for positive integer arguments, $\Gamma(k) = (k-1)!$. We consider the parametric families in alphabetical order.

4.1. Fréchet. A random variable $X$ is Fréchet distributed with parameters $\beta > 0$ and $\gamma > 0$, or Fréchet$(\beta, \gamma)$ for short, if its distribution function has domain $T = \mathbb{R}_+$ and

$$F(t) = \exp\left[-\left(\frac{t}{\gamma}\right)^{-\beta}\right] \quad \forall t \in \mathbb{R}_+.$$

We note that $F(t)$ is expressible in the form

$$F(t) = \left[\exp\left(-t^{-\beta}\right)\right]^{\gamma^{\beta}} = \Phi(t)^{\alpha},$$

where $\Phi$ is the c.d.f. for Fréchet$(\beta, 1)$ and $\alpha = \gamma^{\beta}$. Hence, if random variables $X_1, \dots, X_n$ are Fréchet distributed with common $\beta$ but possibly different $\gamma_i$'s, then we may write $\alpha_i = (\gamma_i)^{\beta}$ and apply Theorem 2, and equations (6) and (7), to obtain

Proposition 2. Let $X_1, \dots, X_n$ be statistically independent, with each $X_i$ Fréchet$(\beta, \gamma_i)$, for $\beta, \gamma_1, \dots, \gamma_n > 0$. Then

$$\Pr\left[i \in \arg\max_j X_j\right] = \frac{\gamma_i^{\beta}}{\sum_j \gamma_j^{\beta}} \quad \forall i \in I$$

and $\tilde{X}, \tilde{X}_1, \dots, \tilde{X}_n$ are all Fréchet$(\beta, \gamma^*)$, where

$$\gamma^* = \left(\sum_j \gamma_j^{\beta}\right)^{1/\beta}.$$

Proof: By the above, all $X_i$ have c.d.f:s of the form $\Phi^{\alpha_i}$ for $\alpha_i = (\gamma_i)^{\beta}$. Theorem 2 and equations (6) and (7) together give the expression for the selection probabilities, as well as the c.d.f. of $\tilde{X}$ and all $\tilde{X}_1, \dots, \tilde{X}_n$: $F^* = \Phi^{\Sigma_j \alpha_j}$, or, equivalently, Fréchet$(\beta, \gamma^*)$ with $\gamma^*$ as stated in the proposition. Q.E.D.

We recall that the expected value of a Fréchet$(\beta, \gamma)$ distributed random variable is defined when $\beta > 1$, in which case it is $\gamma\, \Gamma(1 - 1/\beta)$. Hence, for $\beta > 1$:

$$\mathrm{E}\left[\tilde{X}\right] = \mathrm{E}\left[\tilde{X}_1\right] = \dots = \mathrm{E}\left[\tilde{X}_n\right] = \gamma^*\, \Gamma(1 - 1/\beta).$$

We also note that if a random variable $X$ is Fréchet$(\beta, \gamma)$ then $X = \gamma Y$ for $Y$ Fréchet$(\beta, 1)$. Hence, in applications of this machinery, one may view each random variable $X_i$ as the product $\gamma_i Y_i$, where $\gamma_i > 0$ is a parameter and $Y_i$ is Fréchet$(\beta, 1)$.
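A simulation sketch of Proposition 2 with parameter values of our own choosing, $\beta = 2$ and $\gamma = (1, 2, 3)$: Fréchet$(\beta, \gamma_i)$ draws are obtained by inverse transform, $X_i = \gamma_i(-\ln U)^{-1/\beta}$; the winning frequencies should approach $\gamma_i^{\beta}/\sum_j \gamma_j^{\beta}$, and the mean of the maximum should approach $\gamma^*\,\Gamma(1 - 1/\beta)$.

```python
import numpy as np
from math import gamma as Gamma

rng = np.random.default_rng(2)
beta, gam = 2.0, np.array([1.0, 2.0, 3.0])
R = 200_000

# Frechet(beta, gamma_i) by inverse transform: X = gamma * (-ln U)^(-1/beta).
U = rng.uniform(size=(R, gam.size))
X = gam * (-np.log(U)) ** (-1.0 / beta)

winner = X.argmax(axis=1)
print("selection frequencies:", np.bincount(winner) / R)
print("Proposition 2:        ", gam**beta / np.sum(gam**beta))

gam_star = np.sum(gam**beta) ** (1.0 / beta)     # the maximum is Frechet(beta, gamma*)
print("E[max], simulated:", X.max(axis=1).mean().round(3))
print("E[max], theory:   ", round(gam_star * Gamma(1.0 - 1.0 / beta), 3))
```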

4.2. Gumbel. A random variable $X$ is Gumbel distributed with parameters $\lambda > 0$ and $\mu \in \mathbb{R}$, or Gumbel$(\lambda, \mu)$ for short, if its distribution function has domain $T = \mathbb{R}$ and

$$F(t) = \exp\left(-e^{-\lambda(t - \mu)}\right) \quad \forall t \in \mathbb{R}.$$

We note that $F(t)$ is expressible in the form

$$F(t) = \left[\exp\left(-e^{-\lambda t}\right)\right]^{e^{\lambda \mu}} = \Phi(t)^{\alpha},$$

where $\Phi$ is the c.d.f. for Gumbel$(\lambda, 0)$ and $\alpha = e^{\lambda \mu}$. Hence, if random variables $X_1, \dots, X_n$ are Gumbel distributed with common $\lambda$ but possibly different $\mu_i$'s, then we may write $\alpha_i = e^{\lambda \mu_i}$ and apply Theorem 2, and equations (6) and (7), to obtain

Proposition 3. Let $X_1, \dots, X_n$ be statistically independent, with each $X_i$ Gumbel$(\lambda, \mu_i)$, for $\lambda > 0$. Then

$$\Pr\left[i \in \arg\max_j X_j\right] = \frac{\exp(\lambda \mu_i)}{\sum_j \exp(\lambda \mu_j)} \quad \forall i \in I$$

and $\tilde{X}, \tilde{X}_1, \dots, \tilde{X}_n$ are all Gumbel$(\lambda, \mu^*)$, where

$$\mu^* = \frac{1}{\lambda} \ln \sum_j \exp(\lambda \mu_j).$$

Proof: By the above, all $X_i$ have c.d.f:s of the form $\Phi^{\alpha_i}$ for $\alpha_i = e^{\lambda \mu_i}$. Theorem 2 and equations (6) and (7) together give the expression for the selection probabilities, as well as the c.d.f. of $\tilde{X}$ and all $\tilde{X}_1, \dots, \tilde{X}_n$: $F^* = \Phi^{\Sigma_j \alpha_j}$, or, equivalently, Gumbel$(\lambda, \mu^*)$ with $\mu^*$ as stated in the proposition. Q.E.D.

This is the much used and versatile multinomial logit model (McFadden, 1974). We recall that the expected value of a Gumbel$(\lambda, \mu)$ distributed random variable is $\mu + \gamma_E/\lambda$, where $\gamma_E$ is Euler's constant. Hence:

$$\mathrm{E}\left[\tilde{X}\right] = \mathrm{E}\left[\tilde{X}_1\right] = \dots = \mathrm{E}\left[\tilde{X}_n\right] = \mu^* + \gamma_E/\lambda.$$

We also note that if a random variable $X$ is Gumbel$(\lambda, \mu)$ then $X = \mu + Y$ for $Y$ Gumbel$(\lambda, 0)$. Hence, in applications of this machinery, one may view each random variable $X_i$ as the sum $\mu_i + Y_i$, where $\mu_i$ is a parameter and $Y_i$ is Gumbel$(\lambda, 0)$.
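Proposition 3 is the multinomial logit formula. The sketch below (function names and parameter values are our own) computes the closed-form choice probabilities and the log-sum location parameter $\mu^*$, and checks them against a quick Gumbel simulation.

```python
import numpy as np

def logit_probabilities(mu, lam=1.0):
    """Multinomial logit probabilities exp(lam*mu_i) / sum_j exp(lam*mu_j) (Proposition 3)."""
    z = lam * np.asarray(mu, dtype=float)
    z -= z.max()                       # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

def logsum(mu, lam=1.0):
    """Location parameter mu* = (1/lam) * ln sum_j exp(lam*mu_j) of the maximum."""
    mu = np.asarray(mu, dtype=float)
    return np.log(np.sum(np.exp(lam * mu))) / lam

mu, lam = np.array([0.0, 0.5, 1.0]), 1.0
print("closed form:", logit_probabilities(mu, lam))

rng = np.random.default_rng(3)
X = mu + rng.gumbel(0.0, 1.0 / lam, size=(100_000, mu.size))  # X_i = mu_i + Gumbel(lam, 0)
print("simulated:  ", np.bincount(X.argmax(axis=1)) / X.shape[0])
print("E[max] vs mu* + Euler/lam:", X.max(axis=1).mean().round(3),
      (logsum(mu, lam) + np.euler_gamma / lam).round(3))
```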

4.3. Pareto. A random variable $X$ is Pareto distributed with parameters $\beta > 0$ and $\gamma > 0$, or Pareto$(\beta, \gamma)$ for short, if its survival function has the domain $T = (\gamma, +\infty)$ and

$$\bar{F}(t) = \left(\frac{\gamma}{t}\right)^{\beta} \quad \forall t \in (\gamma, +\infty).$$

We note that $\bar{F}(t)$ is expressible in the form

$$\bar{F}(t) = \Psi(t)^{\beta},$$

where $\Psi$ is the survival function for Pareto$(1, \gamma)$. Hence, if random variables $X_1, \dots, X_n$ are Pareto distributed with common $\gamma$ but possibly different $\beta_i$'s, then we may write $\alpha_i = \beta_i$ and apply Theorem 4, and equations (10) and (11), to obtain

Proposition 4. Let $X_1, \dots, X_n$ be statistically independent, with each $X_i$ Pareto$(\beta_i, \gamma)$, for $\beta_1, \dots, \beta_n, \gamma > 0$. Then

$$\Pr\left[i \in \arg\min_j X_j\right] = \frac{\beta_i}{\sum_j \beta_j} \quad \forall i \in I$$

and $\check{X}, \check{X}_1, \dots, \check{X}_n$ are all Pareto$(\Sigma_j \beta_j, \gamma)$.

Proof: By the above, all $X_i$ have survival functions of the form $\Psi^{\alpha_i}$. Theorem 4 and equations (10) and (11) together give the expression for the selection probabilities, as well as the survival function of $\check{X}$ and all $\check{X}_1, \dots, \check{X}_n$: $\bar{F}^* = \Psi^{\Sigma_j \beta_j}$, or, equivalently, Pareto$(\Sigma_j \beta_j, \gamma)$ as stated in the proposition. Q.E.D.

We recall that the expected value of a Pareto$(\beta, \gamma)$ distributed random variable is defined when $\beta > 1$, in which case it is $\beta\gamma/(\beta - 1)$. Hence, for $\Sigma_j \beta_j > 1$:

$$\mathrm{E}\left[\check{X}\right] = \mathrm{E}\left[\check{X}_1\right] = \dots = \mathrm{E}\left[\check{X}_n\right] = \frac{\gamma\, \Sigma_j \beta_j}{\Sigma_j \beta_j - 1}.$$

We also note that if a random variable $X$ is Pareto$(\beta, \gamma)$ then $X = \gamma Y^{1/\beta}$ for $Y$ Pareto$(1, 1)$. Hence, in applications of this machinery, one may view each random variable $X_i$ as proportional, with the factor $\gamma$, to the $\beta_i$'th root of the random variable $Y_i$, where $Y_i$ is Pareto$(1, 1)$.

4.4. Weibull. A random variable $X$ is Weibull distributed with parameters $\beta > 0$ and $\gamma > 0$, or Weibull$(\beta, \gamma)$ for short, if its survival function has the domain $T = \mathbb{R}_+$ and

$$\bar{F}(t) = \exp\left[-\left(\frac{t}{\gamma}\right)^{\beta}\right].$$

We note that $\bar{F}(t)$ is expressible in the form

$$\bar{F}(t) = \left[\exp\left(-t^{\beta}\right)\right]^{\gamma^{-\beta}} = \Psi(t)^{\alpha},$$

where $\Psi$ is the survival function for Weibull$(\beta, 1)$ and $\alpha = \gamma^{-\beta}$. Hence, if random variables $X_1, \dots, X_n$ are Weibull distributed with common $\beta$ but possibly different $\gamma_i$'s, then we may write $\alpha_i = (\gamma_i)^{-\beta}$ and apply Theorem 4, and equations (10) and (11), to obtain

Proposition 5. Let $X_1, \dots, X_n$ be statistically independent, with each $X_i$ Weibull$(\beta, \gamma_i)$, for $\beta, \gamma_1, \dots, \gamma_n > 0$. Then

$$\Pr\left[i \in \arg\min_j X_j\right] = \frac{(\gamma_i)^{-\beta}}{\sum_j (\gamma_j)^{-\beta}} \quad \forall i \in I$$

and $\check{X}, \check{X}_1, \dots, \check{X}_n$ are all Weibull$(\beta, \gamma^*)$, where

$$\gamma^* = \left(\sum_j \gamma_j^{-\beta}\right)^{-1/\beta}.$$

Proof: By the above, all $X_i$ have survival functions of the form $\Psi^{\alpha_i}$ for $\alpha_i = (\gamma_i)^{-\beta}$. Theorem 4 and equations (10) and (11) together give the expression for the selection probabilities, as well as the survival function of $\check{X}$ and all $\check{X}_1, \dots, \check{X}_n$: $\bar{F}^* = \Psi^{\Sigma_j \alpha_j}$, or, equivalently, Weibull$(\beta, \gamma^*)$ with $\gamma^*$ as stated in the proposition. Q.E.D.

We recall that the expected value of a Weibull$(\beta, \gamma)$ distributed random variable is $\gamma\, \Gamma(1 + 1/\beta)$. Hence:

$$\mathrm{E}\left[\check{X}\right] = \mathrm{E}\left[\check{X}_1\right] = \dots = \mathrm{E}\left[\check{X}_n\right] = \gamma^*\, \Gamma(1 + 1/\beta).$$

We also note that if a random variable $X$ is Weibull$(\beta, \gamma)$ then $X = \gamma Y$ for $Y$ Weibull$(\beta, 1)$. Hence, in applications of this machinery, one may view each random variable $X_i$ as the product $\gamma_i Y_i$, where $\gamma_i > 0$ is a parameter and $Y_i$ is Weibull$(\beta, 1)$.

As noted in the introduction, Castillo et al. (2008), Fosgerau and Bierlaire (2009) and Li (2010) provide applications of the Weibull distribution to discrete choice, based on random cost minimization.
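In the reliability reading, each $X_i$ is the strength or lifetime of component $i$ and the system fails at the weakest component. A simulation sketch of Proposition 5 with assumed values $\beta = 1.5$ and $\gamma = (1, 2, 3)$ (our own choice): the probability that component $i$ fails first should approach $\gamma_i^{-\beta}/\sum_j \gamma_j^{-\beta}$, and the minimum itself should be Weibull$(\beta, \gamma^*)$.

```python
import numpy as np
from math import gamma as Gamma

rng = np.random.default_rng(4)
beta, gam = 1.5, np.array([1.0, 2.0, 3.0])
R = 200_000

# Weibull(beta, gamma_i) by inverse transform of the survival function:
# Fbar(t) = exp(-(t/gamma)^beta)  =>  X = gamma * (-ln U)^(1/beta).
U = rng.uniform(size=(R, gam.size))
X = gam * (-np.log(U)) ** (1.0 / beta)

weakest = X.argmin(axis=1)
print("failure frequencies:", np.bincount(weakest) / R)
print("Proposition 5:      ", gam**(-beta) / np.sum(gam**(-beta)))

gam_star = np.sum(gam**(-beta)) ** (-1.0 / beta)     # the minimum is Weibull(beta, gamma*)
print("E[min], simulated:", X.min(axis=1).mean().round(3))
print("E[min], theory:   ", round(gam_star * Gamma(1.0 + 1.0 / beta), 3))
```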

5. Statistical dependence

Consider the following class of continuously differentiable and linearly homogeneous functions $G$:

$$G(y) = \left(\sum_{i=1}^{n} y_i^{\rho}\right)^{1/\rho} \quad \forall y \in \mathbb{R}_+^n, \tag{13}$$

where $\rho \ge 1$. Clearly, statistical independence between the variables amounts to the special case when $\rho = 1$; then (13) boils down to (5). It follows from Proposition 1 that the associated function $F$, defined in equation (1), is a c.d.f. The partial derivative of $G(y)$ with respect to $y_i$ is

$$G'_i(y) = G(y)^{1-\rho} \cdot y_i^{\rho - 1},$$

so if $\mathbf{X} = (X_1, \dots, X_n)$ has a c.d.f. $F \in \mathcal{G}_n$ for some $\Phi$, $\alpha$ and such a $G$, then

$$\Pr\left[i \in \arg\max_j X_j\right] = \frac{\alpha_i^{\rho}}{\sum_{j=1}^{n} \alpha_j^{\rho}} \quad \forall i \in I. \tag{14}$$

We note that these selection probabilities are independent of the choice of seed function $\Phi$. Hence, there is no reason to restrict the distribution functions for the random variables to be of a particular parametric form, as long as their joint c.d.f. is expressible in the form of equation (1).⁷

We also note that while the case when the aggregation parameter $\rho$ takes on its minimal value, 1, represents statistical independence between the alternatives, in the opposite extreme case, when $\rho \to +\infty$, the selection probabilities are the same as if the alternatives were maximally positively correlated, or, alternatively, as under a deterministic choice of the alternative with the highest expectation:

$$\alpha_i > \max_{j \ne i} \alpha_j \;\; \Rightarrow \;\; \lim_{\rho \to +\infty} \Pr\left[i \in \arg\max_j X_j\right] = 1.$$

⁷Equation (14) also reveals that the selection probabilities are identical with those for statistically independent random variables $(Y_1, \dots, Y_n)$, and hence $G(y) = y_1 + \dots + y_n$, with the same seed function $\Phi$, but with shape vector $\alpha' = (\alpha'_1, \dots, \alpha'_n)$ where $\alpha'_i = \alpha_i^{\rho}$ $\forall i$. In this sense, the form of statistical dependence permitted by (13) is observationally equivalent with statistical independence among random variables with suitably changed marginal distributions.
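A small numerical sketch of equation (14) (parameter values our own): with $G(y) = (\sum_j y_j^{\rho})^{1/\rho}$ the selection probabilities are $\alpha_i^{\rho}/\sum_j \alpha_j^{\rho}$, which reduce to $\alpha_i/\sum_j \alpha_j$ at $\rho = 1$ and concentrate on the alternative with the largest shape parameter as $\rho \to \infty$.

```python
import numpy as np

def ces_selection_probabilities(alpha, rho):
    """Selection probabilities alpha_i^rho / sum_j alpha_j^rho (equation (14))."""
    a = np.asarray(alpha, dtype=float) ** rho
    return a / a.sum()

alpha = [1.0, 2.0, 3.0]
for rho in (1.0, 2.0, 5.0, 20.0):
    print(f"rho = {rho:>4}:", ces_selection_probabilities(alpha, rho).round(4))
# rho = 1 gives the independence case alpha_i / sum(alpha);
# as rho grows, probability mass concentrates on the alternative with the largest alpha_i.
```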

5.1. Dependence within clusters of alternatives. Suppose that $\mathbf{X} = (X_1, \dots, X_n)$ has a c.d.f. $F \in \mathcal{G}_n$ for some seed function $\Phi \in \mathcal{F}_1$, positive parameter vector $\alpha = (\alpha_1, \dots, \alpha_n)$, and continuously differentiable and linearly homogeneous function $G$ of the form

$$G(y) = \left(\sum_{i=1}^{m} y_i^{\rho}\right)^{1/\rho} + \left(\sum_{i=m+1}^{n} y_i^{\sigma}\right)^{1/\sigma} \quad \forall y \in \mathbb{R}_+^n,$$

for some $\rho, \sigma \ge 1$. That $F$ actually is a c.d.f. follows from Proposition 1. When both parameters $\rho$ and $\sigma$ exceed one, there is statistical dependence within the cluster $A$ that consists of the first $m$ random variables and statistical dependence within the cluster $B$ that consists of the last $n - m$ variables. Let $A = \{1, \dots, m\}$ and $B = \{m+1, \dots, n\}$, and write $G(y_1, \dots, y_n) = G_A(y_1, \dots, y_m) + G_B(y_{m+1}, \dots, y_n)$. From the above we then have, for each $i \in A$:

$$P_i = \Pr\left[i \in \arg\max_j X_j\right] = \frac{\alpha_i \, G'_i(\alpha)}{G(\alpha)} \tag{15}$$

$$= \frac{G_A(\alpha_1, \dots, \alpha_m)}{G_A(\alpha_1, \dots, \alpha_m) + G_B(\alpha_{m+1}, \dots, \alpha_n)} \cdot \frac{\alpha_i^{\rho}}{\sum_{j=1}^{m} \alpha_j^{\rho}} = \frac{1}{1 + Q(\alpha)} \cdot \frac{\alpha_i^{\rho}}{\sum_{j=1}^{m} \alpha_j^{\rho}}$$

(and similarly for $i \in B$), where

$$Q(\alpha) = \frac{G_B(\alpha_{m+1}, \dots, \alpha_n)}{G_A(\alpha_1, \dots, \alpha_m)} = \frac{\left(\alpha_{m+1}^{\sigma} + \dots + \alpha_n^{\sigma}\right)^{1/\sigma}}{\left(\alpha_1^{\rho} + \dots + \alpha_m^{\rho}\right)^{1/\rho}} > 0.$$

The second factor in (15) is the conditional selection probability for alternative $i$, given that the first group has been selected (c.f. equation (14)). Hence, the first factor in (15) is the probability that an alternative in group $A$ will be selected:⁸

$$\Pr\left[i \in \arg\max_j X_j\right] = \Pr\left[\arg\max_j X_j \subset A\right] \cdot \Pr\left[i \in \arg\max_{j \in A} X_j \ \Big|\ \arg\max_j X_j \subset A\right] = \Pr\left[\arg\max_j X_j \subset A\right] \cdot \Pr\left[i \in \arg\max_{j \in A} X_j\right].$$

A special case of this is the so-called nested logit model (McFadden, 1978), whereby first a group of alternatives is selected and thereafter an alternative within the selected group. In that special case, the seed function is assumed to be Gumbel, while here no parametric assumption has been made concerning the seed function.

With the same shape parameter $\alpha$ for each alternative and equally large groups ($n = 2m$) we obtain $Q(\alpha) = m^{1/\sigma - 1/\rho}$. In particular, if the statistical dependence within each group is equally large in the sense of $\rho = \sigma$, then $Q(\alpha) = 1$, as one would expect. We also note that $Q(\alpha) = (n/m - 1)^{1/\rho}$ in the case of groups of arbitrary size, but when all alternatives have the same shape parameter and the statistical dependence within each group is equally large. One example of such a situation is the so-called blue-bus red-bus paradox, to which we now turn.⁹

⁸It also follows that

$$\Pr\left[\arg\max_j X_j \subset B\right] = \frac{Q(\alpha)}{1 + Q(\alpha)}.$$

⁹The first to note this paradox seems to have been Debreu (1960), although he phrased it in terms of music choice, rather than transportation choice.

5.2. The blue-bus red-bus paradox. Let $n = 3$ and $m = 1$, and call alternative 1 "automobile", alternative 2 "blue bus" and alternative 3 "red bus." Thus $A = \{1\}$ and $B = \{2, 3\}$, and

$$Q(\alpha) = \left[\left(\frac{\alpha_2}{\alpha_1}\right)^{\rho} + \left(\frac{\alpha_3}{\alpha_1}\right)^{\rho}\right]^{1/\rho}.$$

Consider the special case when $\alpha_1 = \alpha_2 = \alpha_3 = \alpha > 0$. Then

$$P_1 = \frac{1}{1 + 2^{1/\rho}}.$$

Hence, the selection probability for alternative 1 ("automobile") is increasing in $\rho$, from $1/3$ in the boundary case $\rho = 1$ of statistical independence among alternatives 2 and 3, towards $1/2$ in the limit case $\rho \to +\infty$ when alternatives 2 and 3 are maximally positively correlated.
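For the blue-bus/red-bus numbers above, a one-line computation of $P_1 = 1/(1 + 2^{1/\rho})$ (a sketch of our own):

```python
# P_1 = 1 / (1 + 2**(1/rho)): the automobile's selection probability rises from 1/3 towards 1/2
for rho in (1.0, 2.0, 4.0, 16.0, 1e6):
    print(f"rho = {rho:>9}: P_1 = {1.0 / (1.0 + 2.0 ** (1.0 / rho)):.4f}")
```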

5.3. Two-stage selection of alternatives. Next, we consider in some more detail the two-stage selection procedure of first choosing a set of alternatives and then an element within the chosen set. Consider first the choice of an alternative from within the subset $A$. The c.d.f. of the accordingly restricted random vector $\mathbf{X}_A = (X_1, \dots, X_m)$ is then $F_A \in \mathcal{G}_m$, with seed function $\Phi$, shape-parameter vector $\alpha_A = (\alpha_1, \dots, \alpha_m)$, and linearly homogeneous aggregation function $G_A$. Hence, the conditional selection probabilities for alternatives $i \in A$ are

$$P_{i|A} = \Pr\left[i \in \arg\max_{j \in A} X_j\right] = \frac{\alpha_i \, G'_{A,i}(\alpha_1, \dots, \alpha_m)}{G_A(\alpha_1, \dots, \alpha_m)} = \frac{\alpha_i^{\rho}}{\sum_{j=1}^{m} \alpha_j^{\rho}}.$$

The random variable $\tilde{X}_A = \max\{X_1, \dots, X_m\}$, and the associated conditional random variables $\tilde{X}_1, \dots, \tilde{X}_m$, all have the c.d.f. $\Phi^{G_A(\alpha_1, \dots, \alpha_m)}$, and similarly for the alternatives in $B$.

Consider now the selection between the "menus" of alternatives, $A$ and $B$, where the choice of menu is followed by a choice of the maximal ("best") item on the so chosen menu. Obviously, the random vector $\mathbf{Y} = (\tilde{X}_A, \tilde{X}_B)$ has a c.d.f. $F_Y \in \mathcal{G}_2$ with seed function $\Phi$, shape-parameter vector $\alpha_Y = \left(G_A(\alpha_1, \dots, \alpha_m),\, G_B(\alpha_{m+1}, \dots, \alpha_n)\right)$, and the linearly homogeneous aggregation function is summation, $G_Y(y_1, y_2) = y_1 + y_2$. Thus $F_Y(t_1, t_2) = [\Phi(t_1)]^{G_A(\alpha_1, \dots, \alpha_m)}\, [\Phi(t_2)]^{G_B(\alpha_{m+1}, \dots, \alpha_n)}$. Again by Theorem 1, the selection probability for the menu of alternatives $A$, from the binary menu collection $\{A, B\}$, is

$$P_{A|\{A,B\}} = \Pr\left[\tilde{X}_A \in \arg\max\left\{\tilde{X}_A, \tilde{X}_B\right\}\right] = \frac{G_A(\alpha_1, \dots, \alpha_m)}{G_A(\alpha_1, \dots, \alpha_m) + G_B(\alpha_{m+1}, \dots, \alpha_n)} = \frac{1}{1 + Q(\alpha)}.$$

The random variable $\tilde{Y} = \max(\tilde{X}_A, \tilde{X}_B)$, and the corresponding conditional random variables, defined by $\tilde{Y}_A^* = \tilde{X}_A$ on the subset $\Omega_A$ where $\tilde{Y} = \tilde{X}_A$, and $\tilde{Y}_B^* = \tilde{X}_B$ on the subset $\Omega_B$ where $\tilde{Y} = \tilde{X}_B$, respectively, all have the same c.d.f. $\Phi^{G_A(\alpha_1, \dots, \alpha_m) + G_B(\alpha_{m+1}, \dots, \alpha_n)}$.

All random variables $\tilde{Y}, \tilde{Y}_A^*, \tilde{Y}_B^*, \tilde{X}_1, \dots, \tilde{X}_n$ have the same c.d.f. $\Phi^{G_A(\alpha_1, \dots, \alpha_m) + G_B(\alpha_{m+1}, \dots, \alpha_n)}$. In particular, the expected values of all these random variables are the same. Hence, if the selection probabilities concern utility-maximizing individual choices in a population, then the average achieved utility is the same for each of the two subpopulations, that is, for those who chose the best alternative within $A$ and for those who chose the best alternative in $B$, and this average is also the same for each sub-subpopulation associated with any individual alternative $i$, as selected from the full set $I$ of alternatives. This two-stage selection procedure can be extended to any number of selection levels.

Remark 1. Just as in the preceding subsections, this analysis does not require any parametric representation of the seed function. However, quantitative evaluation of the distributions of "benefits" and "costs" associated with changes in alternatives, and the ensuing changes in selection probabilities and expected values, requires the analyst to specify a seed function.
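A numerical sketch (with illustrative parameters of our own) of the two-stage decomposition in Sections 5.1 and 5.3: the overall selection probability of an alternative in group $A$ equals the probability of selecting menu $A$ times the conditional within-menu probability, which in turn agrees with the direct formula (15).

```python
import numpy as np

rho, sigma = 2.0, 3.0
alpha_A = np.array([1.0, 2.0])        # shape parameters of cluster A
alpha_B = np.array([1.5, 1.5, 1.0])   # shape parameters of cluster B

G_A = np.sum(alpha_A ** rho) ** (1.0 / rho)
G_B = np.sum(alpha_B ** sigma) ** (1.0 / sigma)

P_menu_A = G_A / (G_A + G_B)                           # probability that the maximum lies in A
P_within_A = alpha_A ** rho / np.sum(alpha_A ** rho)   # conditional probabilities within A

print("P_i for i in A, via the two-stage decomposition:", P_menu_A * P_within_A)

# Check against formula (15) directly: P_i = alpha_i * G'_i(alpha) / G(alpha),
# where G'_i(alpha) = G_A**(1-rho) * alpha_i**(rho-1) for i in A.
direct = alpha_A * G_A ** (1.0 - rho) * alpha_A ** (rho - 1.0) / (G_A + G_B)
print("P_i for i in A, via formula (15) directly:       ", direct)
```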

6. Parameter estimation

One of the desiderata mentioned in the introduction, D2, was that the selection probabilities should be consistent with received theory, as the result of an explicit maximization or minimization procedure. We here briefly sketch how this can be achieved within the present approach.

In applications, the analyst usually has access to measurable attributes of the alternatives, such as travel time in the case of transportation choice, or material composition and dimensions in the case of reliability analysis. As a method of analysis, one usually specifies a parametric functional form $h$ (often affine) that maps these observable attribute vectors $z_i$ to shape parameters $\alpha_i = h(z_i)$. Together with a parametric specification of the aggregation function $G$, and a data set of outcomes, the analyst may proceed to parameter estimation by way of the maximum-likelihood method. We here focus on the case of selection based on maximization (the minimization case is completely analogous).

Assume that the data set consists of $K$ statistically independent selection experiments, $k \in \mathcal{K} = \{1, \dots, K\}$, where the selection is made from the same set $I$ of alternatives (an assumption that is made just for notational convenience). For each experiment $k$, let $i(k)$ denote the alternative selected: $i(k) \in \arg\max_{j \in I} X_j^k$, where the c.d.f. for the vector $\mathbf{X}^k = (X_1^k, \dots, X_n^k)$ is $F^k \in \mathcal{G}_n$, defined by some seed c.d.f. $\Phi$ and positive shape-parameter vector $\alpha^k = (\alpha_1^k, \dots, \alpha_n^k)$, for all experiments $k \in \mathcal{K}$. Let the aggregation function $G^k$ for experiment $k$ be continuously differentiable and linearly homogeneous, and such that

$$F^k(x) = e^{-G^k\left[-\alpha_1^k \ln \Phi(x_1),\, \dots,\, -\alpha_n^k \ln \Phi(x_n)\right]} \quad \forall x \in T^n,\ k \in \mathcal{K}.$$

For any vector $(i(1), \dots, i(K))$ of observed selections, the likelihood of the observation is

$$\mathcal{L}\left(\alpha^1, G^1, \dots, \alpha^K, G^K\right)\left(i(1), \dots, i(K)\right) = \prod_{k=1}^{K} \frac{\alpha_{i(k)}^k \,\left(G^k\right)'_{i(k)}\!\left(\alpha^k\right)}{G^k\left(\alpha^k\right)}.$$

According to the maximum-likelihood principle, the analyst seeks those parameter values, in the chosen functional forms $h$ and $G$, evaluated for the observed attribute vectors, that maximize the likelihood function $\mathcal{L}$. For given parametric forms of $h$ and $G$, the likelihood function thus has a closed analytical form, amenable to analysis. As indicated above, the set of alternatives need not be the same in all experiments $k$: the same principle applies also with differing sets $I_k$.
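A minimal maximum-likelihood sketch under assumptions of our own: the independence case with aggregation $G(y) = \sum_i y_i$ and an exponential-affine link $\alpha_i = \exp(\theta^\top z_i)$ from observed attribute vectors $z_i$ to shape parameters, so that $P_i = \alpha_i/\sum_j \alpha_j$ and the log-likelihood takes the familiar logit form. The data below are simulated, and scipy.optimize.minimize performs the maximization.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
K, n, d = 2_000, 4, 2                    # experiments, alternatives, attributes
Z = rng.normal(size=(K, n, d))           # observed attribute vectors z_i^k (simulated)
theta_true = np.array([1.0, -0.5])

def choice_probs(theta, Z):
    """P_i^k = alpha_i^k / sum_j alpha_j^k with alpha_i^k = exp(theta . z_i^k)."""
    v = Z @ theta
    v -= v.max(axis=1, keepdims=True)    # numerical stabilization
    a = np.exp(v)
    return a / a.sum(axis=1, keepdims=True)

# Simulate selections from the model with theta_true.
P = choice_probs(theta_true, Z)
chosen = np.array([rng.choice(n, p=p) for p in P])

def neg_log_likelihood(theta):
    P = choice_probs(theta, Z)
    return -np.log(P[np.arange(K), chosen]).sum()

fit = minimize(neg_log_likelihood, x0=np.zeros(d), method="BFGS")
print("true theta:     ", theta_true)
print("estimated theta:", fit.x.round(3))
```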

7. Conclusion

We have here suggested a general, unifying and operational approach to discrete selection probabilities so that these (D1) can easily be expressed in terms of observable attributes of the alternatives at hand and permit closed-form maximum-likelihood estimation of underlying parameters, (D2) can be derived from explicit maximization or minimization procedures anchored in the relevant theory for the application (microeconomics, random utility theory, reliability theory), and (D3) are analytically well-behaved for predictions of the effects of changes in relevant attributes of the alternatives, or of the set of alternatives, effects both on selection probabilities and on the achieved (maximum or minimum) value distributions (of achieved utilities, costs, or reliability). We show that this can be done without any parametric specification, and we show how our general results readily lead to well-known results for parametric probability distributions as special cases. We also define and establish an invariance property, and characterize it in the case of statistical independence.

We hope that the present approach will be useful for teaching and in applications, and that it may inspire further research in this field that combines elements of probability theory with elements of economic theory and reliability theory.

8. Appendix

8.1. Proof of Proposition 1. We prove the proposition by way of a slight generalization of Theorem 3.1 in Smith (1984).¹⁰ First, let $S$ and $T$ be open intervals in $\mathbb{R}$ and $\tau : S \to T$ a continuous and strictly increasing surjection. Clearly, if $F : T \to [0, 1]$ is a c.d.f. then so is $F \circ \tau : S \to [0, 1]$, defined by $(F \circ \tau)(s) = F(\tau(s))$. Also $\tau^{-1} : T \to S$ is a continuous and strictly increasing surjection, so if $H$ is a c.d.f. on $S$, then so is $H \circ \tau^{-1}$, defined by $(H \circ \tau^{-1})(t) = H(\tau^{-1}(t))$. Second, for each $i \in I = \{1, \dots, n\}$, let $S_i$ and $T_i$ be open intervals in $\mathbb{R}$, and let $\tau_i : S_i \to T_i$ be continuous and strictly increasing surjections. By the above observation, $F : \times_{i \in I} T_i \to [0, 1]$ is a c.d.f. if and only if $F \circ \tau : \times_{i \in I} S_i \to [0, 1]$, defined by $(F \circ \tau)(s) = F(\tau(s))$, is a c.d.f. Here $\tau : \times_{i \in I} S_i \to \times_{i \in I} T_i$ is defined by $\tau(s) = (\tau_1(s_1), \dots, \tau_n(s_n))$.

Lemma 1. Assume that $F : T^n \to [0, 1]$, defined by $F(x) = e^{-G(-\alpha_1 \ln \Phi(x_1), \dots, -\alpha_n \ln \Phi(x_n))}$, is a c.d.f. for some continuously differentiable and linearly homogeneous function $G : \mathbb{R}_+^n \to \mathbb{R}_+$, $\alpha \in \mathbb{R}_+^n$ and $\Phi \in \mathcal{F}_1$ with domain $T$. Then $\hat{F} : \hat{T}^n \to [0, 1]$, defined by $\hat{F}(x) = e^{-G(-\hat{\alpha}_1 \ln \hat{\Phi}(x_1), \dots, -\hat{\alpha}_n \ln \hat{\Phi}(x_n))}$, is a c.d.f. for any $\hat{\alpha} \in \mathbb{R}_+^n$ and $\hat{\Phi} \in \mathcal{F}_1$ with domain $\hat{T}$.

Proof: First, let $H : (-\infty, 0)^n \to [0, 1]$ be defined by $H(y) = e^{-G(-\alpha_1 y_1, \dots, -\alpha_n y_n)}$. For each $i \in I$, let $\tau_i : (-\infty, 0) \to T$ be defined by $\tau_i(y_i) = \Phi^{-1}(e^{y_i})$. Then $\tau_i$ is a continuous and strictly increasing surjection. Moreover,

$$F(\tau(y)) = e^{-G(-\alpha_1 \ln \Phi(\tau_1(y_1)), \dots, -\alpha_n \ln \Phi(\tau_n(y_n)))} = e^{-G(-\alpha_1 y_1, \dots, -\alpha_n y_n)} = H(y) \quad \forall y \in (-\infty, 0)^n.$$

Since $F$ is a c.d.f., so is $H$, by the above observation.

Second, for each $i \in I$, let $\chi_i : \hat{T} \to (-\infty, 0)$ be defined by $\chi_i(x_i) = (\hat{\alpha}_i/\alpha_i) \ln \hat{\Phi}(x_i)$, for the given positive scalars $\hat{\alpha}_i \in \mathbb{R}_+$ and the given continuous and strictly increasing c.d.f. $\hat{\Phi} : \hat{T} \to [0, 1]$. Then $\chi_i$ is a continuous and strictly increasing surjection, and we have

$$H(\chi(x)) = e^{-G(-\alpha_1 \chi_1(x_1), \dots, -\alpha_n \chi_n(x_n))} = e^{-G(-\hat{\alpha}_1 \ln \hat{\Phi}(x_1), \dots, -\hat{\alpha}_n \ln \hat{\Phi}(x_n))} = \hat{F}(x) \quad \forall x \in \hat{T}^n.$$

Since $H$ is a c.d.f., so is $\hat{F}$, again by the above observation. Q.E.D.

We are now ready to prove Proposition 1. Let thus $G : \mathbb{R}_+^n \to \mathbb{R}_+$ be some continuously differentiable and linearly homogeneous function that satisfies conditions (i) and (ii). The function $H : \mathbb{R}^n \to [0, 1]$, defined by $H(y) = e^{-G(e^{-y_1}, \dots, e^{-y_n})}$, is a c.d.f. by Theorem 3.1 in Smith (1984). Now $H \in \mathcal{G}_n$ with domain $T^n = \mathbb{R}^n$, since $H(y) = e^{-G(e^{-y_1}, \dots, e^{-y_n})} = e^{-G(-\ln \Phi(y_1), \dots, -\ln \Phi(y_n))}$ for the Gumbel c.d.f. $\Phi(t) = e^{-e^{-t}}$, and hence, by Lemma 1, any function $F : T^n \to [0, 1]$, as defined in equation (1), belongs to $\mathcal{G}_n$.

Conversely, if $F : T^n \to [0, 1]$, defined by $F(x) = e^{-G(-\alpha_1 \ln \Phi(x_1), \dots, -\alpha_n \ln \Phi(x_n))}$, is a c.d.f. for some $\alpha \in \mathbb{R}_+^n$ and $\Phi \in \mathcal{F}_1$, then by Lemma 1, the function $\hat{F} : \mathbb{R}^n \to [0, 1]$, defined by $\hat{F}(x) = e^{-G(-\ln \hat{\Phi}(x_1), \dots, -\ln \hat{\Phi}(x_n))}$, is a c.d.f. for any $\hat{\Phi} \in \mathcal{F}_1$, in particular for the Gumbel c.d.f. $\hat{\Phi}(t) = e^{-e^{-t}}$, in which case $\hat{F}(x) = e^{-G(e^{-x_1}, \dots, e^{-x_n})}$. By Theorem 3.1 in Smith (1984), $G$ must then satisfy conditions (i) and (ii). Q.E.D.

¹⁰This generalization relies heavily on suggestions made by P.O. Lindberg.

8.2. Proof of adapted Lemma in Resnick and Roy. Lemma 3.2 in Resnick and Roy (1990a) applies to non-negative random variables. We here adapt their claim and proof to the present setting:

Lemma 2 [Resnick and Roy, 1990a]. Let $X_1$ and $X_2$ be independent random variables with range $T \subset \mathbb{R}$, where $T$ is an open interval. Let $F_1$ and $F_2$ be their absolutely continuous c.d.f:s with everywhere positive densities, and let $P \in (0, 1)$. The following two claims are equivalent:

(i) $\Pr\left[X_2 \le X_1 \le t\right] = P \cdot F_1(t) \cdot F_2(t)$ $\forall t \in T$;

(ii) $F_2(t) = F_1(t)^{(1-P)/P}$ $\forall t \in T$.

Proof: We first note that

$$\Pr\left[X_2 \le X_1 \le t\right] = \int_{x \in T:\, x \le t} F_2(x)\, dF_1(x) \quad \forall t \in T.$$

Moreover, by Fubini's theorem:

$$\int_{x \in T:\, x \le t} F_1(x)\, dF_2(x) + \int_{x \in T:\, x \le t} F_2(x)\, dF_1(x) = F_1(t) \cdot F_2(t) \quad \forall t \in T. \tag{16}$$

Hence, if (i) holds, then

$$P \cdot \int_{x \in T:\, x \le t} F_1(x)\, dF_2(x) = (1 - P) \cdot \int_{x \in T:\, x \le t} F_2(x)\, dF_1(x) \quad \forall t \in T, \tag{17}$$

and thus

$$P \cdot F_1 \cdot f_2 = (1 - P) \cdot F_2 \cdot f_1,$$

or, equivalently (both c.d.f:s being positive on their domain $T$):

$$P \cdot \frac{f_2}{F_2} = (1 - P) \cdot \frac{f_1}{F_1}.$$

Integration of this identity, from $\inf T$ to an arbitrary $t \in T$, gives

$$P \cdot \ln F_2(t) = (1 - P) \cdot \ln F_1(t) \quad \forall t \in T, \tag{18}$$

which establishes claim (ii). To prove the converse, assume (ii), and hence (18). Then also (17) holds, and (i) is obtained from equation (16). Q.E.D.

References

[1] Anas, A. and Feng, C.M. (1988): "Invariance of expected utility in logit models", Economics Letters 27, 41-45.

[2] Castillo, E., J.M. Menéndez, P. Jiménez and A. Rivas (2008): "Closed form expressions for choice probabilities in the Weibull case", Transportation Research B 42, 373-380.

[3] Debreu, G. (1960): "Review of R. D. Luce, Individual Choice Behavior: A Theoretical Analysis", American Economic Review 50, 186-188.

[4] de Palma, A. and K. Kilani (2007): "Invariance of conditional maximum utility", Journal of Economic Theory 132, 137-146.

[5] Fosgerau, M. and M. Bierlaire (2009): "Discrete choice models with multiplicative error terms", Transportation Research B 43, 494-505.

[6] Fosgerau, M., D. McFadden and M. Bierlaire (2010): "Choice probability generating functions", MPRA Paper No. 24214, University of Munich.

[7] Li, B. (2010): "The multinomial logit model revisited: A semi-parametric approach in discrete choice analysis", Transportation Research B, (in press) doi:10.1016/j.trb2010.09.007.

[8] Lindberg, P.O., E.A. Eriksson and L.-G. Mattsson (1995): "Invariance of achieved utility in random utility models", Environment and Planning A 27, 121-142.

[9] McFadden, D. (1974): "Conditional logit analysis of qualitative choice behavior", in P. Zarembka (ed.), Frontiers in Econometrics. New York: Academic Press, 105-142.

[10] McFadden, D. (1978): "Modelling the choice of residential location", in A. Karlqvist et al. (eds.), Spatial Interaction Theory and Planning Models. Amsterdam: North-Holland.

[11] Resnick, S.I. and R. Roy (1990a): "Leader and maximum independence for a class of discrete choice model", Economics Letters 33, 259-263.

[12] Resnick, S.I. and R. Roy (1990b): "Multivariate extremal processes, leader processes and dynamic choice models", Advances in Applied Probability 22, 309-331.

[13] Robertson, C.A. and D. Strauss (1981): "A characterization theorem for random utility variables", Journal of Mathematical Psychology 23, 184-189.

[14] Smith, T.E. (1984): "A choice-probability characterization of generalized extreme-value models", Applied Mathematics and Computation 14, 35-62.

[15] Strauss, D. (1979): "Some results for random utility models", Journal of Mathematical Psychology 20, 35-52.

[16] Train, K. and W.W. Wilson (2008): "Estimation on stated-preference experiments constructed from revealed-preference choices", Transportation Research B 42, 191-203.