Estimating immigration in neutral communities: theoretical and practical insights into the sampling...

Estimating immigration in neutral communities:

theoretical and practical insights into the sampling

properties

Francois Munoz1* and Pierre Couteron2

1UM2 – UMR AMAP (botAnique et bioinforMatique de l’Architecture des Plantes), Boulevard de la Lironde,

TA A-51 ⁄PS2, F-34398 Montpellier Cedex 5, France; and 2IRD – UMR AMAP (botAnique et bioinforMatique de

l’Architecture des Plantes), Boulevard de la Lironde, TA A-51 ⁄PS2, F-34398 Montpellier Cedex 5, France

Summary

1. Widening applications of neutral models of communities necessitates mastering the process of

inferring parameters from species composition data. In a previous paper, we introduced the novel

conditional GST(k) statistic based on community composition. We showed that it is a reliable basis

for assessing migrant fluxes into local communities under a generalized version of the spatially

implicit neutral model of SP Hubbell, which can accommodate non-neutral patterns at scales

broader than the communities.

2. We provide here new insights into the sampling properties of the GST(k) statistic and on the

derived immigration number, I(k). The analytical formulas for bias and variance are useful to assess

estimation accuracy and investigate the variation of I(k) across communities.

3. Immigration estimation is asymptotically unbiased as sample size increases. We confirm the

validity of our analytical results on the basis of simulated neutral communities.

4. We also underline the potential of using I(k) as a descriptive index of community isolation,

without reference to anymodel of community dynamics.

5. We further propose a practical application of the bias and variance analysis for defining

sampling designs for immigration quantification by efficiently balancing the number and size of

community samples.

Key-words: bias, community isolation, GST(k) statistic, immigration, spatially implicit neu-

tral model, unified neutral theory of biodiversity and biogeography, variance

Introduction

Since the seminal work ofHubbell (2001), the neutral theory of

biodiversity has raised much interest by providing a nontrivial

null hypothesis against which the nature of real-world bio-

diversity patterns may be tested and discussed (McGill 2003;

Nee & Stone 2003; Volkov et al. 2003; Leigh 2007). Under the

key assumption of ‘ecological equivalence’, namely that indi-

viduals have equal fitness whatever their species or habitat,

Hubbell (2001) and followers (Etienne 2005) proposed a two-

level spatially implicit neutral model (so-called 2L-SINM by

Munoz, Couteron, & Ramesh 2008; see a synthesis in Beera-

volu et al. 2009 and present Fig. 1a left) to decouple the

dynamics of the large biogeographical background (i.e. the

metacommunity), in which the effect of infrequent speciation

and extinction events is quantified via the biodiversity parame-

ter h (speciation-drift balance), from local community dynam-

ics, which is controlled by the immigration parameter, I

(migration-drift balance, Etienne & Olff 2004). The theory has

found some support from the investigation of very rich tree

communities in wet evergreen tropical forests, where alterna-

tive models were not satisfactory for explaining some observed

diversity patterns, such as the overrepresentation of rare spe-

cies in the species abundance distributions (SAD) (Hubbell

2001; Chave, Alonso, & Etienne 2006). From a theoretical

standpoint, neutral models can help us grasp how the interac-

tion between limited immigration and local ‘ecological drift’

(Hubbell 2001) may influence taxonomic composition.

Munoz, Couteron, & Ramesh (2008) have further proposed to

relax the assumptions made by Hubbell (2001) at the upper

level (i.e. the metacommunity) and to generalize the initial 2L-

SINM into a 3L-SINM, by introducing an intermediate level,

the regional pool of migrants, which may be shaped by*Correspondence author. E-mail: [email protected]

Correspondence site: http://www.respond2articles.com/MEE/

Methods in Ecology and Evolution doi: 10.1111/j.2041-210X.2011.00133.x

� 2011 The Authors. Methods in Ecology and Evolution � 2011 British Ecological Society

non-neutral influences (Fig. 1a right). Under the 3L-SINM,

the study of the local migration-drift equilibrium can be dis-

connected from non-neutral influences occurring at larger

scales, via the estimation of the immigration parameter I.

Although inferring Hubbell’s parameters motivated a great

deal of research in recent years leading to several estimation

methods (Volkov et al. 2003; Etienne 2005, 2009;Munoz et al.

2007;Munoz, Couteron, &Ramesh 2008), the rapidly growing

number of applications to real-world data sets has raised ques-

tions about how to assess the reliability of the inference, even

in the strictly neutral case. Furthermore, published methods

have frequently led to substantially differing parameter esti-

mates when applied to the same data sets (see for instance Lati-

mer, Silander, & Cowling 2005; Etienne et al. 2006; Etienne

2009, Table 1). The lack of knowledge on confidence limits has

discouraged wider andmore systematic applications of neutral

theory through meta-analyses on parameter values. It is also

hindering the development and application of more realistic

models of community assemblages. Our aim is to provide such

a statistically informed perspective on the estimation of the

immigration parameter under the 3L-SINM.

The immigration parameter I is here central and represents

the number of potential migrants competing with resident off-

spring at each mortality event within the community, and as

such, it is closely related to Hubbell’s (2001) migration proba-

bility, m ¼ IIþN�1 (N notes the size of a local community). The

neutral theory basically invokes ‘dispersal limitation’ from the

metacommunity to the local community to define I and m

(Hubbell 2001; Etienne 2005). But when these parameters are

estimated from real-world data, they may also include the

effects of many processes that altogether determine the relative

isolation and differentiation of the local community from its

The spatially-implicit neutral models (SINM) with two levels (2L-SINM, left) and three levels (3L-SINM, right).

2L-SINM 3L-SINM

Estimation context

ˆ(k) is to be estimated for a sample k from the Iobserved species abundances. Sampling error occurs from uncontrolled random variation in species abundances (issue addressed in the present paper).

I(k) controls the probability of drawing a new individual that belongs to a lineage not represented in the sample (Etienne 2005, Munoz et al. 2008), but which may be already present in the community.

I(k) controls the probability that a dead individual is replaced by a migrant from outside the community.

Immigration Immigration

Metacommunity

Community i

Community j Community k

I(i) I(j) I(k)

Pool of migrants

Community i

Community j Community k

I(i) I(j) I(k)

Biogeographical scale

Estimation

Theo

ryD

ata

Community k I(k), equation (1)

Sample k I(k), equation (1)

Sample k (k), equation (2) I

Pool of migrants

(a)

(b)

Fig. 1. Immigration theory and estimation in a network of communities. (a) Two versions of a spatially implicit neutral model are considered,

including two levels (2L-SINM, conforming to Hubbell 2001; left) and three levels (3L-SINM,Munoz, Couteron, & Ramesh 2008; right). In the

2L-SINM, the migrants are randomly drawn from a large biogeographical source and the model is neutral at all scales, while in the 3L-SINM,

the local community dynamics is neutral but the model allows for non-neutral sources of variation within the pool of immigrants (the third level).

(b) summarizes the sampling and estimation issues addressed in the paper. All of the sampled communities k (e.g. forest plots) are assumed to be

independent to allow the application of equation 1 as in Munoz, Couteron, & Ramesh (2008), and the immigration numbers are estimated from

equation 2 (the hat notation represents estimated values).

2 F. Munoz & P. Couteron

� 2011 The Authors. Methods in Ecology and Evolution � 2011 British Ecological Society, Methods in Ecology and Evolution

biogeographical background. I andm could therefore be taken

as an integrative measure of community isolation and differen-

tiation, even though the strictly neutral ‘dispersal limitation’

invoked by Hubbell is not the only process shaping their val-

ues. An estimation of those immigration parameters could

therefore be especially useful for quantifying how the fragmen-

tation of threatened habitats (e.g. tropical forests) may influ-

ence the diversity in isolated patches taken as different local

communities. More generally, analyses comparing consistent

estimates of I and m across biogeographical regions and bio-

mes would be a valuable contribution to community ecology

and conservation biology by providing a measure of apparent

fragmentation, which could be related to the ecological, phys-

iographical and historical peculiarities of the different regions.

This research programme requires increasing our knowledge

about the sampling properties of I andm, which is the purpose

of the present paper.

Neutral models used for inference of I and m initially

focused on species abundances in a single community (e.g.

Hubbell 2001; Etienne 2005; Latimer, Silander, & Cowling

2005), while most recent work (Etienne 2007, 2009; Munoz

et al. 2007;Munoz, Couteron, &Ramesh 2008; Jabot&Chave

2009) has allowed estimating neutral parameters from species

compositions in a set of spatially scattered sampled communi-

ties, and provide estimations of I(k) and m(k) for each sample

k. In this regard, Munoz, Couteron, & Ramesh (2008) have

proposed a conditional version of the classical GST statistic

from population genetics (Nei 1973; Takahata & Nei 1984;

Slatkin 1985), GST(k), and have showed that I(k) can be

inferred from GST(k) under the more general three-level spa-

tially implicit neutral model (3L-SINM, Munoz, Couteron, &

Ramesh 2008; Beeravolu et al. 2009). Although this GST(k)-

based estimation of neutral immigration performedwell on the

basis of simulated neutral communities (Munoz, Couteron, &

Ramesh 2008), there is still a lack of theoretical knowledge

regarding the sampling properties of the method (as for all of

the published alternative methods).We bring here new insights

into the estimation of immigration based on the GST(k) statis-

tic, by (i) providing analytical formulas, independent of any

community model, for bias and variance of the similarity sta-

tistics underlying the computation of GST(k) and I(k) and (ii)

analysing the sensitivity of the estimation process to variation

in the target parameters and to the main features of the data,

such as the number of samples and the sample sizes. These

results will yield practical insights for sampling design by

allowing for the choice of a favourable trade-off in the number

and size of community samples, according to a desired level of

estimation accuracy.

Methodological background

Let us consider Nc distinct ecological communities indexed

by k. The probability that an individual in community k is

from species i is denoted by pik, where i = 1…S, and pi is

the probability that an individual in the lumped set of com-

munities is from species i. A sample is drawn from each com-

munity and includes a collection of individuals identified at a

consistent taxonomic level (e.g. species), such as tree census

plots scattered in a tropical forest. In practice, we assume

that the samples are far enough away from one another so as

to belong to distinct communities (see Munoz, Couteron, &

Ramesh 2008). This allows attributing the same index,

1 £ k £ Nc, to both the source communities and the corre-

sponding samples. Sample k is then made of Nk individuals

belonging to S species, and it thereby includes Nik individuals

of each arbitrary species i. The lumped set of community

samples further includes Ni individuals of each species i sum-

ming to N individuals.

The classical statistic GST (Nei 1973) is the ratio of

beta diversity over gamma diversity, to use ecological con-

cepts, which was defined from similarity statistics that

directly relate to the Simpson’s diversity indexes (Simpson

1949; Couteron & Pelissier 2004). GST is thereby a stan-

dardized measure of the variation in composition across

samples. The conditional version, GST(k), was proposed

by Munoz, Couteron, & Ramesh (2008) to measure this

information relative to each local community sample. It

integrates local similarity Fintra(k), which is the probability

of drawing two conspecific individuals conditional on

drawing them in sample k, and global similarity Fglobal(k),

which is the probability of drawing two conspecific indi-

viduals conditional on drawing the first in sample k and

the other in the overall data set. We also denoted Finter(k)

as the probability of drawing two conspecific individuals

conditional on drawing the first one in sample k and the

other one in another sample distinct of k. We will finally

consider the probability of simultaneously drawing n indi-

viduals that belong to the same species conditional on

drawing them in sample k, denoted as Fn(k). Specifically,

Fintra(k) = F2(k).

THE SPATIALLY IMPLIC IT NEUTRAL MODELS AND THE

GST (K ) STATIST IC

A spatially implicit neutral model (SINM) represents the

dynamics of local community composition, consisting of spe-

ciation, extinction and immigration events, without any expli-

cit reference to the relative distances between communities.

This family of models was introduced in neutral community

ecology by Hubbell (2001). It is intrinsically hierarchical and

aims to decouple biogeographical and ecological scales.

Indeed, evolutionary, biogeographical and ecological pro-

cesses together influence species diversity at different temporal

and spatial scales, and hierarchical models can therefore be

relevant and helpful for representing such nested influences

(Beeravolu et al. 2009). In the two-level spatially implicit neu-

tral model (2L-SINM, Fig. 1a left) of Hubbell (2001), species

abundances in the large-scale metacommunity are driven by

the balance between speciation and extinction events (specia-

tion-drift equilibrium), under the control of the biodiversity

parameter h. At the scale of a local community, species abun-

dances are shaped by a balance between immigration from

the metacommunity and local extinction (migration-drift

equilibrium). The immigration parameter I is the number of

Accuracy in estimating immigration 3


immigrants that compete with local offspring when an indi-

vidual dies in the local community. This important parameter

was first named the ‘fundamental dispersal number’ (Etienne

& Alonso 2005), but the concept can be enlarged to include

other sources of immigration limitation from the background

region (Munoz, Couteron, & Ramesh 2008). Munoz et al.

(2007) showed that the 2L-SINM of Hubbell can accommo-

date varying immigration conditions across local communities

k, by allowing varying values of I(k), and several recent

approaches have been proposed to estimate I(k) in the pres-

ence of this variation (Etienne 2007, 2009; Munoz, Couteron,

& Ramesh 2008; Jabot & Chave 2009).

In the three-level spatially implicit neutralmodel (3L-SINM,

Fig. 1a right) of Munoz, Couteron, & Ramesh (2008), the lar-

ger-scale source of immigrants is not necessarily considered to

be a metacommunity at speciation-drift equilibrium because

the model allows for various processes to occur at an interme-

diate scale, thereby providing greater flexibility for represent-

ing real-world patterns and migration pathways (Fig. 1 in

Munoz, Couteron, & Ramesh 2008). In this case, contrary to

the 2L-SINM, there is no assumption about the SAD of the

immigrants. In the context of the 3L-SINM, Munoz, Couter-

on, & Ramesh (2008) introduced the computation of

GSTðkÞ ¼ FintraðkÞ�FglobalðkÞ1�FglobalðkÞ and showed that GST(k) directly

relates to the immigration parameter, I(k), of the source com-

munity k, through the relationship:

GSTðkÞ ¼1

1þ IðkÞ= 1� Pðk=kÞð Þ ; eqn 1

Here, P(k ⁄k) is the probability of drawing an individual

in sample k conditional on first drawing an individual in

k, which is measured as Pðk=kÞ ¼ Nk�1N�1 . Applications to

simulated neutral communities showed that exact estima-

tors of the similarities,

FintraðkÞ ¼XSi¼1

Nik

Nk

� �Nik � 1

Nk � 1

� �and

FglobalðkÞ ¼XSi¼1

Nik

Nk

� �Ni � 1

N� 1

� �;

are to be used to get unbiased estimates of the immigra-

tion parameters on the basis of equation 1 (Munoz, Cou-

teron, & Ramesh 2008):

IðkÞ ¼ 1� Pðk=kÞð Þ 1

GSTðI; kÞ� 1

!

¼ 1� Pðk=kÞð Þ 1� FintraðkÞFintraðkÞ � FglobalðkÞ

!;

eqn 2

The corresponding immigration rates (i.e. the migration

probability of Hubbell 2001) can be estimated using

mðkÞ ¼ IðkÞIðkÞþNk�1

.

SAMPLING PROPERTIES

Akey issue is to investigate inmore detail the sampling proper-

ties of IðkÞ (equation 2) so as to gain insights about its bias and

variance and to formally assess the degree of confidence in its

estimation. The word ‘sampling’ here conceptually encom-

passes two distinct sources of variation.

First, the neutral theory of Hubbell (2001) allows for ran-

domfluctuations in species abundances owing to the finite sizes

of the community (local drift) and of the metacommunity (glo-

bal drift). Influxes of new species (speciation) in the metacom-

munity and of immigrants in local communities are necessary

to avoid species fixation andmonodominance, and tomaintain

species diversity. The stochastic nature of the replacement of

dead individuals was shown to be strictly analogous to a sam-

pling process (Etienne & Alonso 2005), so that Etienne (2005)

and followers could derive exact sampling formulas for species

abundances in local community samples k (Etienne 2005,

2007; Munoz, Couteron, & Ramesh 2008; Noble et al. 2010).

The relationship between GST(k) and I(k) in equation 1 holds

for any sample taken in a given community (Fig. 1b, part

above the dashed line; see also Munoz, Couteron, & Ramesh

2008).

Second, although equation 1 is exact for any sample in com-

munity k, it only provides an estimate of I(k), denoted IðkÞ,which is marked by random fluctuations in the constitutive

similarity statistics within GST(k), because the species frequen-

cies in samples are themselves random variables. Thus, the I(k)

notation embodies the sampling variation because of ecologi-

cal drift in neutral theory (equation 1 and Fig. 1b, above

the line), while IðkÞ further includes the sampling variation

because of estimation error of similarity statistics (equation 2

and Fig. 1b, below the dashed line). The latter source of

sampling error is characterized by the ‘hat’ notation and is

the focus of the present work. We intend to gain insights into

this sampling variation by applying a multinomial model of

sample draws from the corresponding source community. This

model requires assuming, as a first approximation, that the

communities are large enough compared to the samples to

allow the drawing of samples from the communities with

replacement.

Analytical results

We investigated the sampling properties of GSTðkÞ and IðkÞ,using a delta method for calculating approximate sampling

means and variances (Davison 2003). We then provided fur-

ther analytical results for IðkÞ in the particular case of the two-

level version (2L-SINM). We used Mathematica � (Wolfram,

2003) for Taylor series expansions and investigation of the

limit cases.

SAMPLING VARIAT ION IN SIMILARITY STATIST ICS

We investigated the variation among many samples succes-

sively drawn in a given community to predict the error in esti-

mating similarities from a single sample. As mentioned above,

we assumed each community sample to be small enough com-

pared to the source community, so that the individuals making

up a sample are drawn from the source community with

replacement. Species abundances in a sample made of Nk

individuals follow a multinomial distribution with parameters



pik, the species probabilities in the reference community. We

therefore investigated the influence of multinomial variation

on the sampling properties of IðkÞ.We used binary indicator

functions (Cormen et al. 2001) of species identity to explore

the variations of the similarities, as Lande (1996) did for inves-

tigating Simpson’s diversity.

Specifically, for FintraðkÞ ¼PSi¼1

nikNk

� �nik�1Nk�1

� �, we established

(Appendix S1) that

E FintraðkÞ� �

¼ FintraðkÞ; eqn 3A

and

Var FintraðkÞ� �

� 4

NkF3ðkÞ � FintraðkÞ2� �

þO1

N2k

� �;

eqn 3B

as in Lande (1996).

Appendix S1 further shows that the two other similarity

statistics, FinterðkÞ ¼PS

i¼1Nik

Nk

� �Ni�Nik

N�Nk

� �and FglobalðkÞ ¼PS

i¼1Nik

Nk

� �Ni�1N�1� �

, are also unbiased:

E FinterðkÞ� �

¼ FinterðkÞ eqn 3C

E FglobalðkÞ� �

¼ FglobalðkÞ eqn 3D

The sampling properties of the similarity statistics here hold

irrespective of any underlyingmodel of species dynamics.

SAMPLING ERROR ON IðkÞ AND mðkÞ

From this premise, we turned to investigate the sampling error

in the parameters IðkÞ and mðkÞ from their relationships

with GSTðkÞ and the constitutive similarity statistics FintraðkÞand FglobalðkÞ, such as IðkÞ ¼ g FintraðkÞ; FglobalðkÞ

� �, with

g x; yð Þ ¼ 1� Pðk=kÞð Þ 1�xx�y (from equation 2) and mðkÞ ¼f IðkÞ� �

, with f zð Þ ¼ zzþNk�1.

We denote the derivatives g0xðxÞ ¼@gðxÞ@x , g00

x2ðxÞ ¼ @2gðxÞ

@x2,

g00xyðxÞ ¼@2gðxÞ@x@y , and so on. Using a delta method based on Tay-

lor series expansions (Davison 2003; see Appendix S2), we

could obtain approximate expected values and sampling vari-

ances of IðkÞ:

EðIðkÞÞ � gð �X; �YÞ þ CovðX;YÞg00xyð �X; �YÞ

þ 1

2VarðXÞg00x2 ð �X; �YÞ þ 1

2VarðYÞg00y2ð �X; �YÞ

eqn 4A

VarðIðkÞÞ � g0xð �X; �YÞ2VarðXÞ þ g0yð �X; �YÞ2VarðYÞþ 2CovðX;YÞg0yð �X; �YÞg0xð �X; �YÞ

eqn 4B

with X ¼ FintraðkÞ, Y ¼ FglobalðkÞ, �X ¼ E FintraðkÞ� �

and�Y ¼ E FglobalðkÞ

� �.

We assumed the sampling variance of FglobalðkÞ and the

covariance of FglobalðkÞ and FintraðkÞ to be negligible com-

pared with the variance of FintraðkÞ. The assumption is fairly

reasonable, insofar as FglobalðkÞ depends on the sampling

variation in the large overall data set (lumped samples),

which is automatically an order of magnitude smaller than

the sampling variation in each community sample k, as far

as there are many samples. Thus, under the assumptions of

a network of many small community samples, equations 4A

and 4B simplify into EðIðkÞÞ � gð �X; �YÞ þ 12 VarðXÞg00x2 ð �X; �YÞ

and VarðIðkÞÞ � g0xð �X; �YÞ2VarðXÞ, with g0xðx; yÞ ¼� 1� Pðk=kÞð Þ 1�y

x�yð Þ2 and g00x2 ðx; yÞ ¼ 2 1� Pðk=kÞð Þ 1�yx�yð Þ3.

Replacing �X and �Y with the corresponding expected values

of the similarity statistics (equations 4A and 4B) yielded the

following relationship:

EðIðkÞÞ � 1� Pðk=kÞð Þ 1� FintraðkÞFintraðkÞ � FglobalðkÞ

1þ 1� FglobalðkÞ1� FintraðkÞ

VarðFintraðkÞÞFintraðkÞ � FglobalðkÞ� �2

!

Using equation 1, this simplified into

EðIðkÞÞ� IðkÞþ IðkÞþ1�P k=kð Þð Þ VarðFintraðkÞÞFintraðkÞ�FglobalðkÞ� �2 :

Likewise,

VarðIðkÞÞ� 1�Pðk=kÞð Þ 1�FintraðkÞFintraðkÞ�FglobalðkÞ

� �2

1�FglobalðkÞ1�FintraðkÞ

� �2 VarðFintraÞFintraðkÞ�FglobalðkÞ� �2 :

By denoting TðkÞ¼ VarðFintraðkÞÞFintraðkÞ�FglobalðkÞð Þ2, we finally got

VarðIðkÞÞ � IðkÞ þ 1� P k=kð Þð Þ2TðkÞ eqn 5A

and the estimation of bias

Bias IðkÞ� �

¼ EðIðkÞÞ � IðkÞ ¼ IðkÞ þ 1� P k=kð Þð ÞTðkÞeqn 5B

The quantities IðkÞ þ 1� P k=kð Þð ÞandT(k) are central herefor characterizing the sampling error because they control both

the bias and the variance in estimation of IðkÞ. Using a least-

square loss function, the best bias-variance trade-off is

obtained byminimizing themean square error

MSE ¼ Bias2 þ Var ¼ EðIðkÞÞ � IðkÞ� �2þVarðIðkÞÞ

¼ IðkÞ þ 1� P k=kð Þð Þ2TðkÞ TðkÞ þ 1ð Þ:

Recall that Var FintraðkÞ� �

� 4Nk

F3ðkÞ � FintraðkÞ2� �

to

getTðkÞ ¼ 4Nk

F3ðkÞ�FintraðkÞ2

FintraðkÞ�FglobalðkÞð Þ2. Let us consider thatP(k ⁄k)=

(Nk ) 1) ⁄ (N ) 1) primarily depends on the fixed number of

sites sampled, Nc = N ⁄Nk, so that P k=kð Þ � 1=Nc. I(k) and

the actual similarity statistics (without ‘hats’) are independent

fromNk, so that both the estimation bias and variance decrease

as 1 ⁄Nk; hence,

EðIðkÞÞ � IðkÞ þO1

Nk

� �; biasðIðkÞÞ2 � O

1

N2k

� �

andVarðIðkÞÞ � O 1Nk

� �.



As a consequence,MSE = O(1 ⁄Nk), and the sampling vari-

ance is here more constraining for optimization. We further

investigated the sampling properties of

mðkÞ ¼ f IðkÞ� �

¼ IðkÞIðkÞ þNk � 1

;

by using the deltamethod, so that

E mðkÞð Þ � f E IðkÞ� ��

þ 1

2VarðIðkÞÞf00x E IðkÞ

� �� ¼ mðkÞ

þ 1

NkTðkÞ IðkÞ þ 1� Pðk=kÞð Þ þO

1

N3k

� �

andVar mðkÞð Þ � Var IðkÞ� �

f0x E IðkÞ� �� 2

with

f0x E IðkÞ� ��

¼ Nk � 1

IðkÞ � 1þNk þ TðkÞ 1þ IðkÞ � Pðk=kÞð Þð Þ2

¼ O1

Nk

� �:

Therefore, BiasðmðkÞÞ � O 1N2

k

� �and VarðmðkÞÞ � O 1

N3k

� �.

In this case, MSE = O(1 ⁄Nk3), and the faster decrease in esti-

mation error when community samples get large may have

been a reason for preferring investigatingm(k) in earlier appli-

cations of neutral theory (Hubbell 2001; Etienne 2005; Munoz

et al. 2007).

The sampling results here are not dependent on any assump-

tion on the nature of species assembly, as we have only analy-

sed nonlinear functions of similarity statistics. At this stage,

IðkÞ and mðkÞ can be used as heuristic indexes of community

isolation from their common background (similarly to using

diversity indexes). But one may also want to further interpret

and discuss their nature as immigration parameters in the con-

text of the 3L-SINM,where there is still no a priori expectation

for Fintra(k), F3(k) and Fglobal(k). The derivations below are

given for the particular case of the 2L-SINM, where we can

further specify the nature of the migrant pool and the species

abundance distributions therein to predict these values.

SAMPLING ERROR FOR THE TWO-LEVEL SPATIALLY

IMPLIC IT NEUTRAL MODEL (2L-S INM)

Let us now consider the well-known case of the 2L-SINM of

Hubbell (2001), which is a particular case of the 3L-SINM.We

analytically derived the expected probabilities Fn(k) in a given

community sample k as explicit functions of the parameters hand I(k) (Appendix S3). The analytical formulas for the first

values of n are

F2ðkÞ ¼ FintraðkÞ ¼ 1� hhþ 1

IðkÞIðkÞ þ 1

eqn 6A

as presented by Etienne (2005),

F3ðkÞ¼2IðkÞ2þ3IðkÞðhþ2Þþ2ðhþ1Þðhþ2Þ

IðkÞþ1ð Þ IðkÞþ2ð ÞCðhþ1ÞCðhþ3Þ ;and

eqn6B

whereC is theGamma function.

The three similarity statistics, Fintra(k), Fglobal(k) and

Finter(k), defined above, are linked through the relationship

FglobalðkÞ ¼ Pðk=kÞFintraðkÞ þ 1� Pðk=kÞð ÞFinterðkÞ(Munoz,

Couteron, & Ramesh 2008). Here, we recall that Finter(k)

is the probability that two individuals, one drawn in k and

the other in another community, are conspecific. Individu-

als drawn from distinct communities are descendants of

distinct immigrants from the common pool of migrants

(Etienne & Olff 2004), and hence, Finter(k) is the probabil-

ity that two individuals are conspecific in the pool of

migrants (Munoz, Couteron, & Ramesh 2008: equation 3).

In the context of the 2L-SINM, the pool of migrants is

directly the metacommunity (Hubbell 2001); hence,

FinterðkÞ ¼ 1� hhþ1 (Ewens 1972) and then

FglobalðkÞ ¼ Pðk=kÞFintraðkÞ þ 1� Pðk=kÞð ÞFinterðkÞ

¼ 1

hþ 1þ Pðk=kÞ h

IðkÞ þ 1ð Þ hþ 1ð Þ :

eqn 6D

Under the 2L-SINM, the expected values of Fintra(k),

F3(k) and Fglobal(k) are therefore functions of the two fun-

damental parameters I(k) and h, and we can derive

EðIðkÞÞ and VarðIðkÞÞ by calculating the exact expression

of T(k) as a function of I(k) and h:

TRADE-OFF BETWEEN SAMPLE SIZE AND NUMBER OF

SAMPLES

The above results allow for the assessment of sampling designs

and for delineating a domain of efficient parameter inference.

Let us consider, for instance, forest plots in wet evergreen trop-

ical forests, which are a classical field of application of neutral

models. Different sampling strategies are possible with sample

sizes ranging, for illustration purposes, from 0,1 ha plots

including Nk = 50 individuals on average to 1 ha plots

including 500 individuals or more (Munoz, Couteron, &

Ramesh 2008). The mean square error MSE = Bias2 + Var

can be used as an accuracy index measuring the performance

F4ðkÞ ¼6IðkÞ3Cðhþ 3Þ þ ð12IðkÞ2 þ 11IðkÞðhþ 2Þ þ 6ðhþ 1Þðhþ 2ÞÞCðhþ 4Þ

IðkÞ þ 1ð Þ IðkÞ þ 2ð Þ IðkÞ þ 3ð Þh:CðhÞ

Cðhþ 3ÞCðhþ 4Þ eqn 6C

TðkÞ ¼ 2IðkÞ 2 Nk � 1ð Þ hþ 1ð Þ hþ 2ð Þ þ IðkÞ hþ 2Nk � 2ð Þ IðkÞ þ hþ 3ð Þð ÞIðkÞ þ 2ð Þ Nk � 1ð ÞNk 1� Pðk=kÞð Þ2h hþ 2ð Þ

¼ O1

Nk

� �eqn 7



of a given sampling design. The smaller is MSE, the better is

the estimation of immigration.

In an application of the 2L-SINM of Hubbell (2001),

Fig. 2A shows contours of constant MSE when the sample

size Nk (abscissa) and the number of samples Nc (ordinate)

are varying in a case where I = 10 and h = 50, which are

fairly realistic values for tropical forests in South India

(Munoz et al. 2007; Munoz, Couteron, & Ramesh 2008).

The log10 values of MSE are shown along with the

contours. This figure shows that one cannot improve MSE

over a certain limit for a given sample size because increas-

ing the number of samples leads to an asymptote in MSE.

Specifically, for 0,1-ha forest plots of about Nk = 50,

log10(MSE) cannot go below 1,09, while for 1-ha forest

plots of about Nk = 500, it can go down to 0,016, and the

asymptotic difference in accuracy is a factor of about 12 in

absolute scale. Hence, one cannot expect good estimation

from community samples that are too small, even if there

are many samples. For illustration, consider a fixed overall

sampling effort of N = 5000 individuals, which can be

achieved for different sample sizes Nk and sample numbers

Nc such as, for instance, Nc = 50 samples of size Nk = 100

(Fig. 2A, point a), or Nc = 10 samples of size Nk = 500

(Fig. 2A, point b). In the former case, log10(MSE) = 0,8,

while in the latter case, log10(MSE) = 0,1, so that the

MSE is improved by a factor 4,6 with fewer large samples.

Furthermore, the arrow in Fig. 2A indicates that sampling

10 samples of 500 individuals is as accurate as sampling 30

samples of 440 individuals because both designs stand on

the same isoline of log10(MSE) = 0,1. In the former case,

only 5000 individuals are to be sampled, compared with

13 200 in the latter case. Therefore, it is better for estima-

tion performance to rely on fewer large samples, as long as

the assumption that the samples are smaller than the corre-

sponding communities is correct.

It is also possible to use the derivatives of MSE to investi-

gate the gain in performance when increasing sampling pres-

sure. Let us fix, for instance, a qualitative limit atdMSE

dNc¼ �0; 05, above which the gain is arbitrarily judged

no longer worth it. Fig. 2B (solid line) shows the correspond-

ing number of samples as a function of Nk. For instance, 24

samples of 50 individuals or seven samples of 500 individuals

are located at the same level of dMSEdNc

¼ �0; 05, and this

illustrates again that the advantage of getting additional

samples closely depends on sample size. One may use other

isoclines of dMSEdNc

(e.g. in Fig. 2B, isoclines at )0,01 repre-

sented by the dashed line and at )0,005 represented by the

dotted line) to reach a desired limit in estimation perfor-

mance when increasing the sample number. Information on

sampling properties is therefore useful for establishing a sam-

pling strategy prior to field work. The approach is analo-

gous, in principle, with using saturation curves and

exploratory statistics to fix the size of sampling plots for the

assessment of diversity in a local community (Gimaret-Car-

pentier et al. 1998).

Simulation application

SIMULATED NEUTRAL COMMUNITY SAMPLES

We used the two-step simulation algorithm of Munoz et al.

(2007, Appendix) in Matlab�(Mathworks, 2004) to simulate

neutral community samples complying with the 2L-SINM

of Hubbell (2001). First, we used the sequential algorithm of

Etienne (2005) to generate a pool of migrant ancestors

(migrant pool) as a very large metacommunity sample. The

0

0,1

0,2 0,3 0,4 0,5 0,6

0,7

0,8

0,9

1

1,1

100 200 300 400 500 600

510

2040 (a)

(b)

(A) Isolines of MSE N

umbe

r of s

ampl

es N

c

100 200 300 400 500

1030

5070

−0,005−0,01−0,05

dMSEdNc

=

(B) dMSE

dNc

Num

ber o

f sam

ples

Nc

Sample size Nk

Sample size Nk

Isolines of

Fig. 2. Trade-off between sample size (Nk) and number of samples

(Nc) for estimating immigration, as measured using the MSE statistic

of squared bias plus variance. In (A), isolines of constant MSE value

are shown with 0Æ1 increments on log10 scale. (a) and (b) are cases

where 5000 individuals are sampled overall using, respectively, 10

samples of 500 individuals and 50 samples of 100 individuals. The

arrow shows that 10 samples of 500 individuals (5000 individuals

overall) and 30 samples of 440 individuals (13 200 individuals overall)

are at the same level of MSE = 0Æ1. In these cases, it is less expensive

to use larger samples. (B) shows the value of Nc on ordinates such asdMSE

dNcis equal to )0,05 (solid line), )0,01 (dashed line) and )0,005

(dotted line) for a range of local sample sizes on abscissa. For these

examples, the statistics were calculated in the context of the 2L-SINM

ofHubbell (2001) with I(k) = 10 and h = 50.



pool was characterized by the biodiversity parameter h,which controls the speciation-drift equilibrium. We repeated

the procedure for two values of h, 50 and 100, which illus-

trates a range of biodiversity figures of tree communities in

semi-evergreen and evergreen tropical forests (Hubbell 2001;

Munoz et al. 2007). For each migrant pool representing a

shared biogeographical background (metacommunity), we

generated 50 local community samples k with constant local

immigration numbers I(k) (see Munoz et al. 2007 for more

details). We repeated the procedure to generate 50 indepen-

dent migrant pools as replicates, for each of which we simu-

lated 50 community samples, so as to get a total of 2500

community samples for each value of I(k). We varied I(k)

from 10 to 200 with increments of 10, and we considered

two sample sizes: Nk = 400 and 800 (wet evergreen tropical

forest plots of about 1 ha usually fall within this range). We

ensured that the migrant pools included a significantly larger

number of individuals than the derived local community

samples, namely 50Nk, and ensured that the results remained

consistent when sizes ranging from 25Nk to 1000Nk were

used.

SIMULATED VS. PREDICTED SAMPLING

CHARACTERISTICS

For given values of I(k),Nk and h, we calculated the mean and

standard deviation of IðkÞ using equations (5A), (5B) and (7)

on the basis of 50 simulated community samples associated

with amigrant pool. To avoid spurious effects because of some

rare outliers, we calculated 95% trimmed statistics by exclud-

ing 2,5% of observations from each side. We repeated the pro-

cedure for each of the fifty replicate migrant pools and for each

of the 20 values of I(k). Figure 3 shows the departures of

observed bias and variance from our theoretical predictions.

The results are presented for three combinations of parame-

ters: h = 50, Nk = 400 (a and b); h = 50, Nk = 800 (c and

d); and h = 100,Nk = 400 (e and f).

We conducted extensive simulation experiments (results not

shown) to verify that the sampling properties remained consis-

tent for varying sizes of the pool of migrants (from 25Nk to

1000Nk) and for varying numbers of community samples

(from 50 to 250 samples) associated with a migrant pool. In

spite of the approximationsmade, our analytical predictions of

0 50 100 150 200

−0·6

−0·2

0·2

0·6(a)

ΔMea

n/I (

k)ΔM

ean/

I (k)

ΔMea

n/I (

k)

0 50 100 150 200

−1·0

0·0

0·5

1·0(b)

0 50 100 150 200

−0·6

−0·2

0·2

0·6(c)

0 50 100 150 200

−1·0

0·0

0·5

1·0(d)

0 50 100 150 200

−0·6

−0·2

0·2

0·6(e)

I (k)0 50 100 150 200

−1·0

0·0

0·5

1·0(f)

ΔSTD

/I (k

)ΔS

TD/I

(k)

ΔSTD

/I (k

)

I (k)

Fig. 3. The differences divided by I(k) between simulated and analytical values of themean (DMean; a, c, e) and standard deviation (DSTD; b,d,f)

of IðkÞ, as a function of I(k) (abscissa). The results are provided for three simulated data sets that comply with the 2L-SINM of Hubbell (2001),

with parameters h = 50,Nk = 400 (a, b); h = 50,Nk = 800 (c, d) and h = 100,Nk = 400 (e, f). There are 50 replicate values for each theoreti-

cal I(k).



sampling variance and mean of the immigration parameter

were globally consistent with simulation results (Fig. 3). The

relative difference between simulated and analytical EðIðkÞÞ(i.e. DMean ⁄ I(k)) wavered around 0 over the range of I(k) val-

ues, showing good agreement for varying configurations of handNk (Fig. 3a,c,e).We still noted that the width of the scatter

increased at values of I(k) above 75–100. On the other hand,

despite a good overall fit, the sampling standard deviation was

slightly underestimated at small I(k) (ca. under 75; Fig. 3b,d,f;

DSTD ⁄ I(k)>0). We found quite similar results for mðkÞ(results not shown).

Discussion

Our central result here is that estimating immigration parame-

ters I(k) and m(k) is asymptotically unbiased for large enough

samples, and it is an important and original result as far as the

novel conditional version GST(k) of GST is concerned. The

unconditionalFST and related statistics, such asGST, have been

widely investigated and debated in population genetics, espe-

cially about getting reliable enough estimates (Rottenstreich

et al., 2007; Guillot, 2010) and subsequently getting a correct

inference of migration (Whitlock and McCauley, 1999). A

major interest of using the conditional formGST(k) to estimate

immigration parameters I(k) is that it overcomes the problem

of averaging immigration effects across all the communities,

which Whitlock and McCauley (1999) pointed out for the

unconditional FST-based approach.

BecauseGST(k) and I(k) are nonlinear functions of similarity

statistics of analytically established sampling variances, we per-

formed the statistical analysis of their sampling accuracy using

appropriate Taylor series expansions (delta method) to avoid

cumbersome formulas and computation (Appendix S2). The

delta method basically assumes that the sampling statistics are

dominated by lower-order terms of the Taylor series (variances

and covariances), and we further neglected the variance of

Fglobal(k) and the covariance of Fglobal(k) and Fintra(k) against

the variance of Fintra(k). The simulation application for com-

munity samples complying with Hubbell’s model confirmed

that the analytical results based on these approximations were

reliable enough for a range of realistic parameters (sample size

and values of I(k) as from surveys in wet evergreen tropical for-

ests). To such extent, the variance of the estimation is mostly

sensitive to the variation in local similarity, Fintra(k). This result

may be robust against departures from Hubbell’s model, and

to verify it, simulations studies are still needed, but these are

beyond the scope of this paper.

In a broad perspective, the bias and variance formulas of

equations 4 and 5 are independent from any underlying model

of community dynamics, and notably, their applicability is not

restricted to neutral models. As such, I(k) andm(k) can be used

as descriptive indexes of community isolation, complementary

to diversity statistics. We may also note that our results (equa-

tions 4 and 5) are straightforwardly applicable to theD statistic

proposed by Jost (2008) as an alternative to GST. In practice,

the variation of I(k) values over a set of sampling sites can be

compared to external environmental information (e.g. geol-

ogy, climate) as to detect possible environmental filtering of

community composition. For this, the analytical results on the

sampling variance of I(k) will become useful to design tests of

departures from a regional mean. In this regard, a perspective

will be to investigate in greater details the distribution of IðkÞto provide further insights into the confidence limits.

Furthermore, the mean square error formula (MSE) based

on sampling bias and variance offers a synthetic measure of

estimation performance, which is of practical interest for

designing efficient sampling schemes (illustration in Fig. 2). In

our application, estimation performance as measured byMSE

reaches a plateau when the size and ⁄or the number of commu-

nity samples is increased. The isolines of MSE in Fig. 2A can

be used to select the appropriate sample number and size for a

desired level of accuracy, which can provide guidance in the

design and evaluation of sampling schemes. It is quite similar

to designing sampling schemes to correctly estimate species

alpha and beta diversities in communities (Gimaret-Carpentier

et al. 1998).

A further important message for community ecologists is

that sampling issues are central to the analysis of community

composition, when stochastic processes and estimation error

are intertwined. Therefore, one should take into account both

(i) the sampling nature of the neutral theory (Etienne&Alonso

2005), and more generally of the fundamental concept of com-

position drift (a process that is likely to be pervasive even when

interacting with non-neutral processes); and (ii) the sampling

error in parameter estimation as in any inference process.

Finally, the strict dichotomy of neutral vs. non-neutral mod-

els is progressively vanishing and recent works have suggested

that the scope of neutral models should be enlarged and that

the predictions on relative species abundances are robust

(Allouche&Kadmon 2009;Noble et al. 2010). Several authors

(Hubbell 2006; Zillio & Condit 2007) argued that the assump-

tion of ecological equivalence of individuals is realistic insofar

as the trade-off between life traits in a group of trophically sim-

ilar species (a guild) may produce comparable levels of fitness.

Furthermore, the three-level spatially implicit neutral model of

Munoz, Couteron, & Ramesh (2008) illustrates how non-neu-

tral processes at regional scale can be incorporated in the tradi-

tional neutral approach. Hence, this 3L-SINM framework

may be applied to a variety of contexts that have to do with

community isolation, including anthropogenic fragmentation

in tropical forests, and it may become a useful approach for

conservation issues (Pearse &Crandall 2004).

Acknowledgements

We warmly thank the editor and an anonymous reviewer for their valuable

comments and suggestions.

References

Allouche, O. & Kadmon, R. (2009) A general framework for neutral models of

community dynamics.Ecology Letters, 12, 1287–1297.

Beeravolu, C.R., Couteron, P., Pelissier, R. & Munoz, F. (2009) Studying eco-

logical communities from a neutral standpoint: a review of models’ structure

and parameter estimation.EcologicalModelling, 220, 2603–2610.



Chave, J., Alonso, D. & Etienne, R.S. (2006) Comparing models of species

abundance.Nature, 441, E1.

Cormen, T.H., Leiserson, C.E., Rivest, R.L. & Stein, C. (2001) Indicator ran-

dom variables. Introduction to Algorithms, 2nd edn, pp. 94–99. MIT Press

andMcGraw-Hill, Cambridge.

Couteron, P. & Pelissier, R. (2004) Additive apportioning of species diversity:

towards more sophisticated models and analyses.Oikos, 107, 215–221.

Davison, A.C. (2003) Statistical Models. Cambridge University Press, Cam-

bridge, UK.

Etienne, R.S. (2005) A new sampling formula for neutral biodiversity. Ecology

Letters, 8, 253–260.

Etienne, R.S. (2007) A neutral sampling formula for multiple samples and an

‘‘exact’’ test of neutrality.Ecology Letters, 10, 608–618.

Etienne, R.S. (2009)Maximum likelihood estimation of neutral model parame-

ters for multiple samples with different degrees of dispersal limitation. Jour-

nal of Theoretical Biology, 257, 510–514.

Etienne, R.S. & Alonso, D. (2005) A dispersal-limited sampling theory for spe-

cies and alleles.Ecology Letters, 8, 1147–1156.

Etienne, R.S. & Olff, H. (2004) A novel genealogical approach to neutral biodi-

versity theory.Ecology Letters, 7, 170–175.

Etienne, R.S., Latimer, A.M., Silander. Jr, J.A. & Cowling, R.M. (2006) Com-

ment on ‘‘Neutral Ecological Theory Reveals Isolation and Rapid Specia-

tion in a BiodiversityHot Spot’’.Science, 311, 610b.

Ewens, W.J. (1972) The sampling theory of selectively neutral alleles. Theoreti-

cal Population Biology, 3, 87–112.

Gimaret-Carpentier, C., Pelissier, R., Pascal, J.-P. & Houiller, F. (1998) Sam-

pling strategies for the assessment of the tree species diversity. Journal of

Vegetation Science, 9, 161–172.

Guillot, G. (2010) Splendor and misery of indirect measures of migration and

gene flow.Heredity, 106, 11–12.

Hubbell, S.P. (2001) The Unified Neutral Theory of Biodiversity and Biogeogra-

phy. PrincetonUniversity Press, Princeton andOxford.

Hubbell, S.P. (2006) Neutral theory and the evolution of ecological equiva-

lence.Ecology, 87, 1387–1398.

Jabot, F. & Chave, J. (2009) Inferring the parameters of the neutral theory of

biodiversity using phylogenetic information and implications for tropical

forests.Ecology Letters, 12, 239–248.

Jost, L. (2008) GST and its relatives do not measure differentiation.Molecular

Ecology, 17, 4015–4026.

Lande, R. (1996) Statistics and partitioning of species diversity, and similarity

amongmultiple communities.Oikos, 76, 5–13.

Latimer, A.M., Silander Jr, J.A. & Cowling, R.M. (2005) Neutral Ecological

Theory Reveals Isolation and Rapid Speciation in a Biodiversity Hot Spot.

Science, 309, 1722–1725.

Leigh, E.G. (2007) Neutral theory: a historical perspective. Journal of Evolu-

tionary Biology, 20, 2075–2091.

Mathworks (2004)Matlab 7.0.Mathworks Inc., Natwick,MA.

McGill, B.J. (2003) A test of the unified neutral theory of biodiversity. Nature,

422, 881–885.

Munoz, F., Couteron, P. & Ramesh, B. (2008) Beta-diversity in spatially impli-

cit neutral models: a new way to assess species migration.The American Nat-

uralist, 172, 116–127.

Munoz, F., Couteron, P., Ramesh, B.R. & Etienne, R.S. (2007) Estimating

parameters of neutral communities: from one Single Large to Several Small

samples.Ecology, 88, 2482–2488.

Nee, S. & Stone, G. (2003) The end of the beginning for neutral theory. Trends

in Ecology & Evolution, 18, 433–434.

Nei,M. (1973) Analysis of Gene Diversity in Subdivided Populations. Proceed-

ings of the National Academy of Sciences of the United States of America, 70,

3321–3323.

Noble, A.E., Temme, N.M., Fagana, W.F. & Keitt, T.H. (2011) A sampling

theory for asymmetric communities, Journal of Theoretical Biology, 273, 1–

14.

Pearse, D.E. & Crandall, K.A. (2004) Beyond F-ST: analysis of population

genetic data for conservation.Conservation Genetics, 5, 585–602.

Rottenstreich, S., Miller, J.R. & Hamilton, M.B. (2007) Steady state of homo-

zygosity and Gst for the island model. Theoretical Population Biology, 72,

231–244.

Rottenstreich, S., Hamilton, M.B. & Miller, J.R. (2007) Dynamics of Fst for

the islandmodel.Theoretical Population Biology, 72, 485–503.

Simpson, E.H. (1949)Measurement of diversity.Nature, 163, 688.

Slatkin, M. (1985) Gene flow in natural populations. Annual Review of Ecology

and Systematics, 16, 393–430.

Takahata, N. & Nei, M. (1984) FST and GST statistics in the finite island

model.Genetics, 107, 501–504.

Volkov, I., Banavar, J.R., Hubbell, S.P. & Maritan, A. (2003) Neutral theory

and relative species abundance in ecology.Nature, 424, 1035–1037.

Whitlock, M.C. & McCauley, D.E. (1999) Indirect measures of gene flow and

migration: FST!=1 ⁄ (4Nm+1).Heredity, 82, 117–125.

Wolfram (2003)Mathematica 5.2. WolframResearch, Inc., Champaign, IL.

Zillio, T. & Condit, R. (2007) The impact of neutrality, niche differentiation

and species input on diversity and abundance distributions.Oikos, 116, 931–

940.

Received 20October 2010; accepted 22May 2011

Handling Editor: Robert Freckleton

Supporting Information

Additional Supporting Information may be found in the online ver-

sion of this article.

Appendix S1. Sampling properties of the similarities.

Appendix S2.Deltamethod to assess the samplingmean and variance

of twice differentiable functions.

Appendix S3. Raw moments of the local species abundance distribu-

tion (SAD) in the context of the two-level spatially implicit neutral

model (2L-SINM) ofHubbell (2001).

As a service to our authors and readers, this journal provides support-

ing information supplied by the authors. Such materials may be

re-organised for online delivery, but are not copy-edited or typeset.

Technical support issues arising from supporting information (other

thanmissing files) should be addressed to the authors.



Appendix A: Sampling properties of the similarities

In a network of community samples k (1 ≤ k ≤ Nc), such that each of them belongs to a

distinct source community, we assumed the community samples k to be multinomial draws

from the corresponding source communities, in which the species probabilities are denoted

pik, respectively. We used Bernouilli indicator variables Giz to investigate the sampling

variation in such multinomial draws. The global indicator variable, Giz, was used to represent

a set of N individuals drawn from the overall set of communities (G is for “global”). Giz = 1 if

an individual z within this draw is of species i, and Giz = 0 otherwise, so that ∑=z

izi GN and

izi z

N G= ∑∑ . Likewise, Kiz are the local indicator variables for the multinomial draws of Nk

individuals from local communities, so that ∑=z

izik KN .

If two distinct individuals z and z’ from sample k belong to the same species, then KizKiz’

= 1, otherwise KizKiz’ = 0. Because of the binary 0/1 values, Kiz = Kizn. These properties are the

same for Giz. In the subsequent analysis, we will derive expected values over an infinite

number of draws using symbol E(). In this regard, E(Kizn) = pik and E(Giz) = E(Giz

n) = pi in the

context of multinomial draws. Furthermore, E(Kiz Kiz’ …Kizn) denotes the probability of

drawing n different individuals in sample k that belong to the same species, which is, by

definition, the raw moment Fn(k) (see Appendix C).

Fintra(k) is the probability of drawing two individuals that belong to the same species

conditional on drawing them in community sample k, while Finter(k) is the probability of

drawing two conspecific individuals conditional on the first being in k and the other in another

sample, and Fglobal(k) is the probability of drawing two conspecific individuals conditional on

the first being in k and the other being in any of the samples.

The theoretical relationship between these probabilities is (Munoz et al. 2008):

( ) )()(1)()()( kFkkPkFkkPkF interintraglobal −+= (B1)

Where P(k/k) is the probability of drawing a second individual from sample k when the first

has been already drawn from k.

Exact estimators of Fintra(k), Finter(k) and Fglobal(k) are:

∑ −−

=i k

ik

k

ikintra N

NNN

kF11

)(ˆ , ∑ −−

=i k

iki

k

ikinter NN

NNNN

kF )(ˆ and ∑ −−

=i

i

k

ikglobal N

NNN

kF11

)(ˆ

where Nk and N are the number of individuals in sample k and in the overall dataset,

respectively, while Nik and Ni are the numbers of individuals belonging to species i in sample

k and in the overall dataset, respectively, such as ∑=i

iNN and ∑=i

ikk NN . These statistics

comply with equation (B1).

Using indicator variables, the probability of drawing in sample k two different

individuals, z and z’, belonging to the same species is: ∑=i

izizintra KKEkF )()( ' and

∑≠

=−ji

jzizintra KKEkF )()(1 ' . Likewise, ∑=i

izizinter KGEkF )()( ' is the probability for z taken in

k and z’ in a different sample, and ∑=i

izizglobal KGEkF )()( ' is the probability for any pair of

distinct z in k and z’ in the overall dataset. In the subsequent analysis, we will investigate the

sampling properties of the similarity statistics in detail using the properties of Giz and Kiz.

Expected value of )(ˆ kFintra

( )∑ ∑∑∑ ⎟⎠

⎞⎜⎝

⎛−

−=

−−

=i z

izz

izkki k

ik

k

ikintra KK

NNNN

NN

kF 11

111

)(ˆ is defined for a given multinomial draw

from the source community.

( ) ( ) ( ) ( ) ( )∑ ∑∑∑ ⎟⎠

⎞⎜⎝

⎛−+

−=

≠i ziz

zziziz

ziz

kkintra KEKKEKE

NNkFE

''

2

11)(ˆ

In the context of multinomial draws, we have: ( ) ( ) ( ) ikkizkz

izz

iz pNKENKEKE === ∑∑ 2

and ( ) )()1(' kFNNKKE intrakkz

iziz −=∑ for z ≠ z’.

Although there is replacement, Kiz and Kiz’ are not independent because Cov(Kiz,Kiz’)=

( ) ( ) ( ) 0)('' ≠−=− kFNKEKEKKE intrakiziziziz .

This yields: ( ) ( ) ( )( )1ˆ ( ) 1 ( ) ( )1intra k ik k k intra k ik intra

k k

E F k N p N N F k N p F kN N

= + − − =−

. (B2)

∑ −−

=i k

ik

k

ikintra N

NNN

kF11

)(ˆ is thus an unbiased estimator of Fintra(k), while the approximated

version 2

)(ˆ ∑ ⎟⎟⎠

⎞⎜⎜⎝

⎛=

i k

ikintra N

NkF is biased (see Lande 1996).

Variance of )(ˆ kFintra

( )( )

( ) 22

22

2

11

111

)(ˆintra

iikik

kkintra

i k

ik

k

ikintra FNNE

NNF

NN

NN

EkFVar −⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎠

⎞⎜⎝

⎛−

−=⎟

⎟

⎠

⎞

⎜⎜

⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛−

−−

= ∑∑

( ) ( ) ( ) ( )∑∑∑≠

−−+−=⎟⎠

⎞⎜⎝

⎛−

jijkjkikik

iikik

ikiki NNNNNNNN 1111 22

2

( ) ( ) ( )∑∑∑≠

+−++−=⎟⎠

⎞⎜⎝

⎛−

jijkikjkikjkik

iikikik

ikiki NNNNNNNNNNN 222234

2

221

In the following calculations, we will use the indicator variables Kiz linked to species

abundances through ∑=z

izik KN .

• Terms that sum over species i:

(i) ∑ ∑∑∑ ⎟⎠

⎞⎜⎝

⎛ +=≠i zz

izizz

izi

ik KKKN'

'22 and thus )()1( int

2 kFNNNNE rakkki

ik −+=⎟⎠

⎞⎜⎝

⎛∑ )

(ii) ∑ ∑∑∑∑ ∑∑ ⎟⎠

⎞⎜⎝

⎛++=⎟

⎠

⎞⎜⎝

⎛=

≠≠≠i zzziziziz

zziziz

ziz

i ziz

iik KKKKKKKN

''''''

'

2'

33

3 3

Hence, )()2)(1()()1(3 33 kFNNNkFNNNNE kkkintrakkk

iik −−+−+=⎟⎠

⎞⎜⎝

⎛∑

(iii)

∑ ∑∑∑∑∑∑ ⎟⎠

⎞⎜⎝

⎛ ++++=≠≠≠≠≠≠≠i zzzz

izizizizzzz

izizizzz

izizzz

izizz

izi

ik KKKKKKKKKKKKN'''"'

'''''''''

2'''

'

3'

'

2'

244 643

Hence,

( ) ( )( ) ( )( )( ) )(321)(216)(17 434 kFNNNNkFNNNkFNNNNE kkkkkkkintrakkk

iik −−−+−−+−+=⎟⎠

⎞⎜⎝

⎛∑

• Terms that sum over different species i and j:

(i) ∑ ∑∑ ∑∑∑ ∑∑∑≠ ≠≠ ≠≠≠

⎟⎠

⎞⎜⎝

⎛=⎟⎠

⎞⎜⎝

⎛ +=⎟⎠

⎞⎜⎝

⎛⎟⎠

⎞⎜⎝

⎛=ji zz

jzizji zz

jzizz

jzizji z

jzz

izji

jkik KKKKKKKKNN'

''

' , and

hence )1)(1( intrakkji

jkik FNNNNE −−=⎟⎟⎠

⎞⎜⎜⎝

⎛∑≠

(ii) ∑ ∑∑∑ ∑∑∑≠ ≠≠≠≠≠

⎟⎠

⎞⎜⎝

⎛+=⎟

⎠

⎞⎜⎝

⎛⎟⎠

⎞⎜⎝

⎛=

ji zzzjzjziz

zzjziz

ji zjz

ziz

jijkik KKKKKKKNN

''''''

'

2'

22

)( ''' jzjziz KKKE is the probability of drawing three distinct individuals that belong to two

species. F3(k) is the probability of drawing three individuals that belong to the same species

and Fintra(k) is the probability of drawing two individuals that belong to the same species, so

that )()()( 3''' kFkFKKKE intrajzjziz −= .

Hence, ( ) ( ))()()2)(1()(1)1( 32 kFkFNNNkFNNNNE intrakkkintrakk

jijkik −−−+−−=⎟⎟⎠

⎞⎜⎜⎝

⎛∑≠

(iii) ∑ ∑∑∑≠≠

⎟⎠

⎞⎜⎝

⎛⎟⎠

⎞⎜⎝

⎛=

ji zjz

ziz

jijkik KKNN

2222

∑ ∑∑∑∑∑≠ ≠≠≠≠≠≠≠≠≠

⎟⎠

⎞⎜⎝

⎛ +++=ji zzzz

jzjzizizzzz

izjzjzzzz

jzizizzz

jzizji

jkik KKKKKKKKKKKKNN''''''

''''''"'

2'''

"'

2"'

'

2'

222

)( '''''' jzjziziz KKKKE is the probability of drawing four distinct individuals falling into two

conspecific pairs of different species. F4(k) is the probability of drawing four individuals that

belong to the same species and (Fintra(k))² is the probability of drawing two conspecific pairs,

and hence ( ) )()()( 42

'''''' kFkFKKKKE intrajzjziziz −= .

This yields: ( ) ( )( ) ⎟⎟

⎠

⎞⎜⎜⎝

⎛

−−−−+

−−−+−−=⎟⎟

⎠

⎞⎜⎜⎝

⎛∑≠ )()()3)(2)(1(

)()()2)(1(2)(1)1(

42

intra

3intraintra22

kFkFNNNN

kFkFNNNkFNNNNE

kkkk

kkkkk

jijkik

In summary, the exact variance of )(ˆ kFintra simplifies to:

( ) ( ) ( ) ( )( )23

1ˆ ( ) 2 ( ) 6 4 ( ) 4 2 ( )1intra intra k intra k

k k

Var F k F k N F k N F kN N

= + − + −−

(B3)

The first order series expansion according to 1/Nk reads:

( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛+−≈ 2

2int3

1)()(4)(ˆk

rak

intra NOkFkF

NkFVar .

This first order expansion also holds for the approximate statistic 2

)(ˆ ∑ ⎟⎟⎠

⎞⎜⎜⎝

⎛=

i k

ikintra N

NkF (see

Lande 1996).

Expected value of )(ˆ kFglobal and )(int kF er

( )1 1ˆ ( ) 11 1

ik iglobal iz iz

i i z zk k

N NF k K GN N N N

− ⎛ ⎞= = −⎜ ⎟− − ⎝ ⎠∑ ∑∑ ∑

The indicator function Giz here enlarges the sampling context to the overall dataset.

( ) ( ) ( ) ( ) ( )''

1ˆ ( )1global iz iz iz iz iz

i z z z zk

E F k E K G E K G E KN N ≠

⎛ ⎞= + −⎜ ⎟− ⎝ ⎠

∑ ∑ ∑ ∑

( ) ( ) ( )''

1ˆ ( )1global iz iz

i z zk

E F k E K GN N ≠

=− ∑∑

When summing over z ≠ z’, z being an individual in plot k, we need to distinguish two

cases, namely that z’ can be either an individual of plot k or of another sample, hence:

( )''

( 1) ( ) ( ) ( )iz iz k k intra k k interi z z

E K G N N F k N N N F k≠

= − + −∑∑ .

This yields: ( ) ( 1) ( ) ( ) ( )ˆ ( ) ( )1

k intra k interglobal global

N F k N N F kE F k F kN

− + −= =

−.

Thus, )(ˆ kFglobal is an unbiased estimator of Fglobal(k).

On the other hand, ( )

1ˆ ( ) ik i ikinter iz iz iz

i i z z zk k k k

N N NF k K G KN N N N N N

− ⎛ ⎞= = −⎜ ⎟− − ⎝ ⎠∑ ∑∑ ∑ ∑

( ) ( ) ( ) ( ) ( ) ( )2' '

' '

1ˆ ( )inter iz iz iz iz iz iz izi z z z z z zk k

E F k E K G E K G E K E K KN N N ≠ ≠

⎛ ⎞= + − −⎜ ⎟− ⎝ ⎠

∑ ∑ ∑ ∑ ∑

( ) ( ) ( ) ( ) ( ) ( )int ' ' '' ' '

, '

1 1ˆ ( )er iz iz iz iz iz izi z z z z i z zk k k k

z k z k

E F k E K G E K K E K GN N N N N N≠ ≠ ≠

∈ ∉

⎛ ⎞⎛ ⎞ ⎜ ⎟= − =⎜ ⎟ ⎜ ⎟− −⎝ ⎠ ⎜ ⎟

⎝ ⎠∑ ∑ ∑ ∑ ∑ , which

simplifies to ( ) )()(int kFkFE interer = .

Hence, )(int kF er is conversely an unbiased estimator of Finter(k).

Appendix B: Delta method to assess the sampling mean and variance of twice

differentiable functions

The one-variable case

Let )(XfZ = be a function of the random variable X, of which expectation X and

variance Var(X) are known. The derivative functions of f are:

xxfxf x ∂

∂=

)()(' , 2

2'' )()(2 x

xfxfx ∂

∂= ,

yxxfxf xy ∂∂

∂=

)()(2

'' ,yxxfxf yx ∂∂

∂= 2

3)3( )()(2 , and so on.

Assuming that X takes values x around X , the second order Taylor series expansion of

f(X) in the neighborhood of X is: ( ) ( ) )(21)()( "2' XfXxXfXxXfZ xx −+−+≈ .

This implies:

)()(21)( " XfxVarXfZ x+≈ (C1)

and ( ) ( )( ))()(21)( 2"' XVarXxXfXfXxZZ xx −−+−≈− .

Finally, ( ) ( ) 2'22 )(XfXxZZ x−≈− at the second order around X and hence

2' )()()( XfXVarYVar x≈ . (C2)

The two-variable case

Now let Z = g(X, Y) be a differentiable function of two random variables X and Y. Its

second order series expansion around ( X ,Y ) is:

( ) ( ) ( )( )( ) ( )2 2

' ' "

2 2" "

( , ) ( , ) ( , ) ( , )

1 1( , ) ( , )2 2

x y xy

x y

Z g X Y x X g X Y y Y g X Y x X y Y g X Y

x X g X Y y Y g X Y

≈ + − + − + − −

+ − + −

and thereby

),()(21),()(

21),(),(),( """

22 YXgYVarYXgXVarYXgYXCovYXgZyxxy +++≈ (C3)

Furthermore,

( ) ( ) ( )( ) ( ) ( )

⎟⎟⎟⎟

⎠

⎞

⎜⎜⎜⎜

⎝

⎛

−−−

−+−+−−+−+−≈−

),()(21),()(

21),(),(

),(21),(

21),(),(),(

"""

"2"2"''

2

22

YXgYVarYXgXVarYXgYXCov

YXgYyYXgXxYXgYyXxYXgYyYXgXxZZ

yxxy

yxxyyx

é

( ) ( ) ( ) ( )( ) ),(),(2),(),( ''2'22'22 YXgYXgYyXxYXgYyYXgXxZZ yxyx −−+−+−≈−

and finally,

( ) ),()(),(),(2),()(),()()( ''2'2' YXgYEyYXgYXCovYXgYVarYXgXVarZVar xyyx −++≈ (C4)

Appendix C: Raw moments of the local species abundance distribution (SAD) in the

context of the two-level spatially implicit neutral model (2L-SINM) of Hubbell (2001)

The raw moment Fn(k) associated with the species abundance distribution (SAD) in a

local community sample k is the probability of drawing n individuals that belong to the same

species in k, and as such it can be written down as: 1

( ) (1 ) ( )n

n meta commj

F k p j p j n=

= ∑

where )( njpcomm is the probability of drawing n individuals in sample k that belong to j

different lineages, that is, which are the descendants of j independent migrant ancestors.

)1( jpmeta is the probability that the j independent migrant ancestors belong to the same

species.

I(k) is the immigration number of community k, that is, the number of immigrants

available for replacement in the community when an individual dies. We consider the

coalescent-type reasoning of Etienne and Olff (2004) and Etienne (2005), that is, when we

draw a new individual in a local community k, increasing the sample size from n to n+1, the

probability that the (n+1)th individual is the descendant of a lineage not represented in the

initial sample is nkI

kI+)()( , while the probability that the (n+1)th individual is the descendant

of a lineage already represented is conversely nkI

n+)(

.

Hence,

nkIkInjp

nkInnjpnjp commcommcomm +

−++

=+)(

)()1()(

)()1(

According to Ewens (1972), ( )s

j

jsmeta

ljsp θ

θ,)( = in Hubbell’s metacommunity undergoing

speciation-drift equilibrium, and, likewise, ( )j

n

njcomm kI

kIl

njp )()(

)( ,= in the local

community, where lj,n is the unsigned Stirling number of the first kind and (.)j is the

Pochhammer symbol.

In particular, ( ) ( ) jj

jmeta

jljp

θθθ

θ)!1()1( ,1 −

== and hence

( ) ( )∑∑==

−==

n

j j

j

n

njmeta

n

jcommn

jkIkI

ljpnjpkF

1

,

1

)!1()()(

)1(.)()(θ

θ (A1)

The first raw moments of the species abundance distribution (SAD) then read:

( ) ( ) ( ) ( ) ⎟⎠⎞

⎜⎝⎛

+−

+−=

+++

+=+=

111

1)()(1

11

1)()(

1)(1

)()()(

2

2

2

2,2

12

2,12 θθθ

θθθ

kIkI

kIkI

kII

kIl

IkI

lkF ,

hence1)(

)(1

1)(2 ++−=

kIkIkF

θθ , as provided by Etienne (2005).

For higher orders, we likewise get ( )( ) )3()1(

2)(1)()2)(1(2)2)((3)(2)(

2

3 +Γ+Γ

+++++++

=θθθθθ

kIkIkIkIkF and

( )( )( ) )4()3()(.

3)(2)(1)()4())2)(1(6)2)((11)(12()3()(6)(

23

4 +Γ+ΓΓ

++++Γ+++++++Γ

=θθθθθθθθθ

kIkIkIkIkIkIkF .

Estimating immigration in neutral communities: theoretical and practical insights into the sampling...

Documents

Transcript of Estimating immigration in neutral communities: theoretical and practical insights into the sampling...