Adaptive cluster sampling with a data driven stopping rule

Stat Methods Appl (2011) 20:1–21
DOI 10.1007/s10260-010-0149-5

Stefano A. Gattone · Tonio Di Battista

Published online: 1 October 2010
© Springer-Verlag 2010

Abstract Adaptive cluster sampling (ACS) is a suitable sampling design for rare and clustered populations. In environmental and ecological applications, biological populations are generally animals or plants with a highly patchy spatial distribution. However, ACS is a less efficient design when the study population is not rare and has low aggregation, since the final sample size can easily get out of control. In this paper, a new variant of ACS is proposed in order to improve the performance (in terms of precision and cost) of ACS versus simple random sampling (SRS). The idea is to detect the optimal sample size by means of a data-driven stopping rule that determines when to stop the adaptive procedure. Introducing a stopping rule violates the theoretical basis of ACS, so the behaviour of the ordinary estimators used in ACS is explored by Monte Carlo simulation. Results show that the proposed variant of ACS makes it possible to control the effective sample size and to prevent the excessive efficiency loss typical of ACS when the population is less clustered than anticipated. The proposed strategy may be recommended especially when no prior information about the population structure is available, as it does not require prior knowledge of the degree of rarity and clustering of the population of interest.

Keywords Adaptive cluster sampling · Monte Carlo simulation · Stopping rule · Efficiency

S. A. Gattone (B)
Department SEFeMeQ, University of Tor Vergata, Rome, Italy
e-mail: [email protected]

T. Di Battista
Department DMQTE, University G. d’Annunzio of Chieti-Pescara, Chieti-Pescara, Italy
e-mail: [email protected]


1 Introduction

Adaptive cluster sampling (ACS) (Thompson 1990; Thompson and Seber 1996) was introduced as a sampling design for estimating parameters of rare and clustered biological populations (Smith et al. 2004). For a detailed review of the major developments and issues in ACS, see Turk and Barkowski (2005).

Adaptive designs allocate sampling effort to local areas where the observed values of the selected units satisfy a condition of interest. The selection of an adaptive cluster sample therefore depends on the population values observed in the field. Consider a study region partitioned into $N$ spatial units labelled $\{1, 2, \ldots, N\}$ and let $y_i$ denote the $y$-value associated with the $i$-th unit. We consider the population to be fixed rather than a realization of a random variable. Starting with the selection of an initial set of $n$ units, the neighbourhood of each unit is added to the sample whenever the unit satisfies a previously fixed condition $C$ in one-dimensional real space, usually of the form $C = \{y_i > c\}$. If any of these additional units satisfies the condition, its neighbourhood is added to the sample as well. A network is defined as a collection of units with the property that the selection of any unit within the network leads to the inclusion in the sample of every other unit in the network. If the initial unit does not satisfy $C$, no further units are added to the sample and the initial unit is defined to be a network of size one. The adaptively sampled units which do not satisfy $C$ are called edge units. A network plus its edge units forms a cluster. The distinct networks form a partition of the $N$ units, $\{A_1, A_2, \ldots, A_K\}$ with $K \le N$.

Finally, performing an ACS design requires

− the selection of an initial sample by using a probability sampling procedure
− the definition of the condition C for additional sampling
− the definition of the concept of neighbourhood.
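The three ingredients listed above can be illustrated with a small sketch of how one initial unit grows into a network plus edge units. This is only an illustration under assumptions not stated in the paper: the study region is a rectangular grid, the neighbourhood is the four adjacent cells, and the function names `neighbours` and `adaptive_cluster` are hypothetical.

```python
from collections import deque


def neighbours(cell, n_rows, n_cols):
    """Four adjacent cells (assumed neighbourhood definition) inside the grid."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < n_rows and 0 <= j < n_cols]


def adaptive_cluster(y, start, c=0):
    """Return (network, edge_units) reached from `start` under C = {y_i > c}."""
    n_rows, n_cols = len(y), len(y[0])
    if y[start[0]][start[1]] <= c:
        # The initial unit does not satisfy C: network of size one, no edge units.
        return {start}, set()
    network, edge = {start}, set()
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        for nb in neighbours(cell, n_rows, n_cols):
            if nb in network or nb in edge:
                continue
            if y[nb[0]][nb[1]] > c:
                network.add(nb)      # satisfies C: its neighbourhood is sampled too
                queue.append(nb)
            else:
                edge.add(nb)         # sampled but does not satisfy C: edge unit
    return network, edge


grid = [
    [0, 0, 0, 0],
    [0, 3, 2, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 0],
]
network, edge = adaptive_cluster(grid, (1, 1))
```

In this toy grid the three occupied cells form one network, and every empty cell adjacent to them is an edge unit; the network together with its edge units is the cluster.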

One practical concern in real-life applications of ACS is the uncertainty of the final sample size. If the networks are very large, the final sampling effort and the total cost of the survey may run out of control. Indeed, field investigators may face poor performance of the ACS design in real surveys (Goldberg et al. 2007; Magnussen et al. 2005), as its efficiency depends critically on the population structure. If no prior information about the rarity and the aggregation of the population is available, it may be difficult to design an efficient adaptive cluster sample (Smith et al. 1995; Brown 2003; Smith et al. 2003).

Several approaches have been proposed to improve the performance (in terms of precision and cost) of existing ACS methods versus simple random sampling, but almost all need a priori information on the structure of the population. Thompson and Seber (1996, Chap. 5) and Thompson (1996) proposed the use of stratification, order statistics and partitioning into blocks in order to limit sampling. A two-stage ACS was proposed by Salehi and Seber (1997a), where information from a pilot study could be used to set the size and the number of primary units. An adjusted two-stage ACS was proposed by Muttlak and Khan (2002), where large networks, identified by using a rapid assessment auxiliary variable, are subsampled and small networks are completely enumerated. Inverse ACS was proposed by Christman and Lan (2001), where the initial sample is taken by general inverse sampling. In Restricted ACS (Brown and Manly 1998) the sampling effort is fixed in advance and, once it is reached, sampling is stopped at the end of the current adaptive step. Lo et al. (1997) and Su and Quinn (2003) proposed a stopping rule defining a stopping level $S$, where the sampling process is terminated at the $S$-th step of the neighbouring unit search. The stopping rule is fixed in advance, prior to sampling. Thus, the best choice of the stopping level is not known unless some prior knowledge about the population is available. Furthermore, the stopping rule proposed by Su and Quinn (2003) causes biases in the ordinary Hansen–Hurwitz (HH) and Horvitz–Thompson (HT) estimators. In this regard, Salehi and Seber (2002) have shown that Murty’s estimator is an unbiased estimator when used with the restricted variants of ACS proposed by Brown and Manly (1998) and by Christman and Lan (2001).

This paper fits within this framework, with the aim of finding a variant of ordinary ACS that is easy to implement and that enlarges the circumstances under which the ACS strategy can be efficient in terms of precision and cost. Our idea is to modify the sampling effort by means of a data-driven procedure. Instead of having a criterion fixed prior to the survey, we propose a stopping rule that changes at each step of the aggregative procedure. In the spirit of the adaptive procedure, we want the data to tell us when to stop sampling, in a sort of sequential ACS. Data-driven stopping rules have been successfully applied in other areas of research such as time series analysis and generalized linear models (Zacks 2009).

In Sect. 2 the proposed variant of ACS is illustrated. Since the theoretical properties of the proposed method cannot be stated analytically, in Sect. 3 a simulation study is conducted on a wide range of artificial populations in order to investigate empirically the performance of the proposed stopping rule. Results are given in Sect. 4 and a discussion in Sect. 5.

2 The method

In order to overcome the problem related to the uncertainty of the final sampling fraction, in this section we propose a data-driven procedure which stops the adaptive selection of units whenever a given stopping rule is verified.

Let $s_1 = \{1, 2, \ldots, i, \ldots, n\}$ denote an initial simple random sample of size $n$. For each unit $i \in s_1$ we can view the adaptive sampling procedure as a sequence of steps. At step $l = 0$ we have just the initial unit $i$. If unit $i$ satisfies a given condition $C$, the adaptive sampling phase is carried out in the neighbourhood of $i$, so that at the next step $l = 1$ we have the initial unit $i$ plus its neighbourhood. Similarly, at step $l = 2$ the neighbourhood of the units satisfying the condition $C$ at step $l = 1$ is added, and so on until no newly added unit satisfies $C$. Let

$$AS_i^1 = \{ i, i_1^1, i_2^1, \ldots, i_{k_{1i}}^1 \}$$

be the set of indexes labelling the units sampled after the first step ($l = 1$) of the aggregative procedure started from the $i$-th sampled unit. Accordingly, at step $l = 2$ we have

$$AS_i^2 = \{ i, i_1^1, i_2^1, \ldots, i_{k_{1i}}^1, i_1^2, i_2^2, \ldots, i_{k_{2i}}^2 \}.$$

In general, the set of units adaptively sampled after the $l$-th step is labelled by

$$AS_i^l = \{ i, i_1^1, i_2^1, \ldots, i_{k_{1i}}^1, i_1^2, i_2^2, \ldots, i_{k_{2i}}^2, \ldots, i_1^l, i_2^l, \ldots, i_{k_{li}}^l \}.$$

For each step $l$ we may define the average of the units aggregated from the $i$-th initial unit as

$$w_i^{(l)} = \frac{\sum_{j \in AS_i^l} y_j}{m_i^{(l)}} \tag{1}$$

where $m_i^{(l)}$ is the cardinality of $AS_i^l$, while

$$s_i^{2(l)} = \frac{\sum_{j \in AS_i^l} \left( y_j - w_i^{(l)} \right)^2}{m_i^{(l)}} \tag{2}$$

represents the within-network variance at the $l$-th step for the $i$-th initial unit. The stopping rule proposed in this paper follows from the analysis of the condition under which ACS is more efficient than SRS when the Hansen–Hurwitz estimator is adopted (Thompson 1990):

$$\sigma^2_{WN} \left( \frac{1 - \frac{n}{N}}{1 - \frac{n}{E(v)}} \right) > \sigma^2 \tag{3}$$

where $\sigma^2$ is the population variance, $E(v)$ is the expected final sample size of the adaptive process and

$$\sigma^2_{WN} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{m_k} \sum_{j \in A_k} (y_j - w_k)^2$$

is the within-network variance, where $w_k$ is the average of the observations in the $k$-th network. Formally, ACS will be more efficient than SRS when the left-hand side of (3) is larger than $\sigma^2$. It may happen that none of the initial units in the sample satisfies the condition for extra sampling. In such a case $v = n$ and the left-hand side of Eq. (3) goes to infinity. Rocco (2003) proposed a variant of ACS, named Constrained Inverse ACS, in order to ensure that $v > n$.
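To make the role of each quantity in condition (3) concrete, the left-hand side can be evaluated numerically. The following is a minimal sketch; the function name `acs_efficiency_bound` and the numbers in the example are hypothetical and are not taken from the paper.

```python
import math


def acs_efficiency_bound(sigma2_wn, n, N, expected_v):
    """Left-hand side of condition (3).

    ACS with the HH estimator is more efficient than SRS when this value
    exceeds the population variance sigma^2.
    """
    if expected_v == n:
        # No initial unit triggers extra sampling: the bound diverges.
        return math.inf
    return sigma2_wn * (1 - n / N) / (1 - n / expected_v)


# Hypothetical values: within-network variance 2.0, n = 15 initial units,
# N = 900 plots, expected final sample size 60.
bound = acs_efficiency_bound(2.0, n=15, N=900, expected_v=60)
```

The bound grows as the expected final sample size approaches $n$ and shrinks as the networks (and hence $E(v)$) become large, which is the trade-off the stopping rule is built on.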

In designing an efficient adaptive cluster sample one has to consider the relation between $\sigma^2$ and $\sigma^2_{WN}$ and between $n$ and $v$. The final sample size can be reduced by increasing the stopping value $c$ in the condition $C$. On the other hand, if the networks are very small, the disadvantage of having a relatively small within-network variance $\sigma^2_{WN}$ compared with $\sigma^2$ will become more important than the advantage of having the final sampling fraction close to the initial sampling fraction. Therefore, we have to cope with a trade-off between $\sigma^2_{WN}$ and $v$. Of course, it is a characteristic of ACS that both $v$ and $\sigma^2_{WN}$ are not known beforehand. Before sampling takes place one does not know the best choice for the stopping value $c$. If the critical value is set too low, the final sample size will be excessively large, as almost all the units will satisfy the condition.


Relative efficiency of ACS compared with SRS is a result of how networks are defined by the sampling design. We can define condition (3) for each step $l$ of the adaptive process as follows:

$$\left\{ \sigma^{2(l)}_{WN} \left( \frac{1 - \frac{n}{N}}{1 - \frac{n}{E(v^{(l)})}} \right) > \sigma^2 \right\} \tag{4}$$

where $\sigma^{2(l)}_{WN}$ and $E(v^{(l)})$ are the within-network variance and the expected final sample size of the ACS when the sampling process is completed after $l$ steps.

Thompson (1990) states the reasons why expression (3) ensures that the adaptive strategy will have lower variance than the sample mean of an SRS of size $v$. Let $D^*$ denote the set of distinct units included in the sample together with the number of times each unit is included in the sample, and let $\mu_{HH}$ and $\bar{y}$ denote the modified Hansen–Hurwitz mean estimator and the sample mean estimator on the initial sample, respectively. Thompson (1990) shows that the result holds because $\mathrm{var}(\mu_{HH}) = \mathrm{var}(\bar{y}) - E[\mathrm{var}(\bar{y} \mid D^*)]$, since $\mu_{HH} = E(\bar{y} \mid D^*)$. Thus the variance of $\mu_{HH}$ will always be less than or equal to the variance of $\bar{y}$. Similarly, at each step $l$, condition (4) ensures efficiency of ACS if more than one initial selection of $n$ units leads to the same $D^{*(l)}$, i.e. $E[\mathrm{var}(\bar{y} \mid D^{*(l)})] > 0$, where $D^{*(l)}$ denotes the set of units sampled at the $l$-th step. If this is not the case, then $E[\bar{y} \mid D^{*(l)}] = \bar{y}$ and $E[\mathrm{var}(\bar{y} \mid D^{*(l)})] = 0$.

Since $\sigma^2$ is unknown before sampling takes place, Eq. (4) cannot be solved, but we can use its left-hand side to compare the relative efficiency at the various steps of the aggregative procedure. From (4) it follows that an ACS strategy which stops at the $l$-th step of the aggregative procedure will be more efficient than an ACS strategy of $(l - 1)$ steps if the following event is verified:

$$\left\{ \frac{\sigma^{2(l)}_{WN}}{\sigma^{2(l-1)}_{WN}} \cdot \frac{1 - \frac{n}{v^{(l-1)}}}{1 - \frac{n}{v^{(l)}}} > 1 \right\}. \tag{5}$$
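The step from (4) to (5) can be written out explicitly. As a sketch, let $L^{(l)}$ (a label introduced here only for illustration) be the left-hand side of (4) with the expected size $E(v^{(l)})$ replaced by $v^{(l)}$, and assume $\sigma^{2(l-1)}_{WN} > 0$ and $v^{(l)} > v^{(l-1)} > n$, so that every factor is positive:

```latex
L^{(l)} = \sigma^{2(l)}_{WN}\,\frac{1 - n/N}{1 - n/v^{(l)}},
\qquad
\frac{L^{(l)}}{L^{(l-1)}}
  = \frac{\sigma^{2(l)}_{WN}}{\sigma^{2(l-1)}_{WN}}
    \cdot \frac{1 - n/v^{(l-1)}}{1 - n/v^{(l)}} .
```

The factor $1 - n/N$ is common to both steps and cancels, so requiring $L^{(l)} > L^{(l-1)}$ is the same as requiring the ratio above to exceed 1. Since $v^{(l)} > v^{(l-1)}$ makes the second factor smaller than 1, the within-network variance must grow by more than enough to offset the larger sample.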

It is apparent that efficiency depends on the increase of the within-network variance relative to the number of sampled units when moving to the next step of the aggregative procedure. Our idea consists in applying condition (5) for each initial unit $i$ and at each step $l$ of the adaptive process. To this end, a straightforward estimate of $v^{(l)}$ is given by $\sum_{i=1}^{n} m_i^{(l)}$, while an estimate of the within-network variance $\sigma^{2(l)}_{WN}$ is given by

$$\frac{1}{n} \sum_{i=1}^{n} s_i^{2(l)}. \tag{6}$$

In our proposed variant, the step $l = 1$ of the aggregative procedure is not modified with respect to the ordinary ACS. Starting from $l = 2$, for each initial unit $i \in s_1$ we propose to sample units in the network that contains unit $i$ only if the following criterion is satisfied:

$$S_i^l = \left\{ \frac{s_i^{2(l)}}{s_i^{2(l-1)}} \cdot \frac{1 - \frac{1}{m_i^{(l-1)}}}{1 - \frac{1}{m_i^{(l)}}} > 1 \right\}. \tag{7}$$

$S_i^l$ accounts for the trade-off between $\sigma^2_{WN}$ and $v$. Indeed, at higher steps the number of neighbouring units increases, so we ask for more extra information (a higher within-network variance $s_i^{2(l)}$) in order to carry on sampling. Instead of having a criterion fixed prior to the survey, the stopping rule $S_i^l$ changes at each step of the aggregative procedure and for each unit $i$ in the initial sample $s_1$. Hence, the units meeting the criterion vary from sample to sample.

For each unit in the initial sample $s_1$ we implement the adaptive procedure with the proposed stopping rule; whenever condition $S_i^l$ is not satisfied we stop sampling, and the $i$-th network, say $AS_i$, is truncated at the $(l - 1)$-th step. The units aggregated at the $l$-th step are considered edge units. Comments on the treatment of the edge units are given in the discussion section. Finally, we have $n$ networks, say $AS_1, AS_2, \ldots, AS_n$, to which the modified HH and HT estimators are applied. A flow chart of ACS with our stopping rule is displayed in Fig. 1.
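The procedure just described can be sketched in code for a single initial unit. The sketch below is an illustration under stated assumptions, not the authors' implementation or a transcription of Fig. 1: the population is a rectangular grid `y`, the neighbourhood is the four adjacent cells, $AS_i^l$ is taken to contain every unit observed up to step $l$, and criterion (7) is cross-multiplied so that a zero variance at step $l - 1$ does not cause a division by zero. All function names are hypothetical.

```python
def mean_and_var(values):
    """Equations (1) and (2): mean and variance with divisor m."""
    m = len(values)
    w = sum(values) / m
    return w, sum((v - w) ** 2 for v in values) / m


def keep_sampling(s2_l, s2_prev, m_l, m_prev):
    """Criterion (7), cross-multiplied (assumption) to avoid dividing by zero."""
    return s2_l * (1 - 1 / m_prev) > s2_prev * (1 - 1 / m_l)


def acs_with_stopping_rule(y, start, c=0):
    """Return (sampled, edge, steps) for one initial unit on the grid y."""
    n_rows, n_cols = len(y), len(y[0])

    def value(cell):
        return y[cell[0]][cell[1]]

    def neighbours(cell):
        r, k = cell
        cand = [(r - 1, k), (r + 1, k), (r, k - 1), (r, k + 1)]
        return [(i, j) for i, j in cand if 0 <= i < n_rows and 0 <= j < n_cols]

    sampled = [start]                                  # units at step l = 0
    frontier = [start] if value(start) > c else []
    step = 0
    while frontier:
        new_units = []
        for cell in frontier:
            for nb in neighbours(cell):
                if nb not in sampled and nb not in new_units:
                    new_units.append(nb)
        if not new_units:
            break
        candidate = sampled + new_units                # units after step l
        if step + 1 >= 2:                              # rule applies from l = 2
            _, s2_prev = mean_and_var([value(u) for u in sampled])
            _, s2_l = mean_and_var([value(u) for u in candidate])
            if not keep_sampling(s2_l, s2_prev, len(candidate), len(sampled)):
                # Truncate at step l - 1; step-l units are treated as edge units.
                return sampled, new_units, step
        sampled = candidate
        step += 1
        frontier = [u for u in new_units if value(u) > c]
    return sampled, [], step


row = [[0, 4, 4, 4, 4, 0]]
sampled, edge, steps = acs_with_stopping_rule(row, (0, 1))
```

In this one-row example every occupied plot has the same value, so the within-network variance does not increase when a further occupied plot is added at step 2; the rule therefore truncates the network after the first step, whereas ordinary ACS would have gone on to sample the whole run of occupied plots.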

The modified HH estimator for the mean is

$$\mu_{HH} = \frac{1}{n} \sum_{i=1}^{n} w_i \tag{8}$$

where $w_i = \frac{1}{m_i} \sum_{j \in AS_i} y_j$ is the mean of the observations in $AS_i$. The variance of the HH estimator can be unbiasedly estimated by (Thompson and Seber 1996)

$$\widehat{\mathrm{var}}(\mu_{HH}) = \frac{N - n}{N n (n - 1)} \sum_{i=1}^{n} (w_i - \mu_{HH})^2. \tag{9}$$
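As a small worked illustration of the HH mean and its variance estimate, the sketch below computes both from a list of network means $w_i$. The function names and the numbers are hypothetical; the variance uses the divisor $N n (n - 1)$ of the standard unbiased estimator, which is an assumption of this sketch.

```python
def hh_mean(network_means):
    """Modified Hansen-Hurwitz estimate: average of the n network means w_i."""
    return sum(network_means) / len(network_means)


def hh_variance(network_means, N):
    """Variance estimate of the HH mean, with finite population correction."""
    n = len(network_means)
    mu = hh_mean(network_means)
    ss = sum((w - mu) ** 2 for w in network_means)
    return (N - n) / (N * n * (n - 1)) * ss


# Hypothetical network means for n = 5 initial units in a region of N = 900 plots.
w = [0.0, 2.0, 0.0, 4.0, 0.0]
mu_hh = hh_mean(w)
var_hh = hh_variance(w, N=900)
```

Each initial unit contributes exactly one value $w_i$, so an initial unit that does not satisfy $C$ enters with its own $y$-value as a network of size one.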

The modified HT estimator of the mean takes the form

$$\mu_{HT} = \frac{1}{N} \sum_{k=1}^{r} \frac{y_k^*}{\alpha_k} \tag{10}$$

where $y_k^*$ is the sum of the $y$-values in the $k$-th network, $r$ is the number of distinct networks in the sample and $\alpha_k$ is the probability that network $k$ is included in the sample. An unbiased estimator of the variance of $\mu_{HT}$ is given by (Thompson and Seber 1996)

$$\widehat{\mathrm{var}}(\mu_{HT}) = \frac{1}{N^2} \left[ \sum_{j=1}^{r} \sum_{k=1}^{r} y_j^* y_k^* \left( \frac{\alpha_{jk} - \alpha_j \alpha_k}{\alpha_j \alpha_k} \right) \right] \tag{11}$$

where $\alpha_{jk}$ are the second-order inclusion probabilities. The modified HT estimator (10) is based on the fact that $\alpha_k = \alpha_i$ for every unit $i$ in network $k$. With our proposed variant, the networks $AS_1, AS_2, \ldots, AS_n$ do not form a partition of the population and may intersect, as the adaptive samples initialized by two different units might overlap. For the units which belong to only one network, an approximation of the first-order inclusion probability may be obtained by the usual formula (Thompson 1990):

$$\alpha_i = 1 - \frac{\binom{N - m_i}{n}}{\binom{N}{n}}. \tag{12}$$

Fig. 1 Flow chart of ACS with a data driven stopping rule

For the units which belong to more than one network, expression (12) has to be slightly modified as follows:

$$\alpha_i = 1 - \frac{\binom{N - m_i^*}{n}}{\binom{N}{n}} \tag{13}$$

where

$$m_i^* = \sum_{j=1}^{n} m_j I_{ij}$$

and $I_{ij}$ is an indicator function which takes the value 1 if unit $j$ belongs to the $i$-th sampled network $AS_i$, and 0 otherwise. In practice, the networks that overlap are merged and the inclusion probabilities are evaluated on the basis of the size of the resulting network. In our framework, the summation in the HT estimator is over the distinct units sampled and not over the distinct networks. Therefore, with our proposed variant of ACS, a suitable version of the modified HT estimator is given by

$$\mu_{HT} = \frac{1}{N} \sum_{i=1}^{v} \frac{y_i}{\alpha_i} \tag{14}$$

where $y_1, y_2, \ldots, y_v$ represent the $y$-values from the $v$ distinct labels in the final sample.
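The approximate inclusion probabilities and the unit-level HT estimator can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes that each sampled unit comes with the size of the (possibly merged) truncated network it belongs to, which plays the role of $m_i$ or $m_i^*$.

```python
from math import comb


def inclusion_probability(m, n, N):
    """Probability that an SRS of n units out of N hits a network of size m."""
    return 1 - comb(N - m, n) / comb(N, n)


def ht_mean(y_values, network_sizes, n, N):
    """Unit-level HT estimate: sum of y_i / alpha_i over distinct units, over N."""
    total = sum(
        y / inclusion_probability(m, n, N)
        for y, m in zip(y_values, network_sizes)
    )
    return total / N


# Hypothetical check: a network of size one has alpha = n / N.
alpha_single = inclusion_probability(1, n=15, N=900)
estimate = ht_mean([4], [1], n=15, N=900)
```

Larger networks have larger inclusion probabilities and are therefore down-weighted; because the sizes used here are those of truncated networks, the resulting probabilities are only approximations, as the text notes.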

The theoretical basis for the unbiasedness of the estimators for adaptive sampling relies on the fact that the networks built by this sampling design are disjoint and form a unique partition of the population for a specified criterion. In the ordinary ACS the first-order inclusion probabilities are computed using the fact that if a unit in a network is included in the initial sample then every unit in that network is included in the final sample. Our design does not have this property because of the use of a stopping rule which determines the premature end of the aggregative procedure. As a matter of fact, with our proposed variant of ACS the modified HH and HT estimators given in (8) and (14) turn out to be biased. The bias is the price one has to pay in order to limit the sampling effort. The key point is that it is not possible to evaluate the inclusion probabilities of each unit accurately; they can only be approximated by means of the size of the truncated networks. The effect of using a stopping rule on the properties of the modified HH and HT estimators has also been evaluated by Su and Quinn (2003).

3 Monte Carlo simulation

No theoretical results can be obtained on the performance of the stopping rule proposed in Sect. 2. Accordingly, similarly to Brown and Manly (1998) and Su and Quinn (2003), we evaluate the proposed sampling design by simulating artificial populations by means of a Poisson cluster process (Diggle 1983). We use the same 20 × 2 × 3 factorial design as Brown (2003), whose aim was to evaluate, by means of a thorough simulation study, the efficiency of ACS compared with SRS under different survey design factors.

The study area was a square grid divided into $N = 30 \times 30 = 900$ plots. For any population, the number of parents was a realization from a Poisson process with mean $\lambda_1 = 5, 10, \ldots, 100$. Parents were randomly located within the study area. For each parent, the number of children was generated according to a Poisson random variable with mean $\lambda_2 = 10, 20$. Children were randomly placed around the parents at a random angle uniformly distributed between 0° and 360° and at a distance taken from an exponential distribution with mean $\theta = 0.5, 1.5, 3.5$. The combination of these parameter values allows us to cover various levels of clustering: from rare and tightly aggregated populations ($\theta = 0.5$, $\lambda_2 = 10$, low values of $\lambda_1$), for which ACS outperforms SRS, to more sparse and less clustered populations ($\theta = 3.5$, $\lambda_2 = 20$ and high values of $\lambda_1$), for which it is well known that ACS is not suited.

Fig. 2 Examples of populations generated by a Poisson cluster process with different values of $\lambda_1$, $\lambda_2$ and $\theta$. Panels: ($\lambda_1 = 5$, $\lambda_2 = 10$, $\theta = 0.5$); ($\lambda_1 = 20$, $\lambda_2 = 10$, $\theta = 0.5$); ($\lambda_1 = 5$, $\lambda_2 = 20$, $\theta = 1.5$); ($\lambda_1 = 20$, $\lambda_2 = 20$, $\theta = 3.5$); ($\lambda_1 = 40$, $\lambda_2 = 20$, $\theta = 0.5$); ($\lambda_1 = 80$, $\lambda_2 = 20$, $\theta = 3.5$)
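The population generator described above can be sketched with the Python standard library alone. This is a minimal sketch under assumptions that the text does not specify: the $y$-value of a plot is the number of children falling in it, parents themselves are not counted, children landing outside the study area are discarded, and Poisson variates are drawn with Knuth's multiplication method. The function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Poisson variate by Knuth's multiplication method (fine for lam <= 100)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def poisson_cluster_population(lam1, lam2, theta, size=30, seed=None):
    """Counts of children per plot on a size x size grid."""
    rng = random.Random(seed)
    counts = [[0] * size for _ in range(size)]
    for _ in range(poisson(lam1, rng)):                 # number of parents
        px, py = rng.uniform(0, size), rng.uniform(0, size)
        for _ in range(poisson(lam2, rng)):             # children of this parent
            angle = rng.uniform(0, 2 * math.pi)         # uniform direction
            dist = rng.expovariate(1 / theta)           # exponential, mean theta
            x = px + dist * math.cos(angle)
            y = py + dist * math.sin(angle)
            if 0 <= x < size and 0 <= y < size:         # assumption: drop outsiders
                counts[int(y)][int(x)] += 1
    return counts


population = poisson_cluster_population(lam1=5, lam2=10, theta=0.5, seed=1)
```

Small `theta` keeps children close to their parent and yields compact clusters, while large `lam1` and `theta` spread individuals over many plots, which is the regime where ordinary ACS loses efficiency.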

Different realizations of the Poisson cluster process are given in Fig. 2 in order to illustrate how the populations are distributed over the study region. Within the study area a central area of 20 × 20 plots was defined as the sampling area to allow for edge effects from the Poisson cluster process.

For each population, $M = 10000$ sampling simulations were conducted using SRS, ordinary ACS and ACS with our proposed stopping rule. ACS designs were carried out by selecting an initial sample of size $n = 5, 10, \ldots, 25$. The sample size of SRS was set equal to the effective sample size of the adaptive procedures, namely $E(v)$. The condition for adaptive sampling was $C = \{y_i > 1\}$. The simulation study was based on sampling without replacement. For each population and for each design, the HH and HT estimators of the population mean were evaluated by using Eqs. (8) and (14), respectively. Estimates of the sample variance from our proposed variant of ACS were also evaluated by using the conventional estimators given in Eqs. (9) and (11).


Across the sampling simulations, the mean square error (MSE) of the estimators was recorded. For the design with the stopping rule only, the relative bias (RB) of the estimators was also evaluated. In particular we have:

$$MSE = \frac{1}{M} \sum_{i=1}^{M} (\mu_i - \mu)^2$$

$$E(v) = \frac{1}{M} \sum_{i=1}^{M} v_i$$

$$RB = \frac{1}{M} \sum_{i=1}^{M} \frac{\mu_i - \mu}{\mu}$$

where $\mu$ is the population mean, $\mu_i$ is the mean estimate and $v_i$ is the final sample size at the $i$-th sampling simulation. Similarly to Su and Quinn (2003), $v_i$ was evaluated without considering the edge units, so that $v_i$ represents the number of sampling units used in the estimators. The bias in the estimated variance was also evaluated by comparing, for each population, the average of the estimated variances with the actual variance of the $M = 10000$ sample estimates. Then, the relative efficiency of adaptive cluster sampling with and without the stopping rule with respect to simple random sampling was computed as follows:

$$re_{ACS} = \frac{MSE_{SRS}}{MSE_{ACS}}$$

$$re_{ACS^*} = \frac{MSE_{SRS^*}}{MSE_{ACS^*}}$$

where $MSE_{SRS}$ and $MSE_{SRS^*}$ are the mean square errors of the SRS estimators computed with a sample size equal to the effective sample size of the ordinary ACS ($E(v)_{ACS}$) and of ACS with our data-driven stopping rule ($E(v)_{ACS^*}$), respectively. With $ACS^*$ we denote our proposed ACS design with the stopping rule.

In reporting the results we emphasize the behaviour of the sampling designs with respect to cluster compactness, that is, the cluster size defined by the $\theta$ values and the number of individuals in each cluster defined by the $\lambda_2$ values. In particular, we are interested in the efficiency of $ACS^*$ relative to SRS compared with the efficiency of ACS relative to SRS. Furthermore, values of the effective sample size are compared for both adaptive designs by evaluating the sampling fractions $f_n = E(v)_{ACS}/N$ and $f_n^* = E(v)_{ACS^*}/N$.
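The summary measures defined above are simple averages over the $M$ simulated estimates and can be sketched as follows. The function names and the toy numbers are hypothetical and serve only to show how the quantities relate.

```python
def mse(estimates, mu):
    """Mean square error of the simulated estimates around the true mean mu."""
    return sum((e - mu) ** 2 for e in estimates) / len(estimates)


def relative_bias(estimates, mu):
    """Average relative deviation of the estimates from the true mean mu."""
    return sum((e - mu) / mu for e in estimates) / len(estimates)


def relative_efficiency(srs_estimates, acs_estimates, mu):
    """MSE of SRS divided by MSE of the adaptive design; > 1 favours the adaptive design."""
    return mse(srs_estimates, mu) / mse(acs_estimates, mu)


# Hypothetical estimates from four replications, true mean mu = 2.0.
mu = 2.0
acs_estimates = [1.8, 2.2, 2.0, 2.4]
srs_estimates = [1.0, 3.0, 2.0, 2.0]
re = relative_efficiency(srs_estimates, acs_estimates, mu)
rb = relative_bias(acs_estimates, mu)
```

A relative efficiency above 1 means that the adaptive design has the smaller MSE at the same effective sample size, which is how the curves in the results section should be read.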

4 Results

4.1 The influence of the stopping rule on the efficiency of the estimators

Results for the relative efficiency $re_{ACS}$ and $re_{ACS^*}$ of the modified HH and HT estimators are reported in Figs. 3 and 4, respectively.

Fig. 3 Hansen–Hurwitz estimator: relative efficiency with respect to SRS of ordinary ACS (ACS) and ACS with the data-driven stopping rule (ACS*), plotted against $\lambda_1$ ranging from 5 to 100, with $\lambda_2 = 10$ (first row), $\lambda_2 = 20$ (second row) and $\theta = 0.5, 1.5, 3.5$ (first, second and third column). ACS versus SRS (dotted line) and ACS* versus SRS (solid line)

ACS and ACS* have similar behaviours for rare and clustered populations ($\lambda_2 = 10, 20$ and $\lambda_1 < 30$). With compact clusters ($\theta = 0.5$), as the population total increases ($\lambda_1 > 35$), ACS* outperforms ACS and, more importantly, it is more efficient than SRS for almost all the populations, even those which are not rare. For less compact clusters ($\theta = 1.5, 3.5$) and for high-density populations ($\lambda_1 > 30$), neither ACS design is as efficient as SRS. Nevertheless, with our proposed stopping rule the adaptive design suffers an efficiency loss that is minor compared with that of ordinary ACS, which performs very poorly.

As is well known, the modified HH estimator is less efficient than the modified HT estimator (Salehi 2003). This is confirmed by the results of the simulation study, although the improvement in efficiency ensured by the modified HT estimator is negligible with our proposed variant of ACS. Ordinary ACS is more efficient than ACS* only when the HT estimator is applied to some of the less rare and less clustered populations ($\lambda_2 = 20$, $\theta = 1.5, 3.5$ and $\lambda_1 > 60$). This can be explained by the good behaviour of the HT estimator in the presence of an extremely high effective sample size (Christman 1997). However, under these circumstances the ACS design is prohibitive in terms of cost and time (sampling fraction $f_n > 0.6$; see Figs. 9, 10 and 11).

Results on relative efficiency have been reported only for an initial sample size of $n = 15$, since the behaviour of the stopping rule was quite similar for varying initial sample sizes. However, for the high-density and non-clustered populations, the appeal of our stopping rule with respect to the ordinary ACS is tempered when a relatively large initial sample size is used ($n > 20$).

Fig. 4 Horvitz–Thompson estimator: relative efficiency with respect to SRS of ordinary ACS (ACS) and ACS with the data-driven stopping rule (ACS*), plotted against $\lambda_1$ ranging from 5 to 100, with $\lambda_2 = 10$ (first row), $\lambda_2 = 20$ (second row) and $\theta = 0.5, 1.5, 3.5$ (first, second and third column). ACS versus SRS (dotted line) and ACS* versus SRS (solid line)

Finally, in Tables 1, 2, 3 and 4 we report, by means of factorial designs, the main effects and interactions of the factors $\lambda_1$, $\lambda_2$ and $\theta$ on the relative efficiency of ACS with and without the data-driven stopping rule. As we would expect, all three factors (degree of rarity $\lambda_1$, number of individuals per cluster $\lambda_2$ and cluster compactness $\theta$) have a significant effect on the relative efficiency of ACS and ACS* for both the HH and HT estimators, with the exception of $\lambda_2$ with HT and ordinary ACS. The interaction $\lambda_1 \times \theta$ is never significant: the efficiency loss observed when $\lambda_1$ increases does not vary with the cluster compactness $\theta$. Interestingly, the interaction between $\lambda_1$ and $\lambda_2$ is not significant with ACS* while it is significant with ACS. As the degree of rarity decreases, the efficiency loss of the HH and HT estimators caused by the increase of $\lambda_2$ is mitigated by introducing our data-driven stopping rule. Finally, the interaction between $\lambda_2$ and $\theta$, which is not significant under ACS, becomes significant with ACS*. In fact, the efficiency loss caused by the increase of $\lambda_2$, observed both in ACS and in ACS*, is mitigated for the more compact clusters with our data-driven stopping rule.

123

Adaptive cluster sampling with a data driven stopping rule 13

Table 1 ANOVA summary table: $re_{ACS}$ of $\mu_{HH}$

Source     Sum of squares   df    Mean square   F        p > F
λ1         7.622            19    0.40116       34.06    0
λ2         0.3059           1     0.30593       25.98    0
θ          4.2405           2     2.12025       180.04   0
λ1 × λ2    0.4451           19    0.02342       1.99     0.0351
λ1 × θ     0.3814           38    0.01004       0.85     0.6876
λ2 × θ     0.0463           2     0.02314       1.96     0.1541
Error      0.4475           38    0.01178
Total      13.4886          119

Table 2 ANOVA summary table: $re_{ACS^*}$ of $\mu_{HH}$

Source     Sum of squares   df    Mean square   F        p > F
λ1         2.45664          19    0.1293        21.45    0
λ2         0.026            1     0.026         4.31     0.0446
θ          3.47829          2     1.73914       288.46   0
λ1 × λ2    0.13469          19    0.00709       1.18     0.3258
λ1 × θ     0.21016          38    0.00553       0.92     0.6042
λ2 × θ     0.04392          2     0.02196       3.64     0.0357
Error      0.22911          38    0.00603
Total      6.5788           119

Table 3 ANOVA summary table: $re_{ACS}$ of $\mu_{HT}$

Source     Sum of squares   df    Mean square   F        p > F
λ1         4.8649           19    0.25605       2.97     0.0021
λ2         0.1776           1     0.17758       2.06     0.1591
θ          2.0933           2     1.04665       12.16    0.0001
λ1 × λ2    3.3315           19    0.17534       2.04     0.0306
λ1 × θ     2.9105           38    0.07659       0.89     0.6399
λ2 × θ     0.046            2     0.02298       0.27     0.7671
Error      3.2718           38    0.0861
Total      16.6955          119

4.2 Bias of the HH and HT estimators and estimates of the sample variance

As already noted, the stopping rule induces some bias in the modified HH and HT estimators. Figures 5, 6 and 7 show the relative bias for $n = 5, 15, 25$, respectively. The HT estimator seems to have less bias than the HH estimator, except for a very small initial sample size ($n = 5$). Increasing the initial sample size leads to a better behaviour of the HT estimator, with a relative bias always less than 5%. The opposite is observed with the HH estimator, as its bias increases as a function of the initial sample size $n$. The bias of HH is almost always positive with the proposed stopping rule. These results are in agreement with those presented by Su and Quinn (2003), even though they report greater bias of the HH estimator when used with their variant of ACS.

Table 4 ANOVA summary table: $re_{ACS^*}$ of $\mu_{HT}$

Source     Sum of squares   df    Mean square   F        p > F
λ1         3.25417          19    0.17127       21.54    0
λ2         0.1093           1     0.1093        13.75    0
θ          3.70646          2     1.85323       233.11   0
λ1 × λ2    0.23617          19    0.01243       1.56     0.1185
λ1 × θ     0.22848          38    0.00601       0.76     0.8034
λ2 × θ     0.33752          2     0.16876       21.23    0
Error      0.3021           38    0.00795
Total      8.1742           119

Fig. 5 Relative bias of the HH (solid line) and HT (dotted line) estimators for ACS with the data-driven stopping rule with an initial sample size $n = 5$, plotted against $\lambda_1$. Simulated populations with $\lambda_1$ ranging from 5 to 100, $\lambda_2 = 10$ (first row), $\lambda_2 = 20$ (second row) and $\theta = 0.5, 1.5, 3.5$ (first, second and third column)

However, a direct comparison is not feasible because they use a stopping rule different from $S_i^l$. Figure 8 shows that the conventional variance estimator $\widehat{\mathrm{var}}(\mu_{HH})$ given in (9) can be used as a measure of the variability of the HH estimator with ACS*. Its relative bias is smaller than 5% for all the populations considered. On the other hand, the variance of the HT estimator cannot be estimated directly, since the exact second-order inclusion probabilities cannot be evaluated with our proposed variant of ACS. Simulation results (not reported in this paper) have shown that the conventional variance estimator $\widehat{\mathrm{var}}(\mu_{HT})$ given in (11) does not give an acceptable approximation of the variance of the HT estimator. Indeed, the approximation of the inclusion probabilities often leads to estimates of the variance


Fig. 6 Relative bias of the HH (solid line) and HT (dotted line) estimators for ACS with the data-driven stopping rule with an initial sample size n = 15. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 (first row), λ2 = 20 (second row) and θ = 0.5, 1.5, 3.5 (first, second and third column)

Fig. 7 Relative bias of the HH (solid line) and HT (dotted line) estimators for ACS with the data-driven stopping rule with an initial sample size n = 25. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 (first row), λ2 = 20 (second row) and θ = 0.5, 1.5, 3.5 (first, second and third column)


Fig. 8 Hansen–Hurwitz estimator: relative bias of the conventional variance estimator for ACS with the data-driven stopping rule with an initial sample size n = 15. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 (first row), λ2 = 20 (second row) and θ = 0.5, 1.5, 3.5 (first, second and third column)

with negative values. An easy-to-compute approximation of $\widehat{\mathrm{var}}(\mu_{HT})$ is given by the HH variance estimator $\widehat{\mathrm{var}}(\mu_{HH})$ (Berger 1998). As is well known, the HT estimator is in general more efficient than the HH estimator (Salehi 2003); therefore $\widehat{\mathrm{var}}(\mu_{HH})$ has a positive bias as an estimator of $\mathrm{var}(\mu_{HT})$ (Durbin 1953). Finding a good variance estimator for the HT estimator when used with our proposed variant of ACS could be the object of further research.
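To make the two estimators concrete, the sketch below computes the HH estimate with its conventional variance estimator, and the HT point estimate, using the standard forms for ordinary ACS from Thompson (1990). Equations (9) and (11) are not reproduced in this section, so it is an assumption that they coincide with these forms. The function names and the toy numbers are illustrative. Under the stopping rule the network sizes may be truncated, so the inclusion probabilities computed here would only be approximations.

```python
from math import comb


def hh_estimate(network_means, N):
    """Hansen-Hurwitz estimate of the mean and its variance estimate.

    network_means -- for each of the n initial units, the mean of y over the
                     network that contains it (n must be at least 2)
    N             -- population size
    """
    n = len(network_means)
    mu = sum(network_means) / n
    s2 = sum((w - mu) ** 2 for w in network_means) / (n - 1)
    return mu, (N - n) / (N * n) * s2


def ht_estimate(networks, N, n):
    """Horvitz-Thompson estimate of the mean.

    networks -- (y_total, size) for each distinct network hit by the
                initial sample of size n
    """
    total = 0.0
    for y_total, size in networks:
        # Probability that at least one unit of the network is initially drawn.
        alpha = 1 - comb(N - size, n) / comb(N, n)
        total += y_total / alpha
    return total / N


if __name__ == "__main__":
    # Toy case: N = 10, n = 2; one initial unit falls in a network of two
    # units with values 4 and 6, the other in a single empty unit.
    mu_hh, var_hh = hh_estimate([5.0, 0.0], N=10)
    mu_ht = ht_estimate([(10, 2), (0, 1)], N=10, n=2)
    print(mu_hh, var_hh, mu_ht)
```

Using the HH variance estimate as a stand-in for the variance of the HT estimate, as suggested in the text, only requires the first function and is conservative whenever the HT estimator is the more efficient of the two.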

4.3 Effective sample size

The specification of the exact upper limit of the number of sampled units is a key point in many real-life applications. Figures 9, 10 and 11 show the final sampling fractions $f_n$ and $f_n^*$ of the two adaptive designs, ACS and ACS*, for different initial sample sizes. With our stopping rule, $f_n^*$ is always less than 0.4. For initial sample sizes n = 5, 15, $f_n^*$ is almost always less than 0.2. For initial sample sizes n > 15, values of $f_n^*$ larger than 0.2 are reported only for some populations with λ1 > 50, λ2 = 20 and θ = 1.5, 3.5. The analysis of the final sampling fraction of ACS highlights the effectiveness of our stopping rule in limiting the sampling effort. Indeed, for the less clustered populations $f_n$ exceeds 40 per cent, and for λ1 > 60, λ2 = 20 and θ = 1.5, 3.5 extremely high effective sample sizes are reported ($f_n$ > 0.6). Under these circumstances ordinary ACS is unfeasible from both a logistical and a cost perspective. With our stopping rule, the reduction in effective sample size with respect to ordinary ACS is small for highly patchy populations with small cluster size (λ1 < 50, λ2 = 10, 20 and θ = 0.5). It


Fig. 9 Sampling fraction of ordinary ACS ($f_n$, dotted line) and of ACS with the data-driven stopping rule ($f_n^*$, solid dotted line). Initial sample size n = 5. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 (first row), λ2 = 20 (second row) and θ = 0.5, 1.5, 3.5 (first, second and third column)

becomes relevant for the less clustered populations and as λ1 increases. Indeed, the specification of the exact upper limit of the number of sampled units is a key point in many real-life applications (Rocco 2007). Table 5 reports some empirical statistics on the final sample size for two simulated populations. Ordinary ACS leads to sampling more units than cost and time would probably allow. At the same time, ACS* performs well both in controlling the maximum final sampling effort and in reducing the variability of the final sample size.
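The figures in Table 5 can be turned into sampling fractions and effort ratios with a few lines of arithmetic. The sketch below assumes that the sampling fraction is simply the final sample size v divided by N = 400. The data structure and function names are illustrative.

```python
N = 400  # population size stated in the Table 5 caption

# (design, λ2) -> statistics of the final sample size v, copied from Table 5.
TABLE5 = {
    ("ACS*", 10): {"mean": 38, "max": 82, "std": 10.20},
    ("ACS*", 20): {"mean": 60, "max": 113, "std": 15.52},
    ("ACS", 10): {"mean": 80, "max": 115, "std": 18.57},
    ("ACS", 20): {"mean": 145, "max": 182, "std": 18.67},
}


def sampling_fractions(table, population_size):
    """Express mean and maximum final sample sizes as fractions of N."""
    return {
        key: (stats["mean"] / population_size, stats["max"] / population_size)
        for key, stats in table.items()
    }


def effort_ratio(table, lam2):
    """Mean final sample size of ACS* relative to ordinary ACS."""
    return table[("ACS*", lam2)]["mean"] / table[("ACS", lam2)]["mean"]


if __name__ == "__main__":
    for key, (mean_f, max_f) in sampling_fractions(TABLE5, N).items():
        print(key, f"mean fraction = {mean_f:.3f}", f"max fraction = {max_f:.3f}")
    for lam2 in (10, 20):
        print(lam2, f"ACS*/ACS mean effort = {effort_ratio(TABLE5, lam2):.2f}")
```

With these numbers the mean effort of ACS* is a little under half that of ordinary ACS for both populations, and the largest observed sample drops from 182 to 113 units of 400 when λ2 = 20.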

5 Discussion

As is well known (Brown 2003; Su and Quinn 2003), and as the simulation study has confirmed, when there is no prior information about the rarity and patchiness of the population, applying adaptive cluster sampling can be prohibitive in terms of time and available resources, and the survey may become impractical. The stopping rule proposed in this paper adds a sequential component to ordinary ACS, which aims to predict the expected performance of ACS relative to that of SRS. This information, taken from the sample data, is then used to modify the sampling effort accordingly.

Despite the lack of theoretical grounding, simulation results show that the proposed stopping rule provides a substantial improvement in efficiency when incorporated into adaptive cluster sampling. For the rarest and most clustered populations, the sampling effort of ACS* is very close to that of ACS without the stopping rule. Thus, in situations ideal for ACS (populations with few small networks and high within-network variance) the proposed stopping rule operates in only very few samples. As a matter of fact, the relative efficiency of ACS* is nearly the same as (in general slightly worse than) that of ordinary ACS. Nevertheless, in such a case,


Fig. 10 Sampling fraction of ordinary ACS ($f_n$, dotted line) and of ACS with the data-driven stopping rule ($f_n^*$, solid dotted line). Initial sample size n = 15. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 (first row), λ2 = 20 (second row) and θ = 0.5, 1.5, 3.5 (first, second and third column)

it would be preferable to adopt the ordinary ACS strategy, which ensures unbiased estimators. Moreover, the reduction of the sampling effort makes adaptive cluster sampling with a stopping rule more desirable than simple random sampling for a wider range of populations. In particular, for the more compact clusters (θ = 0.5) and for both low- and high-density populations (λ2 = 10, 20), the proposed variant of ACS seems to be effective in reducing the well-known edge units effect, that is, sampling units which do not contribute to the estimator but at the same time increase the final sample size. Thus, we can conclude that the proposed design is as efficient as ordinary ACS for rare and tightly clustered populations and more efficient than ordinary ACS for a range of populations that lack clustering.

As expected, the resulting estimators turn out to be biased. However, the results have shown that the bias of the estimators is negligible. Furthermore, Brown and Manly (1998) have shown that the bootstrap procedure can be applied under an ACS design with a stopping rule in order to estimate the bias of the estimators. Thus, $MSE_{ACS^*}$ and $re_{ACS^*}$ could be further improved. We were not able to provide a suitable variance estimator for the HT estimator, so potential users are recommended to use the HH estimator with the variant of ACS proposed in this paper.

The value c of the aggregative condition C, the treatment of the edge units and the way relative efficiency was measured deserve some further comment.

It could be argued that the condition C = {y_i > c = 1} would not be a reasonable one for some of the simulated populations. Nevertheless, we keep the value c = 1 in


Fig. 11 Sampling fraction of ordinary ACS ($f_n$, dotted line) and of ACS with the data-driven stopping rule ($f_n^*$, solid line). Initial sample size n = 25. Simulated populations with λ1 ranging from 5 to 100, λ2 = 10 (first row), λ2 = 20 (second row) and θ = 0.5, 1.5, 3.5 (first, second and third column)

Table 5 Statistics concerning the final sampling effort of ACS and ACS* for two simulated populations with λ1 = 50 and θ = 1.5. Initial sample size n = 15, N = 400

Statistics    ACS*                 ACS
              λ2 = 10   λ2 = 20    λ2 = 10   λ2 = 20
E(v)          38        60         80        145
Max(v)        82        113        115       182
Std(v)        10.20     15.52      18.57     18.67

all the simulations, since the purpose of this article is to provide an effective solution for situations in which the final sample size would be out of control. We are aware that if a value of c larger than 1 were used for the less patchy populations, one would observe an improvement in the relative efficiency of ordinary ACS. However, the simulation study has shown that under conditions where ACS is suited, ACS* would perform quite similarly.

When the stopping rule $S_i^l$ is not satisfied, the ordinary estimators used in ACS are computed using the data collected up to step l − 1, and therefore the units sampled at step l are not included in the estimators. It could be argued that in this way we are losing information. In this regard, we stress that if condition $S_i^l$ is not satisfied, the units aggregated at the l-th step would not provide valuable


information, and their inclusion in the estimator would cause a loss of efficiency with respect to the previous step l − 1. In fact, we tried to use these units in the simulations, but this resulted in an efficiency loss for ACS* and in an increase in the bias of the estimators. This is clearly due, as in ordinary ACS, to the very poor approximation of the inclusion probabilities for these units. The procedure agrees with what happens in ordinary ACS, where the edge units are not used in the standard adaptive estimators because their inclusion probabilities cannot be determined from the sample data. Thus, the ordinary estimators of ACS incorporate only those edge units which were in the initial sample. We do the same with our proposed variant of ACS.
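The truncation described above, where units observed at the step that fails the rule are discarded, can be illustrated with a step-wise expansion on a grid. This is only a sketch under stated assumptions: the population is a rectangular grid with a four-unit neighbourhood, and `keep_going` is a hypothetical placeholder predicate, not the stopping rule $S_i^l$ defined earlier in the paper.

```python
def expand_stepwise(y, start, c=1, keep_going=lambda kept, new, step: True):
    """Grow the sample around `start` one aggregation step at a time.

    y          -- 2-D list of counts on a rectangular grid
    start      -- (row, col) of an initial unit
    c          -- units with y > c trigger the sampling of their neighbours
    keep_going -- hypothetical stand-in for the stopping rule: called before
                  the units of a step are accepted; returning False discards
                  that step and ends the expansion
    Returns the units retained for estimation and the last completed step.
    """
    rows, cols = len(y), len(y[0])
    kept, seen = [start], {start}
    frontier = [start] if y[start[0]][start[1]] > c else []
    step = 0
    while frontier:
        new = []
        for r, k in frontier:
            for dr, dk in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nb = (r + dr, k + dk)
                if 0 <= nb[0] < rows and 0 <= nb[1] < cols and nb not in seen:
                    seen.add(nb)
                    new.append(nb)
        if not new or not keep_going(kept, new, step + 1):
            break  # units observed at this step are not used
        step += 1
        kept.extend(new)
        frontier = [u for u in new if y[u[0]][u[1]] > c]
    return kept, step


if __name__ == "__main__":
    grid = [
        [0, 0, 0, 0],
        [0, 3, 4, 0],
        [0, 0, 5, 0],
        [0, 0, 0, 0],
    ]
    print(expand_stepwise(grid, (1, 1)))
    # Toy rule: refuse every step after the first one.
    print(expand_stepwise(grid, (1, 1), keep_going=lambda kept, new, s: s < 2))
```

In the second call the expansion stops before the second step is accepted, so only the initial unit and its first ring of neighbours are returned, mirroring the use of data collected up to step l − 1.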

More efficient estimators which use all the sampled units can be obtained by applying the Rao-Blackwell theorem based on the minimal sufficient statistic (Thompson 1990; Salehi and Seber 1997b; Dryver and Thompson 2005). Evaluating the performance of our proposed variant of ACS with these estimators could be the goal of further developments of the present work.

The relative efficiency was measured without considering the edge units in the evaluation of the final sample size E(v). Following Su and Quinn (2003), we chose to compare the efficiency by focusing on the number of sampling units used in the estimators. We point out that including the edge units in the evaluation of E(v) would decrease the efficiency of both adaptive designs, ACS and ACS*, with respect to SRS, but their relative performance would not be affected by this choice of sample size. Furthermore, with ACS* the number of edge units would be smaller than with ACS. Thus, the overall performance of ACS* in comparison with ACS is likely to improve.
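The effect of counting or not counting edge units in E(v) can be seen with a small calculation. The definition of relative efficiency is not restated in this section, so the sketch below assumes a common form: the variance of the SRS sample mean at sample size E(v), divided by the mean squared error of the adaptive estimator, with values above 1 favouring the adaptive design. All numbers are hypothetical.

```python
def srs_variance(s2, n, N):
    """Variance of the sample mean under SRS without replacement.

    s2 -- population variance (divisor N - 1), n -- sample size, N -- population size
    """
    return s2 / n * (1 - n / N)


def relative_efficiency(s2, expected_v, N, mse_adaptive):
    """Assumed definition: var_SRS at sample size E(v) over MSE of the adaptive estimator."""
    return srs_variance(s2, expected_v, N) / mse_adaptive


if __name__ == "__main__":
    # Hypothetical values: population variance 100, N = 400, adaptive MSE 2.0.
    without_edges = relative_efficiency(100.0, 40, 400, 2.0)
    with_edges = relative_efficiency(100.0, 60, 400, 2.0)
    print(without_edges, with_edges)
```

Counting edge units enlarges E(v), which lowers the SRS variance used as the benchmark and therefore lowers the relative efficiency of the adaptive design, in line with the remark in the text.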

Finally, it should be noted that the proposed variant of ACS still shares with the original ACS proposed by Thompson (1990) the drawback of an uncertain final sample size. Indeed, the proposed stopping rule still allows a large network to be sampled completely, but the probability of this happening depends on the presence of a large network in which the variance of the children points around the parents increases sharply. Such a network is unlikely to be observed in real-life populations. Moreover, the results show that the final sampling fraction is well controlled in all simulations. The proposed stopping rule has the benefit of reducing the risk of cost overruns due to the adaptive increase in sample size.

References

Berger GY (1998) Rate of convergence for asymptotic variance of the Horvitz-Thompson estimator. J Stat Plan Inference 74:149–168

Brown JA, Manly BJF (1998) Restricted adaptive cluster sampling. Environ Ecol Stat 5:49–63

Brown JA (2003) Designing an efficient adaptive cluster sample. Environ Ecol Stat 10:95–105

Christman MC (1997) Efficiency of some sampling designs for spatially clustered populations. Environmetrics 8:145–166

Christman MC, Lan F (2001) Inverse adaptive cluster sampling. Biometrics 57:1096–1105

Diggle PJ (1983) Statistical analysis of spatial point patterns. Academic Press, London

Dryver AL, Thompson SK (2005) Improved unbiased estimators in adaptive cluster sampling. J R Stat Soc B 67:157–166

Durbin J (1953) Some results in sampling theory when the units are selected with unequal probabilities. J R Stat Soc B 15:262–269

Goldberg NA, Heine JN, Brown JA (2007) The application of adaptive cluster sampling for rare subtidal macroalgae. Mar Biol 151:1343–1348

Lo NCH, Griffith D, Hunter JR (1997) Using a restricted adaptive cluster sampling to estimate Pacific hake larval abundance. CalCOFI Rep 38:103–113

Magnussen S, Kurz W, Leckie DG, Paradine D (2005) Adaptive cluster sampling for estimation of deforestation rates. Eur J For Res 124:207–220

Muttlak HA, Khan A (2002) Adjusted two-stage adaptive cluster sampling. Environ Ecol Stat 9:111–120

Rocco E (2003) Constrained inverse adaptive cluster sampling. J Official Stat 19:45–57

Rocco E (2007) Two-Stage Restricted Adaptive Cluster Sampling. Working paper 12, Dipartimento di Statistica G. Parenti, Firenze

Salehi MM (2003) Comparison between Hansen–Hurwitz and Horvitz–Thompson estimators for adaptive cluster sampling. Environ Ecol Stat 10:115–127

Salehi MM, Seber GAF (1997a) Two-stage adaptive cluster sampling. Biometrics 53:959–970

Salehi MM, Seber GAF (1997b) Adaptive cluster sampling with networks selected without replacement. Biometrika 84:209–219

Salehi MM, Seber GAF (2002) Unbiased estimators for restricted adaptive cluster sampling. Aust NZ J Stat 44:63–74

Smith DR, Conroy MJ, Brakhage DH (1995) Efficiency of adaptive cluster sampling for estimating density of wintering waterfowl. Biometrics 51:777–788

Smith DR, Villella RF, Lemarié DP (2003) Application of adaptive cluster sampling to low-density populations of freshwater mussels. Environ Ecol Stat 10:7–15

Smith DR, Brown JA, Lo NCH (2004) Application of adaptive cluster sampling to biological populations. In: Thompson WL (ed) Sampling rare or elusive species. Island Press, Covelo, pp 93–152

Su Z, Quinn TJ II (2003) Estimator bias and efficiency for adaptive cluster sampling with order statistics and a stopping rule. Environ Ecol Stat 10:17–41

Thompson SK (1990) Adaptive cluster sampling. J Am Stat Ass 85:1050–1059

Thompson SK, Seber GAF (1996) Adaptive sampling. Wiley, New York

Thompson SK (1996) Adaptive cluster sampling based on order statistics. Environmetrics 7:123–133

Turk P, Barkowski JJ (2005) A review of adaptive cluster sampling: 1990–2003. Environ Ecol Stat 12:55–94

Zacks S (2009) Stage wise adaptive design. Wiley, New York
