Using selection functions to describe changes in environmental variables

13
Environmental and Ecological Statistics 2, 225-237 (1995) Using selection functions to describe changes in environmental variables L.L. McDONALD I, L. GONZALEZ 2andB.F.J. MANLY 2 l WEST Inc., 2003 Central Avenue, Cheyenne, WY 82001, USA 2University of Otago, P.O. Box 56, Dunedin, New Zealand Received November 1992; revised September 1995 The selection function (which shows how the frequency of sampling units with the value X = x at one point in time must change in order to produce the distribution that occurs at a later point in time) is proposed for describing the changes over time in an environmentally important variable J(. It is shown that the theory of selection functions as used in the study of natural selection and resource selection by animals requires some modifications in this new application and that a selection function is a useful tool in long-term monitoring stu- dies because all changes in a distribution can be examined (rather than just changes in single parameters such as the mean), and because graphical presentations of the selection function are easy for non-statisticians to under- stand. Estimation of the selection function is discussed using a method appropriate for normal distributions and bootstrapping is suggested as a method for assessing the precision of estimates and for testing for signifi- cant differences between samples taken at different times. Methods are illustrated using data on water chemical variables from a study of the effects of acid precipitation in Norway. Keywords: environmental monitoring, selection function, natural selection, EMAP, environmental assessment, fitness function 1. Introduction We consider the situation where a set of monitoring stations is chosen by a random process from a larger pool of potential stations, and an environmentally important variable X is measured at the chosen stations at times tl, t2, t3, etc. There is then interest in the manner in which the distribution of X changes with time for the population of potential monitoring stations, as reflected by changes at the sampled stations. An example of this situation is the interpenetrating subsamples design envi- sioned for the US Environmental Protection Agency's (US EPA) Environmental Monitoring and Assessment Program (EMAP) (Overton et al., 1990). An important question in this context concerns the best way to test for and to represent changes in the distribution of X. Overton et al. (1990) emphasize the changes in the cumulative frequency dis- tributions between two points in time. We agree with their approach but consider that many gov- ernment officials and the public at large have difficulty understanding cumulative frequency distributions and changes in these. We therefore suggest the use of selection functions to represent the changes in the distribution of X over time. A selection function w(x) describes how the individuals in one population must be selected in order to produce the individuals in a second population. For example, if each individual in popula- tion 1 with X = x produces w(x) offspring with the same X value, then the distribution of Z for the offspring is the distribution for population 2. With a continuous random variable, w(x) is related to the probability density functions (p.d.f.) for 1352-8505 © 1995 Chapman & Hall

Transcript of Using selection functions to describe changes in environmental variables

Environmental and Ecological Statistics 2, 225-237 (1995)

Using selection functions to describe changes in environmental variables

L . L . M c D O N A L D I, L. G O N Z A L E Z 2 a n d B . F . J . M A N L Y 2

l WEST Inc., 2003 Central Avenue, Cheyenne, WY 82001, USA

2University of Otago, P.O. Box 56, Dunedin, New Zealand

Received November 1992; revised September 1995

The selection function (which shows how the frequency of sampling units with the value X = x at one point in time must change in order to produce the distribution that occurs at a later point in time) is proposed for describing the changes over time in an environmentally important variable J(. It is shown that the theory of selection functions as used in the study of natural selection and resource selection by animals requires some modifications in this new application and that a selection function is a useful tool in long-term monitoring stu- dies because all changes in a distribution can be examined (rather than just changes in single parameters such as the mean), and because graphical presentations of the selection function are easy for non-statisticians to under- stand. Estimation of the selection function is discussed using a method appropriate for normal distributions and bootstrapping is suggested as a method for assessing the precision of estimates and for testing for signifi- cant differences between samples taken at different times. Methods are illustrated using data on water chemical variables from a study of the effects of acid precipitation in Norway.

Keywords: environmental monitoring, selection function, natural selection, EMAP, environmental assessment, fitness function

1. Introduction

We consider the situation where a set of monitoring stations is chosen by a random process from a larger pool of potential stations, and an environmentally important variable X is measured at the chosen stations at times tl, t2, t3, etc. There is then interest in the manner in which the distribution of X changes with time for the population of potential monitoring stations, as reflected by changes at the sampled stations. An example of this situation is the interpenetrating subsamples design envi- sioned for the US Environmental Protection Agency's (US EPA) Environmental Monitoring and Assessment Program (EMAP) (Overton et al., 1990).

An important question in this context concerns the best way to test for and to represent changes in the distribution of X. Overton et al. (1990) emphasize the changes in the cumulative frequency dis- tributions between two points in time. We agree with their approach but consider that many gov- ernment officials and the public at large have difficulty understanding cumulative frequency distributions and changes in these. We therefore suggest the use of selection functions to represent the changes in the distribution of X over time.

A selection function w(x) describes how the individuals in one population must be selected in order to produce the individuals in a second population. For example, if each individual in popula- tion 1 with X = x produces w(x) offspring with the same X value, then the distribution of Z for the offspring is the distribution for populat ion 2.

With a continuous random variable, w(x) is related to the probabili ty density functions (p.d.f.) for

1352-8505 © 1995 Chapman & Hall

226 McDonald, Gonzalez, Manly

the two populations (Fig. 1). Iffl(x) is the p.d.f, for population 1, thenf2(x) = Aw(x)fl(x), where A is the constant required to ensure that f2(x) integrates to 1. Thus

f2 (X) = W(X)f 1 (X)/W ( l. 1 )

where v~ is the mean value of w(x) in population 1. Hence

w(x) = #f2(x)/fl (x) (1.2)

Equations (1.1) and (1.2) also apply for discrete distributions withj~ (x) defined to be the probability of X = x for population i. When sampling from a finite population, the frequency of sampling units with X = x is proportional tof l (x) in population 1 and to w(x)fl(x) in population 2. Hence, w(x) intuitively gives the relative change in the frequency of sampling units with X = x from time tl to time t2.

Equation (1.1) shows that if w(x) is multiplied by any positive constant then this will not affect the distribution in population 2. This means that # can be set at any value and w(x) will still be a

,=,

0 . 4 0 -

0 . 3 5 -

O .SO

0 . 2 5

0 . 2 0

0 . 1 5 -

0 . 1 0 -

0 . 0 5 -

0 . 0 0

I N I T I A L D I S T R I B U T I O N

1 2 3 4 5 6 7 8 9 1 0

2 8 -

126-

R 4 -

2 2 -

2 0

1 8 -

1 6 -

1 4 -

1 2 -

1 0 -

8 -

6 -

4

2

0 0

S E L E C T I O N F U N C T I O N

1 2 3 4 5 6 7 8 9 1 0

0.40 F I N A L D I S T R I B U T I O N

0 . 3 5 -

::~ 0.30

20.25 ~ - 0 . 2 0

~ : 0 . 1 5 - /

0 . 1 0

0 . 0 5 -

0 . 0 , 0 ~ , , " - - , - : ~ , , ~ ,

o 1 2 3 4 5 6 7 5 ~ l O

V A L U E O F V A R I A B L E

Figure 1. Graphs of the initial distribution, selection function, and final distribution for a variable X.

Using selection functions 227

selection function that will convert population 1 into population 2. In practice, it is therefore con- venient to let ~ = 1 and take

w(x) = f2(x) /fl (x) (1.3)

In some cases there will be more than one period of selection. Then the above ideas generalize in an obvious way so that the selection function can be defined to be the function w(x;O that changes the initial distribution of a population into the distribution observed after time t.

In the study of natural selection, population 1 is some initial population of organisms that is subjected to selection, leading to a population of survivors. The fitness of an individual with the character X can then be defined as the probability of that individual surviving, or any constant multiple of that probability. The function that gives fitness in terms of values of X is then a selection function as defined above, although it is often called a fitness function. The use of selection functions in this area is reviewed by Manly (1985), who notes that Pearson (1903) obtained some of the basic results for samples from multivariate normal distributions. Similar ideas have also been applied in the study of resource selection by wild populations (Manly et al., 1993).

The use of selection functions in biology has links with the general theory of weighted distribu- tions and the general problem of measuring the distance between two distributions. Equation (1.1) defines f2(x) as the weighted version of the distribution fl(x) with weight function w(x) (Patil and Rao, 1978; Patil 1991), and the Kullback-Leibler measure of the distance between the distributions defined byfl(x) andfz(x) is loge[w(x)] (Kullback, 1983). These links between selection functions and other areas of statistical theory have not been stressed in the past and are not pursued here.

In environmental monitoring one primary objective is to describe changes over time in the values of environmental variables. We show that the use of selection functions can be transferred to envir- onmental monitoring programs with minor changes to methods. However, first we develop some theory concerning exactly what a selection function represents in this context.

Consider a continuous variable. Suppose that a large population of units exists, and that a ran- domly chosen unit has the values XI and )(2 at times tl and t2 respectively. Let the p.d.f, for X1 be fl(xl), and assume that the conditional p.d.f, of )(2 given X1 = Xl is g(XZlXl). It follows that

f2(x2) = I ~ g(x2,xt)fl(Xl)dX 1 (1.4)

which can be interpreted loosely as the 'probability' that X2 = x2, given X1 -- xl, multiplied by the 'probability' that X1 = Xl, integrated over the possible values for Xl. Defining the selection function by Equation (1.3) it is then found that this function is

w(x) = g(x)/fl (x) (1.5)

where g(x)is the average value of g(x[xl) in the population at time tl, divided by the p.d.f, of the population at that time. A similar result holds in the case of a variable with a discrete distribution.

Equation (1.4) is of theoretical interest because it shows how the selection function is related to the distributions involved in population changes. However, when estimating the selection function it is best to view it as the ratio defined by Equation (1.3) for a continuous variable and the ratio

w(x) = P2(x)/P1 (x) (1.6)

for a discrete variable, where Pi(x) is the probability of a value X= x at time i. Under certain circumstances it will be considered appropriate to transform the data before

estimating a selection function. In the case of a discrete distribution there is clearly no effect of a

228 McDonald, Gonzalez, Manly

monotonic transformation because setting y = ~b(x) in Equation (1.6) just leads to

w(y) = P2(y)/Pl(y)

for any value of x. The situation is not quite so obvious with a continuous distribution because the p.d.f, of the transformed distributions includes a Jacobian term. However, it is found that the Jaco- bian cancels out so that the selection function value for X = x with the untransformed data is the value for the transformed data for Y= ~b(x).

2. Estimation of selection functions with normal distributions

Suppose that the distribution of X at time tl is normal with mean #1 and variance cr 2. Then the selection function for the time period tl to t2 is the ratio of the p.d.f, at time t2 to the p.d.f, at time tl:

w(x) = (Crl/O'2)exP[½ {(x - #~)/cr~} 2 - ½{(x - #2)/cr2} 2] (2.1)

(Manly, 1985, p. 61). An obvious estimator of this function is then obtained by substituting sample means and variances in the right-hand side of this equation, to obtain

~v(x) = (Sl/S2)exP[½ {x - 21)/Sl} 2 - ½{(x - 22)/s2} 2] (2.2)

w h e r e 2 i and si are the mean and standard deviation, respectively, of the observations on the sampled units at time ti.

Equation (2.2) is a complicated function of the values in two samples that are not independent because the situation being considered is where the same n sample units are measured at the two sample times. Determining the bias and variance of v?(x) is not straightforward, although approx- imations for these quantities could be determined using the usual Taylor series method given appro- priate assumptions about the correlation between years. We have decided not to pursue this possibility, but to investigate bootstrapping instead. We have found that bootstrapping works quite well when carried out on loge{~(x)} ---- fi(x) rather than v?(x). In particular, the standard error of fi(x) can be estimated for chosen values of x, and an approximate 95% confidence interval for u(x) = 1Oge{W(X)} can be estimated either using the bootstrap standard error or by the percentile method (Efron and Tibshirani, 1993), with resampling of pairs of values on sample units.

3. Test for a change in the distribution

To test the null hypothesis of no difference in the distributions of X at the two sample times we have considered the log-likelihood ratio statistic that is often used with two independent samples of size n from the same normal distribution. This is

/~ = n{1Oge(S~/s~) + 1Oge (S2T/S 2) } (3.1)

where si is the standard deviation in sample i and sr is the standard deviation for the two samples combined into one (Manly, 1985, p. 56). Of course, with the data being considered the two samples are not independent and the normality assumption will often not hold. Nevertheless, ,k is a sensible statistic and the bootstrap procedure described below is designed to work even in the absence of independence and normality.

We used the following bootstrap algorithm to test for a change in the distribution of X:

Using selection functions 229

a) Calculate A for the observed data, using Equation (3.1). b) Code the values in sample 2 to have the same mean and standard deviation as for the values in

sample 1, i.e. remove the differences in the mean and standard deviation between samples 1 and 2.

c) Take a bootstrap sample of n units with the paired values xl and x2 (with x2 coming from the standardized data in sample 2) and calculate a bootstrap value for the test statistic A. This simulates estimation of A with a constant population.

d) Repeat step (c), say B times, to obtain B bootstrap values of A. e) Consider the observed value of A to be significant at the 5% level if it is among the top 5% of

the B + 1 values consisting of the observed value and the bootstrap values.

With the analysis of real data the algorithm will be applied only once. Therefore B should be relatively large (e.g. 1000-5000), although a large value of B is not needed to justify the procedure and B = 19 makes significance at the 5% level possible.

4. Simulation study

We have carried out a simulation experiment to evaluate the bias involved in using Equation (2.2) as an estimator of a population selection function, and to assess the value of bootstrapping for the estimation of standard errors, the determination of confidence intervals, and testing for a significant change in a distribution. For this experiment the population was always normal with mean #i = 5 and variance o.2 = 1 at time tl and a factorial design was used for the values for other parameters, as follows:

number of units sampled: n = 20 and 40 population mean at time t2:#2 = #a and #1 + cr~ population variance at time t2:o.2 = 0.5o.1, o.1 and 2o.1 correlation between X at times ta and t2: p = 0.3 and 0.7.

We express the levels of #2 and o-2 in this way to emphasize that the simulation results apply for any situation where sample values are normally distributed. We also note that the nature of our estima- tors and test statistics implies that results for #2 = #l-cr~ should be the same as the results for #2 = #1 + o.1. Thus, we only considered an increase in the mean between the two sample times.

The required parameter values at time t2 were obtained by setting

where/30 and/31 were suitable chosen constants, and e was a pseudo-random value from a normal distribution with mean 0 and a suitably chosen standard deviation o-~. This model ensures that X2 is normally distributed so that Equation (2.1) gave the selection function exactly for these simulations.

We used Wichmann and Hill's (1982) generator for producing pseudo-random numbers between 0 and 1, and sums of 12 of these numbers for producing normally distributed pseudo-random num- bers. All results are based on 1000 sets of generated data with 250 bootstrap samples taken from each set.

Table 1 summarizes the simulation results that relate to the accuracy of the estimation of u(x) = loge{W(X)} values for x = 3, 4, 5, 6 and 7. These x values cover most of the range for the initial

2 3 0

g

~--, ".~ , -~ ;::~. ~ " ~ t"q r~

• . ~ - ~

¢,::1 0 ~ ~o .

o~o .8 • = ' . = oa II

. ~ ~ 6" ~ o 8 ~ ~ . ~ ' ~

o , ' ~ . .-~ L -

rz3 , , .~ • ,~

~ ~ o ~

O

, ,~ ~ 0 ©

,-4

G

".4

G

?,

~ ,.-.,

Y.

McDonald, Gonzalez, Manl~

I I I I I I I I I I I I

I I I I I I I I I I I I

I I I I l l I I I I I I I I I I

I I I l l l I l t l I I I I I I

Using se lec t ion f u n c t i o n s 231

I l l I I I I l l I l l

I I I I I I I I I I I I

I l l I I I I l l I I I

I l l I I I I l l I I I

I I I I

I I I I

0

~ ÷ ÷

232 McDonald, Gonzalez, Manly

distribution (#1 =E 2o-a) and are therefore a good indication of the estimation for all except extreme values of x. F rom this table the following comments can be made:

1. A comparison of the true values of u(x) and the means of estimated values shows that there is little evidence of serious bias in the estimator. I f anything, there is a tendency for the mean of estimates to be further from zero than the population value.

2. A comparison of the actual standard deviations of estimates of u(x) and the means of the bootstrap estimates of standard deviations shows that there is a tendency for the standard deviation to be overestimated (by up to about 20%).

3. The coverage of confidence intervals of the form fi(x) -4- 1.96(Bootstrap Standard Error) shows that generally this confidence interval has close to the correct coverage and, if anything, a tendency to be conservative (i.e. greater than 95%). However, the percentile confidence inter- val usually has less than 95% coverage.

Table 2 summarizes the results of the simulations in terms of the test for significant population changes. It shows the percentage of significant results obtained for all combinations of n,/z2, ~r2 and p as defined above. All tests were at a nominal 5% significance level. It can be seen from this table that:

4. The probability of a significant result seems to be slightly too high (at about 0.07) when the null hypothesis is true (#2 = #1 and 0" 2 = 0"1) with a sample size of n = 20.

5. The power of the test increases as the correlation between the two samples increases. 6. The test has reasonable power for detecting a change in the mean or the variance.

In summary, it appears that Equation (2.2) is a more or less unbiased estimator of u(x) values for normally distributed data, for which the standard bootstrap confidence interval is reliable. Further- more, still for normally distributed data, the bootstrap test for no population changes based on the statistic (3.1) seems to have good properties.

5. Estimation of selection functions with logistic regression

Our approach for estimating selection functions by logistic regression relies on the theory discussed

Table 2. Power of a bootstrap test for no change in a distribution based on a likelihood ratio test statistic. All values in the main body of this table are percentages of significant results from 1000 computer generated sets of data. Likely sampling errors (two standard errors) range from 1% for tabulated values of 5% or 95%, to 3 % for tabulated values of 50%. The tests involved 250 bootstrap samples in all cases

#2 p n 0 2 = 0.50"1 c'2 = 0"1 0"2 = 20"1

#1 0.3 20 80.9 6.0 82.5 40 98.8 5.4 99.2

0.7 20 95.7 7.5 95.4 40 100.0 5.9 99.9

#i ÷ 0-1 0.3 20 99.9 86.9 94.9 40 100.0 99.3 99.9

0.7 20 100.0 99.3 99.8 40 100.0 100.0 100.0

Using selection functions 233

by Seber (1984, pp. 310-12) for separate sampling. Estimation involves maximizing a likelihood function although technically, as shown by Anderson and Blair (1982), the estimators are not max- imum likelihood estimators for continuous variables.

The basis of the method in the present context is the idea that if the probability of observing X = x in sample 1 is proportional tofi(x) and the probability of observing 35 = x in sample 2 is propor- tional to w(x)fl(x), then the probability of observing this value of X in sample 1 given that it is observed in one of the samples, is

P(x) = w(x)f l(x)/{f l(x) + w(x)fl(x)} = w(x)/{1 + w(x)}

Hence if w(x) = exp(~0 + C~lX + o~2x 2) then

P(x) = exp(c~ 0 + chx + ~2x2)/{ 1 + exp(~ 0 + C~lX + c~2x2)} (5.1)

and the constants can then be estimated by standard logistic regression methods. Actually, the separate sampling scheme envisaged by Seber (1984) with independent random sam-

ples taken from the two populations being considered does not apply in the situation that we are considering because the same n sample units are measured in both samples. However, logistic regres- sion still provides a means of estimating a selection function of the form w(x) = exp(c~0 + oqx + c~2x2), and it has the useful property of not requiring any particular assumptions about the distribu- tion of 35 at the time when sample 1 is taken. The incorrect sampling model simply means that the usual method for calculating approximate variances and covariances for the estimators of the c~ values should not be used. However, these variances and covariances can be estimated by boot- strapping, with both observations on sample units being resampled together to maintain the observed correlation between the paired results in samples 1 and 2.

What may be of more concern is that there is currently no means of assessing whether the selection function w(x) = exp(c~0 + c~lx + o~2x 2) provides a satisfactory fit for the data. This is a limitation of the present work that has not yet been addressed. We recognize that this is an important consideration because in reality any assumed parametric form for the selection function will just be an approximation for the true unknown function and if lack of fit occurs then alternative parametric forms will need to be considered.

6. Example

As an example we consider data from a research study that was launched in 1972 in response to widespread concern in Scandinavian countries about the effects of acid precipitation (Overrein et al., 1980). As part of this program a survey was carried out on 42 small lakes in southern Norway in 1978 and 1981 (Mohn and Volden, 1985). Table 3 shows the value recorded for pH and the loga- rithms of calcium concentration (Ca, rag/l) in the two years. Logarithms of Ca are considered here because this has a distribution that appears to be close to normal.

To test for a significant change in the variables the statistic ~ of Equation (3.1) was calculated for each of the variables tested using 1000 bootstrap resamples. This gave the following results: pH, A = 0.69, p = 0.029; and loge(Ca), A = 0.93, p = 0.008. There is some evidence of changes in pH for the lakes and strong evidence of changes in Ca.

Selection functions were estimated for each variable using Equation (2.2) and logistic regression. The results are summarized in the first column of graphs in Fig. 2. Both methods of estimation give broadly similar results. Logarithms of selection functions are plotted against Ca rather than loge(Ca) because this is allowable even although the selection function was estimated on the log- transformed data.

234 McDonald, Gonzalez, Manl5

Table 3. Values recorded for pH and the logarithms of calcium concentration (Ca, mg/1) in the years 1978 and 1981

pH log (Ca)

Lakes 1978 1981 1978 1981

1 4.59 4.63 0.278 0.077 2 4.97 4.96 0.278 0.039 3 4.32 4.49 -0.654 -0.755 4 4,97 5.21 0.708 0.495 5 4.58 4.69 -0.416 -0.673 6 4.80 4.94 - 1.347 - 1.470 7 4,72 4.90 -0.528 -0.942 8 4.53 4.54 -0.673 -0.799 9 4.96 5.75 0.798 0.924

10 5.31 5.43 -0.635 -0.400 11 5.42 5,19 -0.371 -0.416 12 5.72 5.70 0.358 0.191 13 5.47 5.38 0.432 0.329 14 4.87 4.90 0.798 0.626 15 5.87 6.02 -0.248 -0.248 16 6.27 6.25 0.140 0.039 17 6.67 6.67 0.904 0.850 18 6.06 6.09 0.779 0.688 19 5.38 5.21 0.742 0.582 20 5.60 5.98 0.621 0.779 21 4.93 4.93 0.372 0.231 22 5.60 5.66 -0.777 -0.994 23 5.97 5.67 0.784 0.615 24 5,07 5.18 -0,713 -0.994 25 6.23 6.29 0.445 0.432 26 6.64 6.37 0,912 0.728 27 6.15 5.68 0.693 0.986 28 4.82 5.45 -0.821 - 1.139 29 5.42 5.54 -1.139 -0.734 30 4.99 5.25 -0.174 -0.635 31 5.31 5.55 -0.371 -0.446 32 5.99 6.13 -0.371 -0.416 33 4.63 4.92 -0.163 -0.261 34 4.47 4.50 -0.139 -0.598 35 4.60 4.66 -0.494 -0.734 36 4.88 4.92 - 1.022 - 1.386 37 4.60 4.84 1.244 0.959 38 4.85 4.84 0.531 0.262 39 5.06 5.11 -0.211 -0.315 40 5.97 6.17 -0.186 -0.117 41 5.47 5.82 -0.236 -0.274 42 6.05 5.57 1.068 0.215

Mean 5.30 5.38 0.03 -0.11 StDev 0.62 0.57 0.66 0.68

Using selection functions 235

1.5 u(x) 1.0

0.5-

0.0

-03

-1.0

-1.5 L

1, t 1,0

00

-1,0

-15

. . . . , . . . . , . . . . , , . . . . ,

1.5 i 1.0~

o.~

o.o[

4.~ 5.~ 5.~ 6.~ 4.~ 5.~ 5.~ 6,~ ¢~ 5.~ 5.~ 6.~

1.5 U(X)

1.0

0.5

0,0 -0.5- -1,0- .1.5-

Ca

~ L E

,,] 1.0

0"5 t 0 . 0

-0.5

-1,0

-1.5

' i 2 ' 3 i i ' 2 ' i ' 4 1 2 s

1.0 0.5 0.0 .0.5 -1,0 -1.5

Fig. 2. Selection functions with bootstrap confidence limits for water chemical variables measured in lakes in southern Norway. The first row of graphs gives the results for the variable pH and the second row for the variable Ca. The columns of graphs give (1) logarithms of selection functions estimated using equation (2.2) (E) and logistic regression (L); (2) estimates and confidence limits from equation (2.2); and (3) estimates and bootstrap confidence limits using logistic regression.

The second column of graphs in Fig. 2 shows the estimates of logarithms of selection functions from Equation (2.2) together with the standard bootstrap 95% confidence limits of Estimate i 1.96(Bootstrap Standard Deviation). The third column is similar but for logistic regression estimates. In a real application only one method for estimating selection functions would be used. Here we give the results for both methods so that they can be compared. If anything, estimation from Equation (2.2) is probably best in this application because the distributions of the variables that were used for estimation were close to normal.

The following comments can be made on the basis of the graphs.

1. For pH it appears that there was some reduction in very low values between the sample times because logarithms of estimated selection functions are below zero for low pH and above or close to zero otherwise. As low pH corresponds to high acidity there is no indication of increased acidity.

2. For Ca low values appear to have increased and high values to have decreased. This variable indicates buffer capacity in the watershed, with low values possibly suggesting vulnerability to acid precipitation (Mohn and Volden, 1985). The lakes may therefore have been more vulner- able in 1981 than they were in 1978.

7. Discussion

We have shown that the estimation of a selection function provides a useful method for describing

236 McDonald, Gonzalez, Manly

the changes in the distribution of environmentally important variables between two sample times, and that bootstrapping can be used to assess the level of sampling errors and to test for a significant change. The method appears to work well with data that are either normally distributed or that can be transformed to normality, using Equation (2.2) for estimation. For non-normal data estimation by logistic regression holds promise, although more work is needed on the assessment of goodness of fit, model selection and the effectiveness of bootstrapping.

There are a number of other areas where further research would be worthwhile, including the generalization of the theory to cover the use of a multivariate selection function to describe the changes in the distribution of several related environmental variables measured on the same sample units at two or more points in time. Kernel estimation also offers the possibility of producing improved estimation of selection functions with non-normal data. We have investigated this to some extent but do not report the results here because more work is required on methods to determine window widths for the estimation of the ratio of p.d.f.'s rather than just the p.d.f.'s themselves.

Acknowledgements

We are pleased to acknowledge the support of the US Environmental Protection Agency's Envir- onmental Monitoring and Assessment Program and the Environmental Research Laboratory, Cor- vallis, Oregon. Although the research described in this article has been supported by the US Environmental Protection Agency through contract 68 C0-0021 to Technical Resources, Inc. and Western EcoSystems Technology, Inc., it has not been subjected to Agency review and therefore does not necessarily reflect the views of the Agency and no official endorsement should be inferred. We also thank several referees for useful comments on earlier versions of this paper.

References

Anderson, J.A. and Blair, V. (1982) Penalized maximum likelihood estimation in logistic regression and discrimination. Biometrika, 69, 123-36.

Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman and Hall, London. Kullback, S. (1983) Kullback information. Encyclopedia of Statistical Sciences, 4, 421-5. Manly, B.F.J. (1985) The Statistics of Natural Selection on Animal Populations. Chapman and Hall, London. Manly, B.F.J., McDonald, L.L. and Thomas, D.L. (1993) Resource Selection by Animals: StatistieaIDesign and

Analysis for Field Studies. Chapman and Hall, London. Mohn, E. and Volden, R. (1985) Acid precipitation: effects on small lake chemistry. In Data Analysis in Real

Life Environment: Ins and Outs of Solving Problems, J.F. Marcotorchino, J.M. Proth and J. Janssen (eds), pp, 191-6.

Overrein, L.N., Seip, H.M. and Tollan, A. (1980) Acid Precipitation - Effects on Forest and Fish: Final Report. Norwegian Institute for Water Research, Oslo.

Overton, W.S., White, D. and Stevens, D.L. (1990) Design Report for EMAP. Report EPA/600/3-91/053, United States Environmental Protection Agency.

Patil, G.P. (1991) Encounter data, statistical ecology, environmental statistics, and weighted distribution methods. Environmetrics, 2, 377-423.

Patil, G.P. and Rao, C.R. (1978) Weighted distributions and size-biased sampling with applications to wildlife populations and human families. Biometrics, 34, 179-89.

Pearson, K. (1903) Mathematical contributions to the theory of evolution. XL On the influence of natural selection on the variability and correlation of organisms. Philosophical Transactions of the Royal Society of London, Series A, 200, 1-66.

Using selection functions 237

Seber, G.A.F. (1984) Multivariate Observations. Wiley, New York. Wichmann, B.A. and Hill, I.D. (1982) Algorithm AS183: an efficient and portable pseudo-random number

generator. Applied Statistics, 31, 188-90.

Biographical sketches

Lyman L. McDonald is President of Western EcoSystems Technology, Inc. of Cheyenne, Wyoming. After studying at Oklahoma State University and Colorado State University he has worked as an academic and consulting statistician for over 25 years with comprehensive experience in the appli- cation of statistical methods to design, conduct, and analyse environmental and laboratory studies. His specialties include sampling of biological communities, calibration of biased sampling proce- dures, jackknife and bootstrap procedures, capture-recapture and tag-recovery statistics, general linear models, multivariate analysis, and environmental impact assessment and monitoring pro- grams. He is the author of more than 50 papers in the scientific literature and is a joint author of Resource Selection by Animals: Statistical Design and Analysis for Field Studies.

Liliana Gonzalez is a Lecturer in the Department of Finance and Quantitative Analysis at the University of Otago, New Zealand. She holds degrees from Quindio University in Colombia, South America, the University of Houston and the University of Wyoming. Since finishing her PhD in geostatistics in 1990, she has been involved in research and consulting in the areas of environmental modelling and monitoring. A common theme has been the simulation of complex situations in ecol- ogy and project management. She is a consultant for Ecopetrol, the Colombian state-owned oil company, for the monitoring of fish populations and water quality in rivers affected by oil spills.

Bryan F.J. Manly is Professor of Statistics and Head of the Department of Mathematics and Statistics at the University of Otago, New Zealand. He was educated at the City University in London, from which he holds a DSc degree, and is also a Chartered Statistician of the Royal Sta- tistical Society and a Fellow of the Royal Society of New Zealand. During his career he has taught in Britain, Papua New Guinea, the United States and New Zealand, and has been a researcher and consultant in a wide variety of areas largely related to environmental and ecological problems. He is the author of six books, the editor of several conference proceedings and a book series, and has written many papers for scientific journals. His current interests are mainly in sampling wildlife populations and computer-intensive statistics.