ENVIRONMETRICS
Environmetrics 2007; 18: 27–53
Published online 12 June 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/env.800
A Bayesian hierarchical model for over-dispersed count data: a case study for abundance of hake recruits
Jorge M. Mendes1, K. F. Turkman2*,† and Ernesto Jardim3
1Higher Institute of Statistics and Information Management, New University of Lisbon, Lisbon, Portugal
2Faculty of Sciences, University of Lisbon, Lisbon, Portugal
3IPIMAR Lisbon, Portugal
SUMMARY
In this paper, we introduce a Bayesian hierarchical model to estimate the abundance of hake recruits, as well as to study their spatial distributional patterns in Portuguese territorial waters. The main objective of the paper is to improve on traditional empirical methods based on sampling averages and variances by using probabilistic models that capture spatial dependence structures. Copyright © 2006 John Wiley & Sons, Ltd.
key words: Bayesian hierarchical models; Gibbs sampling; MCMC; over-dispersed count data; hake recruitment; abundance index
1. INTRODUCTION
In this paper, we introduce a probabilistic model to study spatial distribution patterns for over-
dispersed processes and apply it to hake recruits, estimating their abundance index in any given area.
There will be no attempt to study the time dynamics of these populations. Although data covering the
whole Portuguese continental shelf are available, this study is restricted to the data covering the
northern shelf between Caminha and Berlengas, owing to computational restrictions (data are described
in detail in Section 2). Figure 1 shows the sampling grids used in this study.
The observed data contain an excessive number of zeros as well as some large, extreme values. For
example, the data set contains 99 zero counts out of 272 observations and the range, excluding the zero
counts, is 606. Due to the observed over-dispersion, the conventional Poisson model or its normal
approximation would not be suitable for this data set. Therefore, we explore a model specially adapted
to handle data whose over-dispersion comes from the excess of zeros as well as from heterogeneity of
the population.
*Correspondence to: K. F. Turkman, Faculty of Sciences, University of Lisbon, Lisbon, Portugal.
†E-mail: kfturkman@fc.ul.pt
Contract/grant sponsor: POCTI/MAT/44082/2002.
Received 13 October 2005; Accepted 16 April 2006
In the analysis, four strata covariates plus an index covariate, representing the abundance at
year $t-1$ of hake that are able to spawn, are included. We suspect that the observed level of recruits is related
to physical and biological conditions such as the level of adult hake, bottom type, latitude, and depth,
as well as to human-dependent factors such as the time difference to sunrise or sunset when the
observation was made. A hidden, unobserved random field is also included as a covariate in the model.
This random field is necessary in this application because we have substantial prior belief that the counts at
‘nearby’ locations are correlated. This is due (at least in part) to the fact that the fish are attracted together
to specific habitats, and we know that habitat is correlated in space. Typically, one can view this process as
accounting for the effects of unknown spatial covariates, since it induces spatial structure in the count data.
Table 1 shows the covariates used in the analysis, as well as the coding used for the centered covariates.
From exploratory analysis, we conclude that there are no substantial spatio-temporal interactions in
the data worth modeling, considering the short (10 years) temporal component. Annual data at
each spatial location show remarkably stable, trend-free behavior and therefore we concentrate on
modeling the spatial structure, ignoring time dynamics.
Spatial locations of data are given in geographical spherical coordinates, namely in
degrees of longitude and latitude. Because this coordinate system is not isotropic, in the sense that the
two measures, latitude and longitude, do not measure the same distance, a transformation is made in
order to use an isotropic spatial process as a model. Hence, the longitude values for each year are
multiplied by $\cos(\overline{\mathrm{lat}} \times \pi/180)$, where $\overline{\mathrm{lat}}$ is the sample mean latitude for that year.
Figure 1. Grid of Portuguese surveys between 1990 and 1999 in North shelf
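The longitude rescaling described above can be sketched in a few lines of Python. This is only an illustration: the function name `isotropic_longitude` and the coordinates are hypothetical and are not taken from the survey data.

```python
import math

def isotropic_longitude(longitudes, latitudes):
    """Scale longitudes (degrees) by cos(mean latitude * pi / 180).

    After scaling, one unit of longitude covers roughly the same ground
    distance as one unit of latitude near the mean latitude.
    """
    mean_lat = sum(latitudes) / len(latitudes)
    scale = math.cos(mean_lat * math.pi / 180.0)
    return [lon * scale for lon in longitudes]

# Hypothetical coordinates, chosen only to resemble the study area.
lats = [39.5, 40.5, 41.5]
lons = [-9.5, -9.0, -8.8]
scaled = isotropic_longitude(lons, lats)
```

The scaling factor is computed once per year from that year's mean latitude, so the correction is approximate for locations far from the mean.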
Traditional methods of estimating abundance of recruits are empirical and are based on the sample
average. However, these empirical methods do not take into consideration the strong spatial and
temporal dependence of the observations and as a consequence, sample averages as well as their
variances tend to underestimate the abundance and their sampling variation.
The paper is organized as follows. Section 2 describes the data. In Section 3, we look at statistical
models for count data which are particularly suited for taking into account the over-dispersion that
exists in the data. We then introduce a Bayesian hierarchical model based on these counting processes.
In Section 4, we give predictions as well as a model fitting evaluation of this model. Finally, in
Section 5, we give conclusions on the performance of this model.
2. DATA
Portuguese bottom trawl surveys have been carried out by the Portuguese Institute for Fisheries and
Sea Research (IPIMAR, Instituto de Investigação das Pescas e do Mar) on the Portuguese continental
waters since June 1979 on board the R/V Noruega and R/V Capricórnio, twice a year in Summer and
Autumn. The main objectives of these surveys are: (i) to estimate indices of abundance and biomass of
the most important commercial species; (ii) to describe the spatial distribution of the most important
commercial species; (iii) to collect individual biological parameters such as maturity, sex-ratio,
weight, food habits, etc. (SESITS, 1999). The target species are hake (Merluccius merluccius), horse
mackerel (Trachurus trachurus), mackerel (Scomber scombrus), blue whiting (Micromesistius
poutassou), megrims (Lepidorhombus boscii and L. whiffiagonis), monkfish (Lophius budegassa
and L. piscatorius), and Norway lobster (Nephrops norvegicus). The gear used has been a Norwegian
Campbell Trawl 1800/96 (NCT) with a codend of 20 mm mesh size, mean vertical opening of 4.8 m,
and mean horizontal opening between wings of 15.6 m (ICES, 2002).

Table 1. Strata covariates used in sampling (centered code in parentheses)

Latitude
  [41.25, 41.83[ (-2)
  [40.83, 41.25[ (-1)
  [40.33, 40.83[ (0)
  [39.83, 40.33[ (1)
  [39.33, 39.83[ (2)
Depth (meters)
  [20, 100[ (-1)
  [100, 200[ (0)
  [200, 500[ (1)
Bottom type
  Rock/Granules/Coarse sand (-2)
  Sand (-1)
  Fine Sand (0)
  Sand+Clay/Granules+Clay (1)
  Clay (2)
Difference between sun rising (or sunset) and beginning of trawl
  +60 min before sun rise (-3)
  (60 min, 1 min) before sun rise (-2)
  (0 min, 60 min) after sun rise (-1)
  (61 min after, 61 min before) sunset (0)
  60 min before sunset (1)
  (1 min, 60 min) after sunset (2)
  +61 min after sunset (3)
These surveys follow a stratified random sampling design with 97
sampling locations distributed over 12 sectors. The sampling locations are fixed from year to year
and were selected based on historical records of clear tow positions and the allocation of at least
two samples per stratum. Each sector was subdivided into 4 depth ranges: 20–100, 101–200, 201–500,
and 501–750 m, giving a total of 48 strata. The tow duration during the period of analysis was
60 min.
The data we have are the counts of recruits (individuals with age 0) in hauls of 60min
collected by the Autumn surveys from 1990 to 1999. The whole data set contains a total of 272
observations.
3. METHODS
3.1. Statistical models for count data
Typically, a Poisson model is assumed for modeling count data. However, this model imposes
equal mean and variance on the data, and fails to account for the over-dispersion that characterizes many
data sets. In many applications, the source of over-dispersion comes from the excess of zeros in the
data set. In the case of the Poisson model, one views a data set as zero inflated if there are more zeros
than the Poisson model can accommodate. If not properly modeled, the presence of excess zeros can
invalidate the distributional assumptions of the analysis, jeopardizing the integrity of the scientific
inferences (Lambert, 1992).
Over-dispersion can also be due to a heterogeneity in the population along with the excess of zeros.
In this case, the model should handle both of the problems.
Over-dispersed data can be modeled by the zero-inflated Poisson (ZIP) distribution (Johnson and
Kotz, 1962; Cohen, 1963), defined as follows:
$$Z_i \sim \begin{cases} 0, & \text{with probability } p_i \\ \mathrm{Poisson}(\lambda_i), & \text{with probability } 1 - p_i \end{cases} \qquad (1)$$
so that
$$\begin{cases} Z_i = 0, & \text{with probability } p_i + (1 - p_i)e^{-\lambda_i} \\ Z_i = z, & \text{with probability } (1 - p_i)\dfrac{e^{-\lambda_i}\lambda_i^{z}}{z!}, \quad z > 0 \end{cases}$$
This model implies that $P[Z_i = 0] = p_i + (1 - p_i)e^{-\lambda_i}$ and $P[Z_i = z] = (1 - p_i)\,e^{-\lambda_i}\lambda_i^{z}/z!$, for
$z = 1, 2, \ldots$. This model handles the over-dispersion, as we can see from its mean and
variance:
$$E[Z_i] = (1 - p_i)\lambda_i \quad \text{and} \quad \mathrm{Var}[Z_i] = (1 - p_i)\lambda_i + p_i(1 - p_i)\lambda_i^2 \qquad (2)$$
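As a quick numerical check of the ZIP mean and variance in Equation (2), the following Python sketch sums the probability mass function over a truncated support. The function name `zip_pmf` and the values of `p` and `lam` are illustrative assumptions, not estimates from the hake data.

```python
import math

def zip_pmf(z, p, lam):
    """P[Z = z] for a zero-inflated Poisson with zero-inflation probability p and rate lam."""
    poisson = math.exp(-lam + z * math.log(lam) - math.lgamma(z + 1))
    return (p if z == 0 else 0.0) + (1.0 - p) * poisson

p, lam = 0.3, 4.0            # illustrative values only
support = range(200)         # truncation; the Poisson(4) tail beyond 200 is negligible
mean = sum(z * zip_pmf(z, p, lam) for z in support)
var = sum((z - mean) ** 2 * zip_pmf(z, p, lam) for z in support)

# Closed forms from Equation (2) for comparison.
mean_formula = (1 - p) * lam
var_formula = (1 - p) * lam + p * (1 - p) * lam ** 2
```

Because the variance exceeds the mean whenever `p` is strictly between 0 and 1, the ZIP model is over-dispersed relative to a Poisson model with the same mean.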
Furthermore, to account for the effects of the explanatory covariates whose values potentially affect
the ZIP parameters, one may consider generalized linear models with link functions $\ln(\boldsymbol{\lambda}) = X\boldsymbol{\beta}$
and $\mathrm{logit}(\mathbf{p}) = Y\boldsymbol{\gamma}$, where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_n)$, $\mathbf{p} = (p_1, \ldots, p_n)$, and $X$ and $Y$ are covariate matrices. If
the same covariates affect $\mathbf{p}$ and $\boldsymbol{\lambda}$, it is natural to reduce the number of parameters by thinking of $\mathbf{p}$ as a
function of $\boldsymbol{\lambda}$. Assuming that the function is known up to a constant nearly halves the number of
parameters needed for ZIP regression and may accelerate the computations considerably. In many
applications, however, there is little prior information about how $\mathbf{p}$ relates to $\boldsymbol{\lambda}$. If so, a natural
parameterization is
$$\log(\boldsymbol{\lambda}) = X\boldsymbol{\beta} \quad \text{and} \quad \mathrm{logit}(\mathbf{p}) = -\tau X\boldsymbol{\beta} \qquad (3)$$
for an unknown, real-valued shape parameter $\tau$, which implies that $p_i = (1 + \lambda_i^{\tau})^{-1}$. Although the idea
of the ZIP model is straightforward, practical implementation of the model still requires careful
examination of the relationships between the parameters $\mathbf{p}$ and $\boldsymbol{\lambda}$ (Lambert, 1992).
The zero-inflated negative binomial (ZINB) model extends the ZIP model to handle over-dispersed
count data not only due to zero-inflation but also due to other forms of heterogeneity. The model is as
follows:
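$$Z_i \sim \begin{cases} 0, & \text{with probability } p_i \\ \mathrm{NegBin}(\psi_i, \nu_i), & \text{with probability } 1 - p_i \end{cases} \qquad (4)$$
Setting $\psi_i = \dfrac{1}{1 + \delta\lambda_i^{k}}$ and $\nu_i = \dfrac{\lambda_i^{1-k}}{\delta}$, we get
$$P[Z_i = z] = \begin{cases} (1 - p_i)\,\mathrm{NegBin}(z; \lambda_i, \delta), & z > 0 \\ p_i + (1 - p_i)\,\mathrm{NegBin}(0; \lambda_i, \delta), & z = 0 \end{cases} \qquad (5)$$
that is,
$$P[Z_i = z] = \begin{cases} (1 - p_i)\,\dfrac{\Gamma\!\left(z + \frac{\lambda_i^{1-k}}{\delta}\right)}{\Gamma(z + 1)\,\Gamma\!\left(\frac{\lambda_i^{1-k}}{\delta}\right)}\left(1 + \delta\lambda_i^{k}\right)^{-\lambda_i^{1-k}/\delta}\left(1 + \dfrac{\lambda_i^{-k}}{\delta}\right)^{-z}, & z > 0 \\[2ex] p_i + (1 - p_i)\left(1 + \delta\lambda_i^{k}\right)^{-\lambda_i^{1-k}/\delta}, & z = 0 \end{cases} \qquad (6)$$
where $\delta \geq 0$ is a dispersion parameter that is assumed not to depend on covariates. Modeling $\mathbf{p}$ and $\boldsymbol{\lambda}$
can follow the same strategy as for the ZIP model.
The mean and the variance of this model are given by
$$E[Z_i] = (1 - p_i)\lambda_i \quad \text{and} \quad \mathrm{Var}[Z_i] = (1 - p_i)\lambda_i + p_i(1 - p_i)\lambda_i^2 + \delta(1 - p_i)\lambda_i^{k+1} \qquad (7)$$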
It is clear from the previous expression that the ZINB model can accommodate a higher dispersion than
the ZIP model. This model reduces to the ZIP in the limit $\delta \to 0$. The mean of the underlying negative
binomial distribution is $\lambda_i$. The index $k$ identifies the particular form of negative binomial distribution
(Saha and Dong, 1997); for $k = 0$, the variance of the underlying negative binomial distribution is
$(1 + \delta)\lambda_i$ and for $k = 1$, the variance is $\lambda_i + \delta\lambda_i^2$. The parameterization therefore includes the two
most common forms of negative binomial (e.g., McCullagh and Nelder, 1989, p. 199).
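The role of the index k can be checked numerically. The sketch below evaluates the ZINB probability mass function of Equation (6) and compares its moments with Equation (7) for k = 0 and k = 1. The function names and parameter values are illustrative assumptions, and `delta` stands for the dispersion parameter.

```python
import math

def zinb_pmf(z, p, lam, delta, k):
    """ZINB probability of count z, following Equation (6)."""
    size = lam ** (1 - k) / delta
    log_nb = (math.lgamma(z + size) - math.lgamma(z + 1) - math.lgamma(size)
              - size * math.log1p(delta * lam ** k)
              - z * math.log1p(lam ** (-k) / delta))
    return (p if z == 0 else 0.0) + (1.0 - p) * math.exp(log_nb)

def moments(p, lam, delta, k, upper=2000):
    """Mean and variance by direct summation over a truncated support."""
    probs = [zinb_pmf(z, p, lam, delta, k) for z in range(upper)]
    mean = sum(z * q for z, q in enumerate(probs))
    var = sum((z - mean) ** 2 * q for z, q in enumerate(probs))
    return mean, var

def variance_formula(p, lam, delta, k):
    """Variance from Equation (7)."""
    return (1 - p) * lam + p * (1 - p) * lam ** 2 + delta * (1 - p) * lam ** (k + 1)

p, lam, delta = 0.3, 4.0, 0.5        # illustrative values only
mean_k0, var_k0 = moments(p, lam, delta, k=0)
mean_k1, var_k1 = moments(p, lam, delta, k=1)
```

With these values the mean is the same for both forms, while the k = 1 form gives the larger variance because its extra term grows with the square of the rate.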
One possibility to cope with the problem of over-dispersion is to assume that the heterogeneity
present in the data can be adequately described by some density $f(\lambda_i)$ defined on the population of the
possible Poisson parameters $\lambda_i$. Since this heterogeneity cannot be observed directly, it is also called
latent. We can only observe counts coming from the mixture density
$$P[Z_i = z_i] = \int_0^{\infty} \mathrm{Poisson}(z_i; \lambda_i)\, f(\lambda_i)\, d\lambda_i \qquad (8)$$
Expression (8) defines a doubly stochastic Poisson process (DSP) (Cox and Isham, 1980, pp. 10, 50,
and 70; Cressie, 1991, p. 657) whose precise form depends upon the specific choice of $f(\lambda)$. For
certain parametric forms, such as the gamma density which we shall examine in this paper, a closed
form expression for (8) can be obtained and in fact corresponds to the negative binomial model with
specific parameters. Indeed, if $\lambda_i$ has a prior gamma density $\mathrm{Ga}(\alpha, \beta)$ defined by
$$f(\lambda_i) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda_i^{\alpha - 1} e^{-\beta\lambda_i}, \qquad \alpha, \beta > 0$$
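we get a ZINB model. The ZIP model defined in Equation (1) can be rewritten as
$$Z_i \mid \lambda_i \sim \begin{cases} 0, & \text{with probability } p_i \\ \mathrm{Poisson}(\lambda_i), & \text{with probability } 1 - p_i \end{cases}$$
whose marginal density is given by
$$P[Z_i = z_i] = \int_0^{\infty} \mathrm{Poisson}(z_i; \lambda_i)\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda_i^{\alpha - 1} e^{-\beta\lambda_i}\, d\lambda_i = \frac{\Gamma(z_i + \alpha)}{\Gamma(z_i + 1)\,\Gamma(\alpha)}\left(\frac{\beta}{1 + \beta}\right)^{\alpha}\left(\frac{1}{1 + \beta}\right)^{z_i} \qquad (9)$$
Therefore it is equivalent to:
$$Z_i \sim \begin{cases} 0, & \text{with probability } p_i \\ \mathrm{NegBin}(\alpha, \beta), & \text{with probability } 1 - p_i \end{cases}$$
or, setting $\alpha = \dfrac{\omega^{1-k}}{\delta}$ and $\beta = \dfrac{\omega^{-k}}{\delta}$, we get the same model as in Equation (6):
$$\begin{cases} (1 - p_i)\,\dfrac{\Gamma\!\left(z + \frac{\omega^{1-k}}{\delta}\right)}{\Gamma(z + 1)\,\Gamma\!\left(\frac{\omega^{1-k}}{\delta}\right)}\left(1 + \delta\omega^{k}\right)^{-\omega^{1-k}/\delta}\left(1 + \dfrac{\omega^{-k}}{\delta}\right)^{-z}, & z > 0 \\[2ex] p_i + (1 - p_i)\left(1 + \delta\omega^{k}\right)^{-\omega^{1-k}/\delta}, & z = 0 \end{cases}$$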
It is also clear from the above results that a DSP model using a gamma density for $\lambda_i$ leads to the
same results, apart from being zero-inflated, that is:
$$Z_i \mid \lambda_i \sim \mathrm{Poisson}(z_i; \lambda_i), \qquad \lambda_i \sim \mathrm{Ga}\!\left(\alpha = \frac{\omega^{1-k}}{\delta},\ \beta = \frac{\omega^{-k}}{\delta}\right) \qquad (10)$$
has a marginal negative binomial distribution of the form:
$$P[Z_i = z] = \frac{\Gamma\!\left(z + \frac{\omega^{1-k}}{\delta}\right)}{\Gamma(z + 1)\,\Gamma\!\left(\frac{\omega^{1-k}}{\delta}\right)}\left(1 + \delta\omega^{k}\right)^{-\omega^{1-k}/\delta}\left(1 + \frac{\omega^{-k}}{\delta}\right)^{-z} \qquad (11)$$
where
$$E[Z_i] = \omega$$
and
$$\mathrm{Var}[Z_i] = \omega + \delta\omega^{k+1}$$
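The equivalence between the gamma-mixed Poisson in Equation (10) and the negative binomial in Equation (11) can be verified numerically. The sketch below integrates the Poisson probability against the gamma density with a simple midpoint rule and compares the result with the closed form. Function names, the integration limits and the parameter values are illustrative assumptions.

```python
import math

def nb_closed_form(z, omega, delta, k):
    """Negative binomial probability from Equation (11)."""
    size = omega ** (1 - k) / delta
    return math.exp(math.lgamma(z + size) - math.lgamma(z + 1) - math.lgamma(size)
                    - size * math.log1p(delta * omega ** k)
                    - z * math.log1p(omega ** (-k) / delta))

def nb_by_mixing(z, omega, delta, k, upper=100.0, steps=20_000):
    """Midpoint-rule approximation of the mixture integral in Equation (8)."""
    alpha = omega ** (1 - k) / delta      # gamma shape
    beta = omega ** (-k) / delta          # gamma rate
    h = upper / steps
    total = 0.0
    for j in range(steps):
        lam = (j + 0.5) * h
        log_pois = -lam + z * math.log(lam) - math.lgamma(z + 1)
        log_gamma = (alpha * math.log(beta) - math.lgamma(alpha)
                     + (alpha - 1) * math.log(lam) - beta * lam)
        total += math.exp(log_pois + log_gamma) * h
    return total

omega, delta, k = 4.0, 0.5, 1            # illustrative values only
checks = [(z, nb_by_mixing(z, omega, delta, k), nb_closed_form(z, omega, delta, k))
          for z in (0, 3, 10)]
```

The two columns of `checks` should agree up to the discretization error of the midpoint rule, which is why a closed form is convenient: no numerical integration is needed inside the sampler.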
Other choices of models for the random intensity $\lambda$ may be considered; however, the resulting
Poisson process may not have a distribution with a closed form and hence may be computationally
cumbersome.
Based on the ZINB model for count data given above, we now construct a Bayesian hierarchical
model for the hake recruitment data.
3.2. The Bayesian hierarchical model
Let $Z = \{Z(s, t);\ s \in D \subset \mathbb{R}^2,\ t = 0, 1, 2, \ldots, 10\}$ be the observed count (hake recruitment) at year
$t$ and spatial location $s_i = (\mathrm{latitude}_i, \mathrm{longitude}_i)$. The data consist of $z(s_i, t)$, $i = 1, 2, \ldots, M_t$ at
$M_t$ spatial locations, $t = 1, 2, \ldots, 10$ (see Figure 1), as well as a set of covariates
$X(s, t) = \{X_k(s, t);\ s \in D \subset \mathbb{R}^2,\ t = 0, 1, 2, \ldots, 10\}$, $k = 1, \ldots, 5$, with observations $x_k(s_i, t)$,
$i = 1, 2, \ldots, M_t$ at $M_t$ spatial locations, $t = 1, 2, \ldots, 10$, $k = 1, \ldots, 5$.
Let $R = \{R(s, t);\ s \in D \subset \mathbb{R}^2,\ t = 0, 1, 2, \ldots, 10\}$ be a binary hidden process that takes the value 0
if the observation $z(s, t)$ is a structural zero, or 1 if $z(s, t)$ is an observation (greater than or equal to zero)
coming from a Poisson distribution. The likelihood of our model is simply the joint distribution of the
observed values $Z(s_i, t)$ and the binary hidden process $R$:
$$L = p(Z, R) = p(Z \mid R)\,p(R) \qquad (12)$$
We now make a series of conditional independence assumptions to simplify the likelihood into a
computationally workable form. Before explaining in detail the model assumptions, we first give a
summary of the hierarchical model and its implementation with a slight abuse of notation for easy
reference:
* Level 1: Likelihood
$$z(s_i, t) \mid \lambda(s_i, t), R(s_i, t) \sim \begin{cases} \mathrm{Poisson}(\lambda(s_i, t)), & R(s_i, t) = 1 \\ 0, & R(s_i, t) = 0 \end{cases} \qquad (13)$$
$$R(s_i, t) \mid \theta(s_i, t) \sim \mathrm{Bernoulli}(\theta(s_i, t))$$
$$\lambda(s_i, t) \mid \delta, \omega(s_i, t) \sim \mathrm{Ga}\!\left(\frac{1}{\delta},\ \frac{1}{\delta\,\omega(s_i, t)}\right)$$
* Level 2: Link functions
The logarithm of $\omega(s_i, t)$ and the logit of $\theta(s_i, t)$ at location $s_i$ and time $t$ are,
respectively, linear functions of the covariates $X_k(s_i, t)$, $k = 1, \ldots, 5$, and a hidden, unobserved
Gaussian field. Specifically, let
$$\boldsymbol{\gamma}^{(\omega)} = \left(\gamma_1^{(\omega)}, \ldots, \gamma_5^{(\omega)}\right)$$
and
$$\boldsymbol{\gamma}^{(\theta)} = \left(\gamma_1^{(\theta)}, \ldots, \gamma_5^{(\theta)}\right)$$
be two sets of regression coefficients. Then
$$\log(\omega(s_i, t)) = \boldsymbol{\gamma}^{(\omega)} X(s_i, t) + \eta_t(s_i)$$
and
$$\mathrm{logit}(\theta(s_i, t)) = \boldsymbol{\gamma}^{(\theta)} X(s_i, t) + \eta_t(s_i)$$
* Level 3: Priors for parameters and hyperparameters
$\boldsymbol{\eta}_t$ is an isotropic Gaussian random field, independent and identically distributed in time. The priors
for the other parameters are presented in Subsection 3.2.3.
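To make the three levels concrete, the following Python sketch simulates one year of counts from the hierarchy, reading it from the bottom up: a Gaussian field with exponential covariance, the two link functions, and then the Bernoulli, gamma and Poisson draws. It is a minimal illustration under stated assumptions, not the WinBUGS implementation used in the paper: all function names, coordinates, covariate values and parameter values are hypothetical, and `delta`, `tau2_eta` and `phi` stand for the dispersion, field precision and spatial decay parameters.

```python
import math
import random

def exp_cov(coords, tau2_eta, phi):
    """Exponential covariance: exp(-phi * distance) / tau2_eta."""
    n = len(coords)
    return [[math.exp(-phi * math.dist(coords[i], coords[j])) / tau2_eta
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def poisson_draw(rng, lam):
    """Poisson draw: count unit-rate exponential arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_year(coords, X, gamma_omega, gamma_theta, delta, tau2_eta, phi, seed=0):
    rng = random.Random(seed)
    low = cholesky(exp_cov(coords, tau2_eta, phi))
    eps = [rng.gauss(0.0, 1.0) for _ in coords]
    # Level 3: one realization of the Gaussian field at the sampling locations.
    eta = [sum(low[i][m] * eps[m] for m in range(i + 1)) for i in range(len(coords))]
    counts = []
    for x_i, eta_i in zip(X, eta):
        # Level 2: log link for omega and logit link for theta.
        omega = math.exp(sum(g * x for g, x in zip(gamma_omega, x_i)) + eta_i)
        lin_theta = sum(g * x for g, x in zip(gamma_theta, x_i)) + eta_i
        theta = 1.0 / (1.0 + math.exp(-lin_theta))
        # Level 1: structural-zero indicator, gamma intensity, Poisson count.
        r = 1 if rng.random() < theta else 0
        lam = rng.gammavariate(1.0 / delta, delta * omega)  # shape 1/delta, scale delta*omega
        counts.append(poisson_draw(rng, lam) if r == 1 else 0)
    return counts

# Hypothetical inputs: four locations, five centered covariates each.
coords = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.3, 0.3)]
X = [[-1, 0, 1, 0, 0.2], [0, 1, -1, 2, -0.1], [1, -1, 0, -2, 0.0], [2, 0, 2, 1, 0.3]]
gamma_omega = [0.1, -0.2, 0.0, 0.5, -0.1]
gamma_theta = [0.0, -0.3, -0.2, 0.3, -0.1]
counts = simulate_year(coords, X, gamma_omega, gamma_theta,
                       delta=0.5, tau2_eta=1.0, phi=3.0, seed=1)
```

Python's `random.gammavariate` takes a shape and a scale, so the rate 1/(delta * omega) of the Level 1 gamma distribution is passed as the scale delta * omega.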
We now explain in detail the assumptions leading to this model.
3.2.1. Likelihood: dependence and distributional assumptions
* Assumption 1
We assume that, conditional on $R(s_i, t)$, the $Z(s_i, t)$ are Poisson random variables, independent in time
and space with mean $\lambda(s_i, t)$, so that:
$$p(Z \mid R) = \prod_{t=1}^{10}\prod_{i=1}^{M_t} p(Z(s_i, t) \mid R(s_i, t)) \qquad (14)$$
where
$$p(Z(s_i, t) \mid R(s_i, t)) = \begin{cases} \mathrm{Poisson}(\lambda(s_i, t)), & \text{with probability } \theta(s_i, t) \\ 0, & \text{with probability } 1 - \theta(s_i, t) \end{cases} \qquad (15)$$
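* Assumption 2
We assume that, given the spatio-temporal process $\theta(s_i, t)$, the $R(s_i, t)$ are Bernoulli variables,
independent in space and time:
$$p(R \mid \boldsymbol{\theta}) = \prod_{t=1}^{10}\prod_{i=1}^{M_t} p(R(s_i, t) \mid \theta(s_i, t)) \qquad (16)$$
where $p(R(s_i, t) \mid \theta(s_i, t)) \sim \mathrm{Bernoulli}(\theta(s_i, t))$.
* Assumption 3
The spatio-temporal process $\boldsymbol{\lambda}$, conditional on $\delta$ and a spatio-temporal process $\omega(s_i, t)$, consists of gamma
random variables independent in space and time, so that
$$p(\boldsymbol{\lambda} \mid \delta, \boldsymbol{\omega}) = \prod_{t=1}^{10}\prod_{i=1}^{M_t} p(\lambda(s_i, t) \mid \delta, \omega(s_i, t)) \qquad (17)$$
where
$$p(\lambda(s_i, t) \mid \delta, \omega(s_i, t)) \sim \mathrm{Ga}(\alpha, \beta(s_i, t))$$
and $\alpha$ and $\beta(s_i, t)$ can be obtained from Equation (10) by letting $k = 1$ and making the appropriate modification
in notation.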
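* Assumption 4
Conditional on the hidden spatial process $\boldsymbol{\eta}_t \equiv (\eta_t(s_1), \ldots, \eta_t(s_{M_t}))'$, $t = 1, 2, \ldots, 10$, and on the
vector $\boldsymbol{\gamma}^{(\theta)} = (\gamma_1^{(\theta)}, \ldots, \gamma_5^{(\theta)})$, the process $\boldsymbol{\theta}$ is independent in space and time, so that:
$$p(\boldsymbol{\theta} \mid \boldsymbol{\eta}, \boldsymbol{\gamma}^{(\theta)}) = \prod_{t=1}^{10}\prod_{i=1}^{M_t} p(\theta(s_i, t) \mid \eta_t(s_i), \boldsymbol{\gamma}^{(\theta)}) \qquad (18)$$
The process $\theta$ is linked to $\boldsymbol{\gamma}^{(\theta)}$ and $\eta_t(s_i)$ through a function presented in Subsection 3.2.2.
* Assumption 5
Conditional on the hidden spatial process $\boldsymbol{\eta}_t \equiv (\eta_t(s_1), \ldots, \eta_t(s_{M_t}))'$, $t = 1, 2, \ldots, 10$, and on the
vector $\boldsymbol{\gamma}^{(\omega)} = (\gamma_1^{(\omega)}, \ldots, \gamma_5^{(\omega)})$, the process $\boldsymbol{\omega}$ is independent of $\delta$ and is independent in space and
time, so that:
$$p(\delta, \boldsymbol{\omega}) = p(\delta)\,p(\boldsymbol{\omega} \mid \boldsymbol{\eta}, \boldsymbol{\gamma}^{(\omega)}) = p(\delta)\prod_{t=1}^{10}\prod_{i=1}^{M_t} p(\omega(s_i, t) \mid \eta_t(s_i), \boldsymbol{\gamma}^{(\omega)}) \qquad (19)$$
The process $\boldsymbol{\omega}$ is linked to $\boldsymbol{\gamma}^{(\omega)}$ and $\eta_t(s_i)$ through a function presented in Subsection 3.2.2.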
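* Assumption 6
The hidden process $\boldsymbol{\eta}_t(s)$, $t = 1, 2, \ldots, 10$, is assumed to be a stationary Gaussian random field and,
conditional on $(\tau_\eta^2, \phi)$, is independent in time with
$$E[\boldsymbol{\eta}_t(s)] = 0, \quad \text{for all } t \text{ and } s \qquad (20)$$
and
$$\mathrm{Cov}[\eta_t(s), \eta_t(s')] = \sigma_\eta^2\,\rho(|s - s'|) = \frac{1}{\tau_\eta^2}\exp(-\phi\,|s - s'|) \qquad (21)$$
for all $s$, $s'$, and $t$.
Also, $\boldsymbol{\eta}_t$ is assumed to be independent of $\boldsymbol{\gamma}^{(\theta)}$ and $\boldsymbol{\gamma}^{(\omega)}$, so that
$$p(\boldsymbol{\eta}, \boldsymbol{\gamma}^{(\theta)}) = p(\boldsymbol{\eta})\,p(\boldsymbol{\gamma}^{(\theta)}) = \prod_{t=1}^{10} p(\boldsymbol{\eta}_t)\,p(\boldsymbol{\gamma}^{(\theta)}) \qquad (22)$$
and
$$p(\boldsymbol{\eta}, \boldsymbol{\gamma}^{(\omega)}) = p(\boldsymbol{\eta})\,p(\boldsymbol{\gamma}^{(\omega)}) = \prod_{t=1}^{10} p(\boldsymbol{\eta}_t)\,p(\boldsymbol{\gamma}^{(\omega)}) \qquad (23)$$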
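* Assumption 7
Conditional on $\tau^2_{\gamma^{(\theta)}}$ and $\tau^2_{\gamma^{(\omega)}}$, the vectors $\boldsymbol{\gamma}^{(\theta)} = (\gamma_1^{(\theta)}, \ldots, \gamma_5^{(\theta)})$ and $\boldsymbol{\gamma}^{(\omega)} = (\gamma_1^{(\omega)}, \ldots, \gamma_5^{(\omega)})$ are
assumed to be independent, so that
$$p(\boldsymbol{\gamma}^{(\theta)} \mid \tau_\gamma^2) = \prod_{k=1}^{5} p\!\left(\gamma_k^{(\theta)} \mid \tau^2_{\gamma^{(\theta)}}\right) p\!\left(\tau^2_{\gamma^{(\theta)}}\right) \qquad (24)$$
and
$$p(\boldsymbol{\gamma}^{(\omega)} \mid \tau_\gamma^2) = \prod_{k=1}^{5} p\!\left(\gamma_k^{(\omega)} \mid \tau^2_{\gamma^{(\omega)}}\right) p\!\left(\tau^2_{\gamma^{(\omega)}}\right) \qquad (25)$$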
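Finally, taking into account Assumptions 1–7, the likelihood can be written in the following
form:
$$\begin{aligned}
L &= p(Z, R) = p(Z \mid R)\,p(R) \\
&= \prod_{t=1}^{10}\Bigg\{\prod_{i=1}^{M_t}\Big\{\big[\theta(s_i, t)\,p(z(s_i, t) \mid \lambda(s_i, t))\big]^{r(s_i, t)}\,(1 - \theta(s_i, t))^{1 - r(s_i, t)} \\
&\qquad \times\, p(\lambda(s_i, t) \mid \delta, \omega(s_i, t))\;p\big(\omega(s_i, t) \mid \eta_t(s_i), \boldsymbol{\gamma}^{(\omega)}\big)\Big\}\;p\big(\boldsymbol{\eta}_t \mid \tau_\eta^2, \phi\big)\Bigg\} \\
&\qquad \times \prod_{k=1}^{5} p\big(\gamma_k^{(\theta)} \mid \tau_\gamma^2\big)\prod_{k=1}^{5} p\big(\gamma_k^{(\omega)} \mid \tau_\gamma^2\big)\;p\big(\tau_\gamma^2\big)\,p\big(\tau_\eta^2\big)\,p(\phi)\,p(\delta)
\end{aligned} \qquad (26)$$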
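where $\theta(s_i, t) = p(r(s_i, t) = 1 \mid \boldsymbol{\gamma}^{(\theta)}, \boldsymbol{\eta}_t)$. Note that, in the model, structural zeros are modeled by a
Bernoulli variable, whereas population heterogeneity is modeled through a gamma family for
$\lambda(s_i, t)$:
$$f(\lambda(s_i, t)) = \frac{\beta(s_i, t)^{\alpha}}{\Gamma(\alpha)}\,\lambda(s_i, t)^{\alpha - 1}\exp(-\beta(s_i, t)\,\lambda(s_i, t)) \qquad (27)$$
so that
$$E[\lambda(s_i, t)] = \frac{\alpha}{\beta(s_i, t)} \quad \text{and} \quad \mathrm{Var}[\lambda(s_i, t)] = \frac{\alpha}{\beta^2(s_i, t)}$$
for the Poisson part of the model.
Therefore, the unconditional marginal density of $Z(s_i, t) \mid R(s_i, t) = 1$ is:
$$\begin{aligned}
p(Z(s_i, t)) &= \int_0^{\infty} p(Z(s_i, t) \mid \lambda(s_i, t))\,f(\lambda(s_i, t))\,d\lambda(s_i, t) \\
&= \frac{\Gamma(z(s_i, t) + \alpha)}{\Gamma(z(s_i, t) + 1)\,\Gamma(\alpha)}\left(\frac{\beta(s_i, t)}{1 + \beta(s_i, t)}\right)^{\alpha}\left(\frac{1}{1 + \beta(s_i, t)}\right)^{z(s_i, t)} \\
&= \frac{\Gamma(z(s_i, t) + 1/\delta)}{\Gamma(z(s_i, t) + 1)\,\Gamma(1/\delta)}\,(1 + \delta\,\omega(s_i, t))^{-1/\delta}\left(1 + \frac{\omega^{-1}(s_i, t)}{\delta}\right)^{-z(s_i, t)}
\end{aligned} \qquad (28)$$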
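Hence, the unconditional complete model is
$$Z(s_i, t) \sim \begin{cases} \theta(s_i, t)\,\dfrac{\Gamma(z(s_i, t) + 1/\delta)}{\Gamma(z(s_i, t) + 1)\,\Gamma(1/\delta)}\,(1 + \delta\,\omega(s_i, t))^{-1/\delta}\left(1 + \dfrac{\omega^{-1}(s_i, t)}{\delta}\right)^{-z(s_i, t)}, & z(s_i, t) > 0 \\[2ex] (1 - \theta(s_i, t)) + \theta(s_i, t)\,(1 + \delta\,\omega(s_i, t))^{-1/\delta}, & z(s_i, t) = 0 \end{cases}$$
where
$$E[Z(s_i, t)] = \theta(s_i, t)\,\omega(s_i, t)$$
and
$$\mathrm{Var}[Z(s_i, t)] = \theta(s_i, t)\,\omega(s_i, t) + (1 - \theta(s_i, t))\,\theta(s_i, t)\,\omega^2(s_i, t) + \delta\,\theta(s_i, t)\,\omega^2(s_i, t).$$
This model is clearly over-dispersed.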
3.2.2. Link functions. From Subsection 3.2.1 we see that the process $\omega$, the unconditional mean value of
$Z(s_i, t)$, is linked to $\boldsymbol{\gamma}^{(\omega)}$ and $\eta$ via the log-link function:
$$\log(\omega(s_i, t)) = \boldsymbol{\gamma}^{(\omega)} X(s_i, t) + \eta_t(s_i) \qquad (29)$$
where $\boldsymbol{\gamma}^{(\omega)}$ is a vector of regression coefficients, $\gamma_k^{(\omega)}$, $k = 1, 2, \ldots, 5$, for the $k$th covariate
$X_k(s_i, t)$.
The Bernoulli variable is linked to $\boldsymbol{\gamma}^{(\theta)}$ and $\eta$ via the logit function:
$$\mathrm{logit}(\theta(s_i, t)) = \log\!\left(\frac{\theta(s_i, t)}{1 - \theta(s_i, t)}\right) = \boldsymbol{\gamma}^{(\theta)} X(s_i, t) + \eta_t(s_i) \qquad (30)$$
where $\boldsymbol{\gamma}^{(\theta)}$ is a vector of regression coefficients, $\gamma_k^{(\theta)}$, $k = 1, 2, \ldots, 5$, for the $k$th covariate $X_k(s_i, t)$.
Note that the $\boldsymbol{\eta}_t$ random field induces spatial structure in the $\omega$ and $\theta$ processes, and thus in the observed
counts. In that sense, maps of $\boldsymbol{\eta}_t$ may be interesting and lead to greater understanding of the spatial
distribution of the hake recruits. For example, these processes can partially account for orographic
differences in the sea bed and coast or concentrations of fish that are attracted to specific habitats
due to food availability, protection, good conditions for reproduction, etc.
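3.2.3. Priors. For the regression coefficients $\gamma_k$, $k = 1, \ldots, 5$, associated with the covariates, we assume
independent Gaussian priors:
$$\gamma_k^{(\theta)} \sim \mathrm{Gau}\!\left(0,\ \sigma_\gamma^2 = \frac{1}{\tau_\gamma^2}\right), \qquad k = 1, \ldots, 5 \qquad (31)$$
and
$$\gamma_k^{(\omega)} \sim \mathrm{Gau}\!\left(0,\ \sigma_\gamma^2 = \frac{1}{\tau_\gamma^2}\right), \qquad k = 1, \ldots, 5 \qquad (32)$$
We assume gamma (Ga) prior distributions for all precision components:
$$\tau_\gamma^2 = \frac{1}{\sigma_\gamma^2} \sim \mathrm{Ga}\!\left(\alpha_1^{(\tau_\gamma^2)},\ \alpha_2^{(\tau_\gamma^2)}\right) \qquad (33)$$
$$\tau_\eta^2 = \frac{1}{\sigma_\eta^2} \sim \mathrm{Ga}\!\left(\alpha_1^{(\tau_\eta^2)},\ \alpha_2^{(\tau_\eta^2)}\right) \qquad (34)$$
where the hyperparameters $\alpha_1^{(\cdot)}$ and $\alpha_2^{(\cdot)}$ are specified as given in Table 3.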
The parameter $\phi$ controls the rate of decline of correlation as a function of distance. It is often the
case that, in models with spatial parameters deep in the hierarchy (and thus relatively far from the data), the
MCMC implementation has difficulty converging, and a tradeoff between the spatial process
precision $\tau_\eta^2$ and the spatial dependence parameter $\phi$ is needed. One possible strategy to improve
convergence is to specify a uniform prior for the inverse of the parameter:
$$\phi = \frac{1}{\phi'}$$
with bounds corresponding to a correlation of 0.05 at a distance equal to half of the maximum
distance ($d^{(\max)}$) and a correlation of 0.99 at a distance equal to the minimum distance ($d^{(\min)}$) between
any pairs in the study (see the WinBUGS code for details):
$$\phi' \sim \mathrm{Unif}\!\left(\frac{d^{(\max)}}{-\log(0.05)},\ \frac{d^{(\min)}}{-\log(0.99)}\right) \qquad (35)$$
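The two bounds in Equation (35) follow from solving exp(-d / phi') = c for phi' at the stated distances and correlations. A minimal Python sketch of that calculation is given below. The function name `phi_prime_bounds` and the coordinates are hypothetical, and the code assumes that d(max) denotes half of the largest pairwise distance, which is one reading of the text.

```python
import math
from itertools import combinations

def phi_prime_bounds(coords):
    """Return the two bounds of Equation (35) for phi' = 1 / phi."""
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    d_max = max(dists) / 2.0     # assumption: d(max) is half the maximum distance
    d_min = min(dists)
    bound_05 = d_max / -math.log(0.05)   # correlation 0.05 at distance d_max
    bound_99 = d_min / -math.log(0.99)   # correlation 0.99 at distance d_min
    return bound_05, bound_99

# Hypothetical locations in transformed coordinates.
coords = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
bound_05, bound_99 = phi_prime_bounds(coords)
```

Which of the two values is smaller depends on the geometry of the sampling locations, so an implementation would need to check their order before passing them to a uniform prior.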
Finally, the parameter $\delta$ is assumed to be
$$\delta \sim \mathrm{Ga}\!\left(\alpha_1^{(\delta)},\ \alpha_2^{(\delta)}\right) \qquad (36)$$
where the hyperparameters $\alpha_1^{(\delta)}$ and $\alpha_2^{(\delta)}$ are specified as given in Table 3. The hyperparameter
specifications correspond to rather vague proper priors in order to let the data speak for themselves.
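Given the hierarchical representation presented above and using the same notation as in Equation
(26), one can evaluate the posterior distribution of all of the processes and parameters, given the
observed counts: for $T = 10$,
$$\begin{aligned}
&P\big(\delta, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_T, \boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_T, \boldsymbol{\eta}_1, \ldots, \boldsymbol{\eta}_T, \boldsymbol{\gamma}^{(\theta)}, \boldsymbol{\gamma}^{(\omega)}, \tau_\gamma^2, \tau_\eta^2, \phi \mid \mathbf{z}\big) \\
&\quad \propto \prod_{t=1}^{10}\Bigg\{\prod_{i=1}^{M_t}\Big\{\big[\theta(s_i, t)\,p(z(s_i, t) \mid \lambda(s_i, t))\big]^{r(s_i, t)}\,(1 - \theta(s_i, t))^{1 - r(s_i, t)} \\
&\qquad \times\, p(\lambda(s_i, t) \mid \delta, \omega(s_i, t))\;p\big(\omega(s_i, t) \mid \eta_t(s_i), \boldsymbol{\gamma}^{(\omega)}\big)\Big\}\;p\big(\boldsymbol{\eta}_t \mid \tau_\eta^2, \phi\big)\Bigg\} \\
&\qquad \times \prod_{k=1}^{5} p\big(\gamma_k^{(\theta)} \mid \tau_\gamma^2\big)\prod_{k=1}^{5} p\big(\gamma_k^{(\omega)} \mid \tau_\gamma^2\big)\;p\big(\tau_\gamma^2\big)\,p\big(\tau_\eta^2\big)\,p(\phi)\,p(\delta)
\end{aligned} \qquad (37)$$
where $\mathbf{z}$ represents the observed counts and $\boldsymbol{\theta}_t$, $\boldsymbol{\omega}_t$, $\boldsymbol{\eta}_t$ are the $M_t \times 1$ vectorizations of the processes $\theta$,
$\omega$, and $\eta$ at observation locations, respectively.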
One cannot evaluate this posterior distribution analytically and must resort to numeric simulation
methods. We use the special case of MCMC known as Gibbs sampling (see, e.g., Gilks et al., 1996).
3.3. Inference, prediction and model fitting evaluation
3.3.1. Model fitting evaluation. To evaluate how well the model fits the data we use two different
methods, both based on the posterior predictive distribution. The first method, suggested by Gelman
and Meng (1996), uses discrepancy variables built from the posterior predictive distribution to measure the
discrepancy between model and data. The method is as follows. Let $\mathbf{z}$ be the observed data, $\boldsymbol{\theta}$ be the
vector of unknown parameters in the model, $p(\mathbf{z} \mid \boldsymbol{\theta})$ be the likelihood and $p(\boldsymbol{\theta} \mid \mathbf{z})$ be the posterior
distribution. We assume that we have already obtained draws $\boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(2)}, \ldots, \boldsymbol{\theta}^{(N)}$ from the posterior
distribution using Markov chain simulation. We now simulate $N$ hypothetical replications of the data,
which we label $\mathbf{z}^{rep}_1, \mathbf{z}^{rep}_2, \ldots, \mathbf{z}^{rep}_N$, where $\mathbf{z}^{rep}_i$ is drawn from the sampling distribution of $\mathbf{z}$ given the
simulated parameters $\boldsymbol{\theta}^{(i)}$. Thus $\mathbf{z}^{rep}$ has distribution $p(\mathbf{z}^{rep} \mid \mathbf{z}) = \int p(\mathbf{z}^{rep} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta} \mid \mathbf{z})\,d\boldsymbol{\theta}$. Creating
simulations $\mathbf{z}^{rep}$ adapts the old idea of comparing data to simulations from a model, with the Bayesian twist
that the parameters of the model are themselves drawn from their posterior distribution.
If the model is reasonably accurate, the hypothetical replications should look similar to the
observed data $\mathbf{z}$. Formally, one can compare the data to the predictive distribution by first choosing a
discrepancy variable $T(\mathbf{z}, \boldsymbol{\theta})$ which will have an extreme value if the data $\mathbf{z}$ are in conflict with the
posited model. Then a p-value can be estimated by calculating the proportion of cases in which the
simulated discrepancy variable exceeds the realized value:
$$\text{estimated p-value} = \frac{1}{N}\sum_{i=1}^{N} I\big(T(\mathbf{z}^{rep}_i, \boldsymbol{\theta}^{(i)}) \geq T(\mathbf{z}, \boldsymbol{\theta}^{(i)})\big)$$
where $I(\cdot)$ is the indicator function which takes the value 1 when its argument is true and 0 otherwise.
Ideally, in a well fitted model, the p-value should be close to 0.5. The model fitting evaluation can be
made graphically, as well. We can draw a scatter plot of the realized values $T(\mathbf{z}, \boldsymbol{\theta}^{(i)})$ versus the
predictive values $T(\mathbf{z}^{rep}_i, \boldsymbol{\theta}^{(i)})$ on the same scale. A good fit would be indicated by about half the points
in the scatter plot falling above the 45° line and half falling below.
To evaluate the model we used as discrepancy variables:
$$T_1(\mathbf{z}, \boldsymbol{\theta}) = \sqrt{\sum_{i=1}^{n}\big(z_i - E[Z_i \mid \boldsymbol{\theta}]\big)^2} \qquad (38)$$
and
$$T_2(\mathbf{z}, \boldsymbol{\theta}) = \sqrt{\sum_{i=1}^{n}\frac{\big(z_i - E[Z_i \mid \boldsymbol{\theta}]\big)^2}{\sqrt{\mathrm{Var}[Z_i \mid \boldsymbol{\theta}]}}} \qquad (39)$$
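The discrepancy variables and the estimated p-value can be sketched in Python as follows. This is an illustration of the calculation only: the function names are hypothetical, and the toy lists stand in for the realized and replicated discrepancies that would be computed from actual posterior draws.

```python
import math

def t1(z, expected):
    """Discrepancy T1 of Equation (38): root of the summed squared deviations."""
    return math.sqrt(sum((zi - ei) ** 2 for zi, ei in zip(z, expected)))

def t2(z, expected, variance):
    """Discrepancy T2 of Equation (39): squared deviations scaled by sqrt(variance)."""
    return math.sqrt(sum((zi - ei) ** 2 / math.sqrt(vi)
                         for zi, ei, vi in zip(z, expected, variance)))

def posterior_predictive_pvalue(realized, replicated):
    """Share of posterior draws whose replicated discrepancy is >= the realized one."""
    hits = sum(1 for t_obs, t_rep in zip(realized, replicated) if t_rep >= t_obs)
    return hits / len(realized)

# Toy discrepancies for four posterior draws (made-up numbers, not model output).
realized = [2.0, 3.0, 1.5, 4.0]
replicated = [2.5, 2.0, 1.5, 5.0]
p_value = posterior_predictive_pvalue(realized, replicated)
```

In practice each entry of `realized` would be T(z, theta_i) and each entry of `replicated` would be T(z_rep_i, theta_i), evaluated with the same posterior draw theta_i.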
The second method relies on the posterior predictive distribution in that the simulated values,
$\mathbf{z}^{rep}$, of the posterior predictive distribution are used to approximate the characteristics of that
distribution, $P(\mathbf{z}^{rep} \mid \mathbf{z}) = \int p(\mathbf{z}^{rep} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta} \mid \mathbf{z})\,d\boldsymbol{\theta}$, namely the expected value and standard deviation.
These values can be compared with the observed data in order to assess the prediction quality of
the model.
It is also of interest to know how the model predicts observations not included in the data set used for
inference. We therefore also present results, based on the posterior predictive distribution, which
show the predicted values for observations not included in the data set used for inference, that is
$p(\mathbf{z}^{*} \mid \mathbf{z}) = \int p(\mathbf{z}^{*} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta} \mid \mathbf{z})\,d\boldsymbol{\theta}$, where $\mathbf{z}^{*}$ denotes the observations not included in $\mathbf{z}$.
4. RESULTS
The results presented refer only to the years 1990 and 1997, those with the highest and lowest
sampling effort, respectively, and are used as examples of the kind of analysis this method allows. Note that for
validation we randomly selected 20 observations that were left out of the inference process (see
Table 6). Figure 2 shows the spatial prediction grid over the Portuguese north continental shelf used for
prediction results.
The model given in the above sections is implemented using the software WinBUGS and its
add-in GeoBUGS. The software and all its documentation can be downloaded from:
http://www.mrc.bsu.cam.ac.uk/bugs/winbugs/contents.shtml
The computer code is given in Appendix A.
Table 2 shows the predicted p-values and Figure 3 shows the dot clouds of the values of $T_1(\mathbf{z}, \boldsymbol{\theta})$
versus $T_1(\mathbf{z}^{rep}, \boldsymbol{\theta})$ and $T_2(\mathbf{z}, \boldsymbol{\theta})$ versus $T_2(\mathbf{z}^{rep}, \boldsymbol{\theta})$ (see Subsection 3.3.1 for explanation of these
measures) for the model presented.
We can conclude that the model fits relatively well, and this is reflected in relatively good predictions
(see Tables 6 and 4). However, Figure 3(a) can be misleading: the almost perfect distribution of the points
along the 45° line is due to the scale of the plot. Figure 4(b) zooms in on part of Figure 3(a) to
highlight this point. After zooming, it is clear that the points are indeed distributed ‘around the
45° line’ and not only on it. In addition, we plot the minimum, mean,
and maximum of the 1000 replicates generated by the predictive distribution (Figure 4(a)) for 20
randomly chosen observation locations. In this figure, as demonstrated by the range of the box plots,
the large variability generated by the model is very clear, although it is in line with the data.
Table 3 shows summary measures of the marginal posterior of the parameters of interest obtained
via Gibbs sampling.
Figure 2. Grid of north Portuguese Continental shelf used for prediction
Table 2. Model fitting quality
T1 T2
0.53 0.47
Some comments should be made about several parameters of the model. First, we note that the
value of the spatial dependence parameter, $\phi$, is low. Taking jointly into account the posterior mean of
the precision parameter $\tau_\eta^2$, we conclude that spatial dependence has a low decay in the processes. On
the other hand, the posterior standard deviations of the $\gamma^{(\cdot)}$'s are small by comparison with the square root of the
posterior mean of $\sigma_\eta^2 = 1/\tau_\eta^2$, confirming that the spatial process $\eta$ has a more important role in explaining
the spatial variability present in the data than the covariates used in the analysis. The covariates used in
the analysis are important for explaining the $\boldsymbol{\omega}$ and $\boldsymbol{\theta}$ processes. On the other hand, the posterior mean
of the dispersion parameter $\delta$ is relatively far from zero, which means that the model is clearly
over-dispersed.
Figure 3. Model fitting quality measured through discrepancy variables (ZIDSP model): (a) $T_1(y, \theta)$ versus $T_1(y^{rep}, \theta)$ and (b) $T_2(y, \theta)$ versus $T_2(y^{rep}, \theta)$

It is possible to draw random samples of the multivariate Gaussian random field $\tilde{\eta}_t$ after
convergence has been achieved. This reduces to direct simulation from the multivariate Gaussian
distribution

$$p(\tilde{\eta}_t \mid \eta_t, \phi, \tau^2_\eta)$$

where $\tilde{\eta}_t \equiv (\eta(\tilde{s}_1), \ldots, \eta(\tilde{s}_{\tilde{S}}))$ is the set of values of $\eta(s)$ at the locations $\tilde{s}$ for which predictions are
required. From assumption 4 in Subsection 4.2.1, it follows that

$$p(\tilde{\eta}_t \mid \eta_t, \phi, \tau^2_\eta) = \mathrm{MVG}\left( \Sigma_{\eta\tilde{\eta}}^{T} \Sigma_{\eta\eta}^{-1} \eta_t,\; \Sigma_{\tilde{\eta}\tilde{\eta}} - \Sigma_{\eta\tilde{\eta}}^{T} \Sigma_{\eta\eta}^{-1} \Sigma_{\eta\tilde{\eta}} \right) \qquad (40)$$

where $\Sigma_{\eta\eta} = \mathrm{Var}[\eta]$, $\Sigma_{\eta\tilde{\eta}} = \mathrm{Cov}[\eta, \tilde{\eta}]$, and $\Sigma_{\tilde{\eta}\tilde{\eta}} = \mathrm{Var}[\tilde{\eta}]$.

Figure 5 shows the posterior mean and posterior standard deviation for the $\eta_t$-process for 1990.
One might examine these maps over the years to identify possible habitat covariates that are
represented by the spatial random field. Figure 6 shows the posterior mean and posterior standard
deviation for 1997, a year with few, sparsely distributed observations (see Figure 1).
Clearly, the posterior standard deviation of the process is higher in regions where there are no
observations.
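The conditional simulation in Equation (40) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes an exponential covariance of the form exp(-phi * d) / tau2 (our reading of the `spatial.exp` prior with kappa = 1), uses hypothetical coordinates in arbitrary distance units, and plugs in the posterior means of $\phi$ and $\tau^2_\eta$ from Table 3 instead of looping over MCMC draws. The function names `exp_cov` and `conditional_gaussian` are ours.

```python
import numpy as np

def exp_cov(a, b, phi, tau2):
    """Covariance exp(-phi * d) / tau2 between two sets of 2-D points (assumed form)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-phi * d) / tau2

def conditional_gaussian(eta_obs, s_obs, s_new, phi, tau2):
    """Mean and covariance of eta at s_new given eta_obs at s_obs (zero prior mean)."""
    sig_oo = exp_cov(s_obs, s_obs, phi, tau2)   # Var[eta]
    sig_on = exp_cov(s_obs, s_new, phi, tau2)   # Cov[eta, eta_tilde]
    sig_nn = exp_cov(s_new, s_new, phi, tau2)   # Var[eta_tilde]
    w = np.linalg.solve(sig_oo, sig_on)         # Sigma_oo^{-1} Sigma_on
    mean = w.T @ eta_obs
    cov = sig_nn - sig_on.T @ w
    return mean, (cov + cov.T) / 2.0            # symmetrise against round-off

rng = np.random.default_rng(0)
s_obs = rng.uniform(0.0, 1.0, size=(15, 2))     # hypothetical observed locations
s_new = np.array([[0.5, 0.5], [50.0, 50.0]])    # one nearby and one remote prediction point
phi, tau2 = 3.1358, 0.1794                      # posterior means taken from Table 3
eta_obs = rng.multivariate_normal(np.zeros(15), exp_cov(s_obs, s_obs, phi, tau2))

mean, cov = conditional_gaussian(eta_obs, s_obs, s_new, phi, tau2)
draw = rng.multivariate_normal(mean, cov)       # one simulated value of eta_tilde
```

At the remote point the conditional mean falls back to the prior mean of zero and the variance to 1/tau2, which mirrors the larger posterior standard deviation seen in regions without observations. In a full analysis the same computation would be repeated for each retained MCMC draw of $\eta_t$, $\phi$ and $\tau^2_\eta$.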
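Table 3. Gibbs sampling results

| Parameter | Prior distribution | Prior mean | Prior std | MCMC posterior mean (std) | 2.5% | Median | 97.5% | Convergence monitor (R) |
|---|---|---|---|---|---|---|---|---|
| $\alpha$ | Ga(8, 1.5) | 5.33 | 1.89 | 1.3419 (0.335) | 0.744 | 1.333 | 2.038 | 1.014 |
| $\gamma_{\omega 1}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | 0.0319 (0.172) | -0.320 | 0.035 | 0.368 | 1.021 |
| $\gamma_{\omega 2}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | -0.1771 (0.236) | -0.686 | -0.157 | 0.220 | 1.001 |
| $\gamma_{\omega 3}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | -0.1052 (0.16) | -0.427 | -0.103 | 0.216 | 1.000 |
| $\gamma_{\omega 4}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | 0.4921 (0.164) | 0.176 | 0.495 | 0.811 | 1.016 |
| $\gamma_{\omega 5}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | -0.1591 (0.252) | -0.700 | -0.147 | 0.301 | 0.998 |
| $\gamma_{\theta 1}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | -0.0592 (0.232) | -0.573 | -0.051 | 0.377 | 1.017 |
| $\gamma_{\theta 2}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | -0.3956 (0.388) | -1.332 | -0.330 | 0.220 | 1.002 |
| $\gamma_{\theta 3}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | -0.1977 (0.281) | -0.810 | -0.181 | 0.289 | 0.999 |
| $\gamma_{\theta 4}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | 0.3065 (0.248) | -0.104 | 0.286 | 0.869 | 1.003 |
| $\gamma_{\theta 5}$ | Gau(0, $\tau^2_\gamma$) | 0 | - | -0.1342 (0.284) | -0.737 | -0.124 | 0.391 | 1.001 |
| $\phi$ | - | - | - | 3.1358 (0.506) | 2.645 | 2.982 | 4.571 | 1.001 |
| $\tau^2_\gamma$ | Ga(0.01, 0.01) | 1 | 10 | 13.9556 (16.154) | 1.824 | 9.104 | 58.613 | 1.015 |
| $\tau^2_\eta$ | Ga(2, 10) | 0.2 | 0.1414 | 0.1794 (0.032) | 0.124 | 0.177 | 0.252 | 0.998 |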
Figure 4. Model variability checking. (a) Minimum, mean and maximum of 1000 predictive distribution replicates for 20
observation counts. The x-axis (observed) values are 1, 19, 8, 1, 0, 0, 1, 95, 0, 16, 2, 68, 7, 1, 0, 15, 46, 0, 0, 0. (b) Zoom of
panel (a)
Although the zero-inflated probability process, $\theta$, is considered more as a ‘nuisance’ process in the
analysis, it is instructive to examine the associated parameters. First, we note from the comparison of
the posterior standard deviations of $\gamma_{\theta}$ and $\tau^2_\eta$ (Table 3) that the spatial process is particularly
important in explaining this process. Table 5 shows the posterior mean and respective credible
intervals for $\theta$ at 20 randomly selected observations not included in the inference. It is clear that, although
the posterior mean is above 0.5 at 18 of the 20 locations, the credible intervals show that this process has a great degree of
variability as well, reflecting the variability of the data itself. A priori, it was not easy to establish a
relationship between the processes $\theta$ and $\lambda$. Nevertheless, we can imagine that they are somehow related.
Figure 7 shows the relation between process $\theta$, process $\lambda$ and the counts Z. It is clear that there exists a
well-defined non-linear relationship, which could help in future research as a means of establishing new
models for $\theta$ so that we can reduce the variability and uncertainty about it.
Figures 8 and 9 show the posterior mean and standard deviation of Z for 1990 and 1997. As
expected, the variability is very high.
Figure 10 shows the posterior mean of $\theta$ for 1990 and 1997. The results show clearly that the
probability of a structural zero is low.
Finally, Figures 11 and 12 show the posterior mean and posterior standard deviation of the
$\lambda$-process at 1990 and 1997. These plots show clearly that the posterior standard deviation is not
proportional to the posterior mean, as would be expected with a simple Poisson count data model.
Table 4. Predicted values of $\lambda(s_i, t)$ and respective 95% credible intervals for 20 randomly selected observations*

| [location, t] | Sample mean | 95% HPD lower bound | 95% HPD upper bound |
|---|---|---|---|
| [s4, 1] | 5.817 | 0.0000 | 17.31 |
| [s5, 1] | 8.477 | 0.0000 | 31.33 |
| [s9, 1] | 13.94 | 0.0000 | 63.15 |
| [s13, 1] | 6.217 | 0.0000 | 25.41 |
| [s28, 1] | 102.7 | 0.0002 | 399.90 |
| [s38, 1] | 41.33 | 0.0000 | 98.95 |
| [s1, 2] | 108.2 | 0.0010 | 435.30 |
| [s16, 2] | 52.27 | 0.0001 | 238.80 |
| [s23, 2] | 110.6 | 0.0002 | 359.90 |
| [s34, 2] | 17.92 | 0.0000 | 76.90 |
| [s14, 3] | 142 | 0.0002 | 469.00 |
| [s4, 4] | 52.61 | 0.0000 | 209.10 |
| [s11, 4] | 70.98 | 0.0000 | 213.50 |
| [s18, 4] | 6.627 | 0.0000 | 23.81 |
| [s6, 6] | 12.07 | 0.0000 | 72.24 |
| [s17, 6] | 3.45 | 0.0003 | 16.57 |
| [s18, 9] | 9.119 | 0.0000 | 39.25 |
| [s24, 9] | 328.6 | 0.0062 | 1653.00 |
| [s2, 10] | 10.64 | 0.0000 | 39.69 |
| [s18, 10] | 23.27 | 0.0008 | 122.30 |

*Not included in the inference.

Table 6 shows the prediction results for those 20 randomly selected observations not included in the
inference. The table shows the observed value, the mean of the predictive distribution along with the
sample mean of the predicted values of the predictive distribution, and the corresponding 95% credible
intervals. As we can see, all of the credible intervals contain the observed value. Moreover, the lower limits of
the predicted intervals are all equal to zero, which means the model is able to generate a large number of
zero values, reproducing the huge variability of the data.
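The 95% HPD bounds reported in the prediction tables can be obtained from MCMC output as the shortest interval containing 95% of the draws. The sketch below is one common sample-based way of doing this, not necessarily the routine used for the paper; the function name `hpd_interval` and the toy predictive sample are assumptions made for illustration.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the sampled values."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))              # number of draws the interval must cover
    widths = x[k - 1:] - x[: n - k + 1]     # width of every window of k ordered draws
    i = int(np.argmin(widths))              # first (leftmost) shortest window
    return x[i], x[i + k - 1]

# Toy predictive sample: half the draws are zero, the rest are spread over 1..50.
toy_draws = [0] * 50 + list(range(1, 51))
lower, upper = hpd_interval(toy_draws)

observed = 1
covered = lower <= observed <= upper        # the check summarised in the 'Status' column
```

For right-skewed predictive draws with a large mass at zero, the shortest window starts at zero, which is consistent with the zero lower bounds reported for the predicted counts.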
5. DISCUSSION
We proposed a zero-inflated doubly stochastic Poisson model for the hake recruitment data and
implemented it within a Bayesian hierarchical modeling scheme. The benefits of this
methodology are twofold: first, complex spatial dependence structures as well as covariates are
introduced in the model to explain the spatial variation and over-dispersion that exist in the data; second,
by simulating large data sets using Markov Chain Monte Carlo (MCMC) methods, we are able to
assess credible intervals for abundance estimates at any point in space, as well as for the estimate of
total abundance in a given area.
An important issue is how much of the output from the analysis derives from the data and how
much from prior modeling. All the prior specifications correspond to vague priors in order to capture
as much of the signal from the data as possible. Sensitivity analysis was carried out on some key
parameters, and the resulting posterior distributions were insensitive to the priors used in modeling.
Figure 5. Posterior mean and posterior standard deviation of the $\eta$ process at 1990

The combined spatial analysis of the $\lambda$ and $\theta$ processes (see Tables 4 and 5) can be very helpful in
the definition of marine protected areas. It would allow the usual definition of areas with higher
abundance, by inspection of $\lambda$, but it would also provide maps of the probability of finding fish,
which might be seen as an indicator of the effectiveness of stock protection. For example, it would be
more important to protect an area with a high probability of finding fish and a mean abundance index
than to close an area with a maximum observation but with a low probability of finding fish over the
years. Analysis of these processes will be considered for future work.
The relation found between $\lambda$ and $\theta$ gives some hints for future work; in particular, it shows
that the two processes are not independent, so their correlation can be used for modeling.
The abundance index is traditionally calculated as the sample average of the counts.
Confidence limits for the abundance index are added assuming that the index is Gaussian and the
data are spatially independent. As expected, abundance index estimates in any year based on
empirical averages and their sampling variations underestimate the true values (Figure 13). This is
due, in part, to the vast number of zeros present in the data set which do not seem to be structural
zeros, as we can see from the distribution of the $\theta$ process. It is important to note that, although both
methods present similar trends in the recruitment index, the relative differences among the years shift.
Low values tend to be estimated closer together than high values, so the contrast among the
recruitment estimates of different years is clearer with the model-based method. This feature of the model may
have an impact on the perception of the stock, in particular on the identification of strong year classes, as is
the case for 1996.
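The traditional index described above amounts to a sample mean with a Gaussian confidence interval computed under independence. As a small illustration only, the sketch below applies that recipe to the 20 observed counts listed in the caption of Figure 4; these counts are reused here as example data and do not correspond to a yearly abundance index from the paper. The function name `sample_mean_ci` is ours.

```python
import math

# The 20 observed counts listed in the caption of Figure 4 (illustrative data only).
counts = [1, 19, 8, 1, 0, 0, 1, 95, 0, 16, 2, 68, 7, 1, 0, 15, 46, 0, 0, 0]

def sample_mean_ci(values, z=1.96):
    """Sample mean and Gaussian 95% confidence limits, assuming independent data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)   # unbiased sample variance
    half_width = z * math.sqrt(var / n)
    return mean, mean - half_width, mean + half_width, var

mean, lower, upper, var = sample_mean_ci(counts)
dispersion_ratio = var / mean   # equals 1 in expectation for Poisson counts
```

For these counts the sample variance is many times larger than the mean, so a symmetric Gaussian interval around the average is a poor description of the uncertainty, which is the motivation for the model-based index.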
Figure 6. Posterior mean and posterior standard deviation of the $\eta$ process at 1997

Table 5. Predicted values of $\theta(s_i, t)$ and respective 95% credible intervals for 20 randomly selected observations*

| [location, t] | Sample mean | 95% HPD lower bound | 95% HPD upper bound |
|---|---|---|---|
| [s4, 1] | 0.6666 | 0.1601 | 0.9985 |
| [s5, 1] | 0.6618 | 0.1271 | 0.9987 |
| [s9, 1] | 0.7602 | 0.2741 | 0.9991 |
| [s13, 1] | 0.612 | 0.0914 | 0.9954 |
| [s28, 1] | 0.9157 | 0.7102 | 0.9998 |
| [s38, 1] | 0.8348 | 0.3736 | 0.9996 |
| [s1, 2] | 0.9191 | 0.6109 | 0.9999 |
| [s16, 2] | 0.904 | 0.6256 | 0.9999 |
| [s23, 2] | 0.9357 | 0.7515 | 0.9997 |
| [s34, 2] | 0.7168 | 0.1566 | 0.9987 |
| [s14, 3] | 0.9108 | 0.5356 | 1.0000 |
| [s4, 4] | 0.8687 | 0.5091 | 0.9997 |
| [s11, 4] | 0.8398 | 0.3890 | 0.9994 |
| [s18, 4] | 0.4643 | 0.0013 | 0.9584 |
| [s6, 6] | 0.6561 | 0.1058 | 0.9996 |
| [s17, 6] | 0.5666 | 0.0839 | 0.9939 |
| [s18, 9] | 0.47 | 0.0010 | 0.9679 |
| [s24, 9] | 0.9553 | 0.8042 | 1.0000 |
| [s2, 10] | 0.6866 | 0.0954 | 0.9980 |
| [s18, 10] | 0.7445 | 0.1982 | 0.9995 |

*Not included in the inference.

Figure 7. Relationship in 1990 and 1997 between the $\theta(s_i, t)$ process and the predictive counts $Z(s_i, t)$ of recruits (upper line), between the $\theta(s_i, t)$ process and the predictive $\lambda(s_i, t)$ process (middle line), and between the predictive $\lambda(s_i, t)$ process and the predictive counts $Z(s_i, t)$ of recruits (lower line)

Figure 8. Predictive mean and standard deviation of hake recruit counts at 1990

Figure 9. Predictive mean and standard deviation of hake recruit counts at 1997

Figure 10. Posterior mean and posterior standard deviation of the $\theta$ process at 1990 and 1997

Figure 11. Posterior mean and standard deviation of the $\lambda$ process at 1990

Figure 12. Posterior mean and standard deviation of the $\lambda$ process at 1997

Table 6. Observed and predicted values of hake recruits stock for 20 randomly selected observations*

| [location, t] | Observed | Mean post. distribution | Sample mean (std) | 95% HPD lower bound | 95% HPD upper bound | Status |
|---|---|---|---|---|---|---|
| [s4, 1] | 1 | 4.853 | 5.475 (26.55) | 0 | 16 | ✓ |
| [s5, 1] | 0 | 10.13 | 8.015 (47.04) | 0 | 29 | ✓ |
| [s9, 1] | 2 | 14.64 | 13.34 (57.03) | 0 | 50 | ✓ |
| [s13, 1] | 5 | 6.6 | 5.777 (32.01) | 0 | 24 | ✓ |
| [s28, 1] | 19 | 92.7 | 100.6 (272.5) | 0 | 376 | ✓ |
| [s38, 1] | 0 | 30.64 | 40.76 (397.5) | 0 | 104 | ✓ |
| [s1, 2] | 167 | 130.3 | 107 (394) | 0 | 454 | ✓ |
| [s16, 2] | 0 | 56.43 | 51.55 (187.8) | 0 | 235 | ✓ |
| [s23, 2] | 100 | 105.3 | 109.4 (416.5) | 0 | 365 | ✓ |
| [s34, 2] | 0 | 20.03 | 17.15 (142.7) | 0 | 77 | ✓ |
| [s14, 3] | 0 | 195.7 | 141.5 (540.9) | 0 | 470 | ✓ |
| [s4, 4] | 2 | 49.96 | 51.75 (299) | 0 | 212 | ✓ |
| [s11, 4] | 3 | 64.55 | 70.23 (300.8) | 0 | 210 | ✓ |
| [s18, 4] | 0 | 6.761 | 6.253 (44.74) | 0 | 23 | ✓ |
| [s6, 6] | 22 | 11.9 | 11.65 (43.65) | 0 | 71 | ✓ |
| [s17, 6] | 0 | 3.114 | 3.175 (11.28) | 0 | 15 | ✓ |
| [s18, 9] | 0 | 9.666 | 8.189 (48.21) | 0 | 37 | ✓ |
| [s24, 9] | 0 | 398.2 | 326.5 (1196) | 0 | 1681 | ✓ |
| [s2, 10] | 0 | 11.21 | 10.44 (39) | 0 | 41 | ✓ |
| [s18, 10] | 0 | 23.71 | 22.32 (103.5) | 0 | 125 | ✓ |

*Not included in the inference. In the column ‘Status’, the symbol ✓ means that the credible interval contains the observed value.

Other models were also fitted to the same data set, with less success: the Poisson model
suggested by Diggle and Tawn (1998) (which we named MBG, for short) and a doubly stochastic
Poisson model excluding zero inflation (DSP). Table 7 shows the deviance
information criterion (DIC; Spiegelhalter et al., 2002) for these models. Although it is a more
parameterized model as compared to the others, the zero-inflated doubly stochastic Poisson model
seems to perform better, given the difference in the value of DIC, which we find significant (see
column Deviance in Table 7). The value of pD is a measure of the complexity of the model, and it
represents the ‘effective number of parameters’. It is defined as the posterior mean of the deviance,
$-2\log L(z \mid \theta)$, minus the deviance evaluated at the posterior means of the parameters,
$-2\log L(z \mid \bar{\theta})$, and acts as a penalty for increasing model complexity (where L means likelihood;
see Spiegelhalter et al., 2002 for details).
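The definition of pD given above translates directly into a computation on MCMC output. The sketch below is a toy illustration under stated assumptions, not the calculation behind Table 7: it uses a single-parameter i.i.d. Poisson model with a hypothetical Ga(0.01, 0.01) prior, draws from the conjugate gamma posterior in place of a Gibbs sampler, and reuses the 20 counts from the caption of Figure 4 as example data. The function names are ours. With DIC taken as the posterior mean deviance plus pD, the ZIDSP row of Table 7 is reproduced by the same bookkeeping (907.6 + 152.69 = 1060.29).

```python
import math
import numpy as np

def poisson_deviance(z, lam):
    """-2 log L(z | lam) for i.i.d. Poisson counts with common mean lam."""
    z = np.asarray(z, dtype=float)
    log_fact = sum(math.lgamma(v + 1.0) for v in z)
    return -2.0 * (np.sum(z) * math.log(lam) - z.size * lam - log_fact)

def dic_summary(z, lam_draws):
    """Posterior mean deviance, effective number of parameters pD, and DIC."""
    d_bar = float(np.mean([poisson_deviance(z, lam) for lam in lam_draws]))
    d_hat = poisson_deviance(z, float(np.mean(lam_draws)))  # deviance at posterior mean
    p_d = d_bar - d_hat
    return d_bar, p_d, d_bar + p_d

z = [1, 19, 8, 1, 0, 0, 1, 95, 0, 16, 2, 68, 7, 1, 0, 15, 46, 0, 0, 0]
rng = np.random.default_rng(1)
# Conjugate posterior Gamma(a + sum(z), rate b + n) under a hypothetical Ga(0.01, 0.01) prior.
lam_draws = rng.gamma(shape=0.01 + sum(z), scale=1.0 / (0.01 + len(z)), size=4000)

d_bar, p_d, dic = dic_summary(z, lam_draws)
```

In this one-parameter toy model pD comes out close to 1, which is the sense in which it counts the effective number of parameters.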
Figure 13. The abundance index for Hake recruitment. (a) Abundance index obtained with our model together with credibility
intervals and the sample mean. (b) Abundance index obtained in the traditional way: sample mean of observation counts and its
95% confidence interval
Table 7. Deviance information criterion (DIC) for the three models attempted

| Model | DIC | pD | Deviance |
|---|---|---|---|
| ZIDSP | 1060.29 | 152.69 | 907.6 |
| DSP | 1094.17 | 161.36 | 932.8 |
| MBG | 1135.05 | 192.06 | 943.9 |
ACKNOWLEDGMENTS
The authors are grateful to Antonia Amaral Turkman for her precious help in computational aspects of Markov Chain Monte Carlo methods, to the coordinator of IPIMAR’s surveys, Fatima Cardador, and to two referees for their careful reading of the paper.
REFERENCES
Cohen AC. 1963. Estimation in mixtures of discrete distribution. Proceedings of International Symposium on Discrete Distributions, Montreal; 373–378.
Cox DR, Isham V. 1980. Point Processes. Chapman & Hall: London.
Cressie N. 1991. Statistics for Spatial Data. Wiley: New York.
Diggle PJ, Tawn JA. 1998. Model-based geostatistics (with discussion). Journal of Royal Statistical Society 47(3): 299–350.
Gelman A, Meng X-L. 1996. Model checking and improvement. In Markov Chain Monte Carlo in Practice. Interdisciplinary Statistics, Gilks WR, Richardson S, Spiegelhalter DJ (eds). Chapman & Hall: London.
Gilks WR, Richardson S, Spiegelhalter DJ (eds). 1996. Markov Chain Monte Carlo in Practice. Interdisciplinary Statistics. Chapman & Hall: London.
ICES. 2002. Report of the International Bottom Trawl Survey Working Group. ICES CM 2002/D:03.
Johnson NL, Kotz S. 1962. Discrete Distributions. Wiley: New York.
Lambert D. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14.
McCullagh P, Nelder JA. 1989. Generalized Linear Models (2nd ed). Chapman & Hall: London.
Saha A, Dong D. 1997. Estimating nested count data models. Oxford Bulletin of Economics and Statistics 59: 423–430.
SESITS. 1999. Evaluation of Demersal Resources of Southwestern Europe from Standardized Groundfish Surveys—Study. Final Report. DG XIV / EC, Contract 96/029.
Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. 2002. Bayesian measures of model complexity and fit (with discussion). Journal of Royal Statistical Society, Series B 64: 583–640.
APPENDIX
A. WinBUGS program
The instructions for running the model using WinBUGS are as follows:
model {
   for (j in 1:10) {                      # years
      for (i in 1:NO_OBS[j]) {            # sampling points in year j
         # Zero-inflated doubly stochastic Poisson likelihood
         Z[i,j] ~ dpois(lambda[i,j])
         lambda[i,j] <- phi[i,j]*R[i,j]
         R[i,j] ~ dbern(theta[i,j])       # R = 0 gives a structural zero
         # Gamma mixing with mean omega and dispersion alpha
         phi[i,j] ~ dgamma(nu[i,j],zeta[i,j])
         nu[i,j] <- 1/alpha
         zeta[i,j] <- 1/(omega[i,j]*alpha)
         logit(q[i,j]) <- gamma_theta_v[i]+eta[i,j]
         # This line is included in order to keep the
         # zero-inflation probability in reasonable
         # bounds, hence avoiding numerical problems
         theta[i,j] <- max(0.000000000001,min(0.999999999999,q[i,j]))
         log(omega[i,j]) <- gamma_omega_v[i]+eta[i,j]
         # Covariate effects (linear predictors)
         gamma_theta_v[i] <- inprod(gamma_theta[1:5],COV[i,1:5])
         gamma_omega_v[i] <- inprod(gamma_omega[1:5],COV[i,1:5])
      }
      # Spatial Gaussian random field with exponential correlation
      eta[1:NO_OBS[j],j] ~ spatial.exp(mu_eta[1:NO_OBS[j]],
         X[1:NO_OBS[j],j],Y[1:NO_OBS[j],j],tau2_eta,phi_eta,kappa)
      kappa <- 1
   }
   for (l in 1:max(NO_OBS[])) {
      mu_eta[l] <- 0
   }
   # Priors
   phi_eta <- 1/phi_eta1
   phi_eta1 ~ dunif(0.031,0.38)
   gamma_theta[1:5] ~ dmnorm(mu_gamma[],P[])
   gamma_omega[1:5] ~ dmnorm(mu_gamma[],P[])
   # Diagonal precision matrix for the regression coefficients
   for (g in 1:5) {
      for (h in (g+1):5) {
         P[g,h] <- 0
         P[h,g] <- 0
      }
      P[g,g] <- tau2_gamma
      mu_gamma[g] <- 0
   }
   alpha ~ dgamma(8,1.5)
   tau2_gamma ~ dgamma(0.01,0.01)
   tau2_eta ~ dgamma(2,10)
}
We note that mu_eta[] is a vector giving the mean for each area (which we found to be zero),
X[] and Y[] are vectors of length NO_OBS[] giving the x and y coordinates of each sampling point,
tau2_eta and tau2_gamma are scalar parameters representing the overall precision (inverse
variance) of the process $\eta$ and of the coefficients $\gamma_\theta$ and $\gamma_\omega$, respectively, phi_eta is a scalar
parameter representing the rate of decline of correlation with distance between points, and kappa is a
scalar parameter controlling the amount of spatial smoothing. The latter is constrained to lie in the
interval [0, 2); in our case it is equal to 1.
R[] is the hidden process that controls zero-inflation. The probability of this binary process,
theta[], is constrained to lie between the bounds 0.000000000001 and 0.999999999999 because
of numerical problems during MCMC computation. The parameter alpha is the dispersion
parameter.
We have to input data Z[] and COV[], as well as X[], Y[], and kappa. We also have to input
initial values for all the parameters, which we usually choose to be zero for parameters gamma_theta, gamma_omega, and eta, and one for the precisions tau2_eta and tau2_gamma and for parameters phi_eta1 and alpha.
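To see what the likelihood part of the program implies for the counts, one can simulate forward from it at fixed parameter values. The Python sketch below is a hedged illustration of that generative structure only (no spatial field, no covariates, no inference): the values of omega and theta are hypothetical, alpha is set near its posterior mean from Table 3, and the function name `simulate_zidsp` is ours. Note that WinBUGS parameterises dgamma by shape and rate, whereas NumPy uses shape and scale.

```python
import numpy as np

def simulate_zidsp(omega, theta, alpha, size, rng):
    """Draw counts Z ~ Poisson(phi * R), R ~ Bernoulli(theta), phi ~ Gamma."""
    # dgamma(1/alpha, 1/(omega*alpha)) in shape/rate form has mean omega
    # and variance alpha * omega**2; NumPy takes scale = 1 / rate.
    phi = rng.gamma(shape=1.0 / alpha, scale=omega * alpha, size=size)
    r = rng.binomial(1, theta, size=size)   # r = 0 marks a structural zero
    return rng.poisson(phi * r)

rng = np.random.default_rng(2)
omega, theta, alpha = 20.0, 0.8, 1.34       # omega and theta are hypothetical values
z_sim = simulate_zidsp(omega, theta, alpha, size=200_000, rng=rng)

sim_mean = z_sim.mean()                     # close to theta * omega
sim_ratio = z_sim.var() / sim_mean          # far above 1, i.e. over-dispersed
zero_fraction = (z_sim == 0).mean()         # exceeds 1 - theta
```

The simulated counts show both features the model is designed for: a variance far larger than the mean, and more zeros than the structural-zero probability alone would give, because the gamma-mixed Poisson contributes additional sampling zeros.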