Refining logistic regression models for wildlife habitat suitability modeling—A case study with...

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright


Author's personal copy

Ecological Modelling 222 (2011) 1354–1366

Contents lists available at ScienceDirect

Ecological Modelling

journal homepage: www.elsevier.com/locate/ecolmodel

Refining logistic regression models for wildlife habitat suitability modeling—A case study with muntjak and goral in the Central Himalayas, India

Aditya Singh ∗, S.P.S. Kushwaha

Forestry and Ecology Division, Indian Institute of Remote Sensing, Indian Space Research Organisation, Dehradun 248001, Uttarakhand, India

Article info

Article history: Received 15 March 2010; Received in revised form 4 February 2011; Accepted 13 February 2011

Keywords: Habitat suitability; Logistic regression; Simulation; Goral; Muntjak; Binsar; Himalayas

Abstract

High quality habitat suitability maps are indispensable for the management and planning of wildlife reserves. This is particularly important for megadiverse developing countries where shortages in skilled manpower and funding may preclude the use of mathematically complex modeling techniques and resource-intensive field surveys. In this study, we propose a simulation based k-fold partitioning and re-substitution approach to refine and update logistic regression models that are widely used for habitat suitability assessment and modeling. We test the modeling strategy using data from a rapid field survey conducted for habitat suitability assessment for muntjak (Muntiacus muntjak) and goral (Naemorrhaedus goral) in the central Himalayas, India. Results obtained from simulations match expectations in terms of model behavior and in terms of published habitat associations of the investigated species. Qualitative comparisons with predictions from the GARP, MaxEnt and Bioclimatic Envelopes modeling systems also show broad agreement with predictions obtained from the proposed technique. The proposed technique is suggested as a rapid-assessment precursor to detailed habitat studies such as patch occupancy modeling in situations where funds or trained manpower are not available.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

A robust understanding of the spatio-temporal dynamics of biodiversity-rich areas is crucial for managing landscapes under pressure from high anthropogenic stress (Noss, 1999; Roff and Taylor, 2000; Wright et al., 1994). This is especially true for megadiverse developing countries where shortages in funding (Balmford and Whitten, 2003; Bruner et al., 2004) and skilled manpower may introduce severe trade-offs in intensity (in terms of numbers of species assessed) and coverage (in terms of spatial extent) of scientific investigations. Over the past few decades, the development of gap analysis techniques has addressed some of these issues by leveraging easily available ecological data to model species-environmental associations over large areas in a spatially explicit manner (Faith and Walker, 1996; Noss, 1999; Scott et al., 1993). Logistic regression, however, remains a relatively popular technique (Guisan and Zimmermann, 2000) for its robustness, relatively straightforward statistical properties, and ease of interpretation of results (Manel et al., 1999a,b).

Logistic regression models have a long history of usage in the generation of habitat suitability maps (e.g., Guisan and Thuiller, 2005; Pearce and Ferrier, 2000; Pereira and Itami, 1991). However, there are inherent shortcomings in logistic regression that have to be taken into account (e.g., Fielding and Bell, 1997) before management decisions may be made on the basis of results. One of the major shortcomings is the apparent insensitivity of logistic regression models in handling false negatives (i.e., falsely assuming a species does not occupy a location) and the consequent bias introduced in parameter estimates because of the exclusion of such locations. One of the more advanced alternatives to logistic regression is patch occupancy modeling (MacKenzie et al., 2002). In short, patch occupancy models allow the parameterization of the species-specific probability of occupancy of a site as a function of habitat characteristics. By construction, patch occupancy models explicitly handle cases where detection probabilities are less than one (MacKenzie et al., 2002). Although habitat associations derived from patch occupancy models can, in theory, be used to generate habitat suitability maps, such applications are rare. Also, patch occupancy modeling needs data from repeat visits to the study area and requires considerable training and statistical expertise for computation as well as interpretation of results.

∗ Corresponding author. Present address: Department of Forest and Wildlife Ecology, 26B Russell Labs, 1630 Linden Dr., University of Wisconsin, Madison, WI 53706, USA. Tel.: +1 608 262 9975/352 222 6038; fax: +1 608 262 9922.

The need for rapidly producing spatially explicit habitat suitability maps for habitat management plans is of special significance to a biodiversity-rich country like India. India harbors 16 of 200 global ‘Most valuable ecoregions’ (Olson and Dinerstein, 1998) and 5 of 51 globally recognized biodiversity hotspots (Mittermeier et al., 2004). Further, the Wildlife (Protection) Act of India (MoEF, 1972) and the National Wildlife Action Plan (MoEF, 2002) mandate the formulation of habitat management plans at regular intervals for all protected areas. However, with limited funding and trained manpower available, covering even a fraction of roughly 660 protected areas in India (WII, 2009) can be a daunting task. Although patch occupancy models have been used to associate habitat characteristics with occupancy rates for a small number of species in India (Karanth et al., 2009; Krishna et al., 2008; Srinivas et al., 2008), the studies did not involve the development of spatially explicit habitat suitability/occupancy maps. When habitat suitability maps were generated on the basis of logistic regression techniques (Imam et al., 2009; Kushwaha et al., 2004; Zarri et al., 2008), the shortcomings of logistic regression described previously were not addressed.

0304-3800/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.ecolmodel.2011.02.012

Fig. 1. Location of Binsar WLS in India.

In this study, we propose a novel perturbation and simulation based approach to logistic regression for habitat suitability modeling and apply it to build habitat suitability maps for Indian muntjak (Muntiacus muntjak vaginalis, Zimmerman) and goral (Naemorhedus goral goral, Hardwicke) in the Binsar Wildlife Sanctuary (WLS) located in the central (Kumaon) Himalayas of India. Broadly, the proposed method involves iterative k-fold partitioning (Fielding and Bell, 1997; Stockwell, 1992) and re-substitution (Fielding and Bell, 1997; Osborne and Tigar, 1992; Stockwell, 1992) to evaluate randomized ‘pseudo-absence’ locations for similarity with locations where habitat use by the species has been confirmed. Final habitat suitability models are then developed with refined data using coefficients derived from generalized linear mixed models (GLMMs) accounting for spatial autocorrelation in predictors. We hope that the methods presented here will find application in studies where presence-absence data is collected for species on which limited habitat preference information is available.

This study focuses on the habitat suitability assessment of muntjak and goral for two reasons. First, these two ungulates are found at a relatively higher density in the sanctuary than any other faunal species (D.V.S. Khati, personal communication, 2004), and indirect evidence of presence such as pellet groups is therefore relatively easily identified by forest department personnel. The relatively higher density of these ungulates also makes them ideal indicator species for the greater Kumaon ecosystem. Second, the characteristics of space use by these species are relatively better known from research conducted in other sanctuaries (e.g., Ilyas and Khan, 2003; Kushwaha et al., 2004; Mishra and Johnsingh, 1996; Roy et al., 1995).

The aim of this study, therefore, is to propose a method that best leverages location data collected from rapid presence–absence surveys and which: (1) is amenable to generating spatially explicit maps and is easily interpretable, (2) leverages easily available covariate data, and (3) in general, is of use where resources in terms of time, trained personnel and funding are limited, and can be used when historical presence-absence data are available from semi-structured surveys on several indicator species. We also present a qualitative comparison of the results from the proposed technique with three widely differing habitat suitability modeling techniques, viz. Maximum entropy modeling (Phillips et al., 2006), Genetic algorithm for rule-set prediction (GARP: Stockwell and Peters, 1999; Stockwell, 1992, 1999) and Bioclimatic envelopes (Nix, 1986).

Fig. 2. Terrain-shaded land cover map of Binsar WLS derived from IRS-1D/LISS-III satellite imagery.

Fig. 3. Schematic diagram of proposed technique.

2. Materials and methods

2.1. Study area

The Binsar Wildlife Sanctuary is located in the central Himalayas, ca. 30 km north of Almora town, Kumaon region, in the Uttarakhand State of India (Fig. 1). The elevation ranges from ca. 1000 m to 2400 m asl and the climate is characterized by moderately cold (0–7 °C) winters and mild (8–33 °C) summers. The area receives an average annual rainfall of about 1500 mm. The region is characterized by forested hilly terrain and sparsely populated agro-pastoral hamlets. The general landscape is characterized by deeply dissected valleys and steep (∼30–45°) slopes. Major forest types in the sanctuary comprise oak (Quercus spp.) in the higher elevations and mixed deciduous and coniferous species in the lower reaches. Quercus leucotricophora is the dominant oak species followed by Q. floribunda (co-dominant) interspersed with small stands of Q. semicarpifolia, Q. incana and Q. glauca. Conifers dominate the lower slopes with chir pine (Pinus roxburghii) as the dominant species interspersed with isolated stands of deodar (Cedrus deodara). Rhododendrons (Rhododendron spp.) form the major shrub and understory layer in both oak and pine stands. Fauna comprise the goral (Naemorrhaedus goral), muntjak (Muntiacus muntjak), serow (Naemorrhaedus sumatraensis), Himalayan black bear (Selenursus thibetanus), yellow-throated marten (Martes flavigula), Indian porcupine (Hystrix indica) and the common leopard (Panthera pardus). The sanctuary is also known for its large assemblage of Galliformes such as the chukar partridge (Alectoris chukar), black francolin (Francolinus francolinus), hill partridge (Arborophila torqueola), koklass pheasant (Pucrasia macrolopha) and the kalij pheasant (Lophura leucomelanos).

Table 1. Land cover in Binsar WLS derived from IRS1D/LISS-III satellite imagery.

Land cover | Description | Area (ha) | % of total
Oak-Pine | ∼50–60% Quercus leucotricophora, Q. floribunda co-dominant with ∼50–60% P. roxburghii | 1981.33 | 40.25
Oak | >80% Q. leucotricophora with Rhododendron arboreum understory. Little or no herb growth. Good grass growth near water sources. | 1489.98 | 30.27
Pine | >80% P. roxburghii, stunted Q. leucotricophora and R. arboreum. Deep needle litter, good grass growth on slopes. | 970.75 | 19.72
Mixed-riverine | >40% Alnus nepalensis and other woody vegetation interspersed with small stands of Q. leucotricophora, Q. incana and/or Q. semicarpifolia and R. arboreum | 299.99 | 6.09
Agriculture | Rice paddy, wheat, seasonal vegetables | 87.12 | 1.77
Vacant | Barren/uncultivated/rock outcrop | 52.74 | 1.07
Settlement | Low-density stone masonry houses with wood/thatch roofs and livestock enclosures | 28.06 | 0.57
Riverbed | Rock talus in meandering streams, sand banks | 12.00 | 0.24
Total | | 4921.97 | 100.00

2.2. Field survey

Fig. 4. AUC trajectories (left y axis) and number of points added per iteration (right y axis) for the goral model. AUC values are presented with ±1 standard deviation bars. Subplots correspond to different combinations of liberal and conservative test/train fractions (Pt) and probability cutoff thresholds (Pc). Pt/Pc combinations are: (a) 0.75/0.75, (b) 0.75/0.50, (c) 0.50/0.75, (d) 0.50/0.50.

Fig. 5. AUC trajectories (left y axis) and number of points added per iteration (right y axis) for the muntjak model. AUC values are presented with ±1 standard deviation bars. Subplots correspond to different combinations of liberal and conservative test/train fractions (Pt) and probability cutoff thresholds (Pc). Pt/Pc combinations are: (a) 0.75/0.75, (b) 0.75/0.50, (c) 0.50/0.75, (d) 0.50/0.50.

Data for the study were sourced from a rapid field survey conducted in March 2004 (Singh and Kushwaha, 2004 unpublished report). A systematic random sampling protocol was followed for placing sampling transects. The survey design was based on a priori information obtained from semi-structured interviews with field personnel followed by a pilot survey. The survey revealed that apparent densities of the target species declined with elevation and corresponded with a general landcover-elevation gradient. Oak and oak-dominated forest classes were the most abundant in the higher elevations followed by pine dominated forest classes in the lower slopes and mixed agricultural land use around the boundaries of the sanctuary. In light of this information, eight roughly linear transects were laid radiating out from roughly the center of the sanctuary, which is also the point of highest elevation in the WLS. The first transect was oriented by looking at the second hand of a watch and multiplying the number by six (converting a reading of 0–59 s to a bearing of 0–354°). Once the orientation of the first transect had been determined, the remainder were laid in 45° increments to the first. All transects were ca. 16 km long. This arrangement ensured uniform sampling across the broad terrain gradient, across all major habitat types, and across the hypothesized gradient of target species densities. In each team, researchers assisted by forest guards marked sampling locations at ca. 500 m intervals. A 10 m circular plot was placed around each sampling location and the area searched intensively for pellets. Pellets for either species were identified based on the knowledge of forest guards and correlation with known physical characteristics, and were cross-confirmed with samples collected from local zoos. In all cases, plots were rejected if pellets were found to be in advanced stages of decay. The presence of pellet groups and, in a few cases, actual sightings of target species were considered direct evidence of habitat use. The locations of pellet groups and animal sightings were recorded in a Garmin 76™ GPS unit (Garmin International Inc., Olathe, Kansas, USA) and the forest type and general terrain characteristics were noted in a field form. Of a total of 210 surveyed locations, we collected 98 and 26 confirmed presence locations for goral and muntjak, respectively.
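As an illustration of the transect layout described above, the orientation rule can be sketched in a few lines of Python. The function name and the example watch reading are hypothetical and serve only to make the arithmetic explicit; they are not part of the survey protocol.

```python
def transect_bearings(second_hand, n_transects=8):
    """Bearings (degrees) of radial transects from a watch-based random start.

    second_hand: position of the second hand (0-59); multiplying by six
    maps it onto a bearing between 0 and 354 degrees.
    """
    if not 0 <= second_hand <= 59:
        raise ValueError("second_hand must lie between 0 and 59")
    first = second_hand * 6          # bearing of the first transect
    step = 360 / n_transects         # 45 degrees for eight transects
    return [(first + i * step) % 360 for i in range(n_transects)]


# Hypothetical reading of 17 s on the watch (not a value from the survey).
bearings = transect_bearings(17)
print(bearings)
```

Because the remaining transects follow deterministically from the first, a single random draw fixes the whole radial design.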

It is important to note that the survey design was based on a priori information on habitat use obtained from semi-structured interviews and a pilot survey. Such detailed information is not always available, in particular when presence-only data is sourced from historical records (e.g., GBIF Data Portal, www.gbif.net). It is also possible that the nature of the survey (radial transects) may have resulted in an elevated chance of recording spatially autocorrelated observations.

2.3. Spatial data generation and pre-processing

To derive land cover information, an IRS1D/LISS-III satellite image dated 5th March 2004 was obtained corresponding to the dates of the field survey. The boundary of the study area was delineated using a base map provided by the Corbett Tiger Reserve Field Director’s office at Ramnagar, Nainital. Elevation contours, villages, roads and streams were manually digitized from SOI toposheets (series 53O/10, 53O/14) and updated with satellite imagery. All data were entered in a geographic information system (GIS), resampled to a uniform 10 m resolution and spatially referenced to the Lambert Conformal Conic projection system. Terrain parameters such as elevation, terrain slope, topographic index and aspect were derived from a digital elevation model (DEM) generated from contour data. Other landscape parameters such as distances to villages and streams were derived from information extracted from SOI toposheets. A land cover map of the study area was derived by digitally classifying satellite data using maximum-likelihood supervised classification techniques in ERDAS Imagine™ (Leica Geosystems, USA) software (Fig. 2). Image classification was assisted by ‘ground-truth’ information collected during the field survey.

Table 2. Ranking procedure followed for final model selection. Wald statistics (ranks in parentheses) from linear discriminant models are compared against the area under the receiver operating characteristic curve (AUC) of generalized linear mixed models fit with radial basis spatial smoothing. Final selected models are shown in bold typeface.

Classification thresholdᵃ | Muntjak Wald | Muntjak AUC | Goral Wald | Goral AUC
0.75/0.75 (L/Cᵇ) | 0.868 (1) | 0.986 (4) | 0.801 (1) | 0.952 (4)
0.75/0.50 (L/L) | 0.687 (3) | 0.987 (3) | 0.380 (3) | 0.986 (2)
0.50/0.75 (C/C) | 0.828 (2) | 0.991 (2) | 0.744 (2) | 0.954 (3)
0.50/0.50 (C/L) | 0.682 (4) | 0.994 (1) | 0.377 (4) | 0.987 (1)

ᵃ Test–train fraction (Pt)/probability cutoff threshold (Pc).
ᵇ L: liberal threshold; C: conservative threshold.

Fig. 6. Beta coefficient trajectories (with ±1 standard deviation bars) for the goral model obtained from different combinations of liberal and conservative test/train fractions (Pt) and probability cutoff thresholds (Pc). Pt/Pc combinations are: (a) 0.75/0.75, (b) 0.75/0.50, (c) 0.50/0.75, (d) 0.50/0.50. Covariates shown: distance to streams, distance to settlements, distance to agriculture, topographic index, eastness, terrain slope. Coefficient trajectories of other covariates not shown for clarity.

Table 3. Results of generalized mixed models based on refined model output. Beta coefficients (β) for all variables are presented with standard errors (SE) and P values. Variables for which values are not presented (–) were rejected by the iterative variable selection procedure. Note: the model for muntjak was fit with a complementary log-log link function.

Variable | Muntjak β | Muntjak SE | Muntjak P | Goral β | Goral SE | Goral P
Intercept | −13.2702 | 1.8696 | <0.0001 | −7.4303 | 5.8001 | 0.2004
Distance to oak | −2.1791 | 0.7264 | 0.0028 | – | – | –
Distance to oak-pine | −1.4264 | 0.4746 | 0.0027 | – | – | –
Distance to pine | 3.0048 | 1.0922 | 0.0060 | −0.5474 | 0.1945 | 0.0050
Distance to riparian | – | – | – | −0.6263 | 0.1931 | 0.0012
Distance to drainage | – | – | – | −0.6768 | 0.1707 | <0.0001
Distance to settlements | −2.1281 | 0.5968 | 0.0004 | −0.9906 | 0.3708 | 0.0077
Elevation | 2.4059 | 1.8696 | 0.0009 | 3.4756 | 0.4471 | <0.0001
Northnessᵃ | −1.1396 | 0.3053 | 0.0002 | 0.7868 | 0.1643 | <0.0001
Distance to road | – | – | – | −0.5706 | 0.1275 | <0.0001
Distance to vacant land | – | – | – | 0.6094 | 0.2816 | 0.0307
Terrain slope | −0.5329 | 0.2437 | 0.0290 | – | – | –

ᵃ Northness: sin(terrain aspect).

Fig. 7. Beta coefficient trajectories (with ±1 standard deviation bars) for the muntjak model obtained from different combinations of liberal and conservative test/train fractions (Pt) and probability cutoff thresholds (Pc). Pt/Pc combinations are: (a) 0.75/0.75, (b) 0.75/0.50, (c) 0.50/0.75, (d) 0.50/0.50. Covariates shown: distance to streams, distance to settlements, distance to agriculture, topographic index, eastness, terrain slope. Coefficient trajectories of other covariates not shown for clarity.

Maps of distance-based measures (such as distance to villages, streams and land cover types) were log-transformed prior to all analyses. Maps of distances to land cover types were used as covariates instead of using land cover as a categorical variable to preclude biases associated with classification errors. All maps were standardized (centered and scaled) prior to all analyses. The final set of covariates included elevation, aspect, slope, topographic index, distances to individual land cover types, distance to villages and distance to streams. To preclude complications involving circularity in terrain aspect, we transformed aspect to produce maps of ‘eastness’ and ‘northness’ by taking the cosine and sine of aspect, respectively.
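The covariate transformations described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the additive offset inside the logarithm is our assumption (the text does not say how zero distances were handled), and the use of the population standard deviation for scaling is likewise assumed.

```python
import math
from statistics import mean, pstdev


def log_distance(distance_m, offset=1.0):
    """Log-transform a distance; the offset (an assumption) avoids log(0)."""
    return math.log(distance_m + offset)


def standardize(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def aspect_components(aspect_deg):
    """Return (eastness, northness) as (cos, sin) of aspect, as in the text.

    This pairing matches compass directions when aspect is measured
    counter-clockwise from east; for aspect measured clockwise from north
    the two roles are swapped.
    """
    a = math.radians(aspect_deg)
    return math.cos(a), math.sin(a)


# Hypothetical distances to streams (m) at four locations.
z = standardize([log_distance(d) for d in (0.0, 10.0, 120.0, 900.0)])
eastness, northness = aspect_components(90.0)
```

Splitting aspect into two bounded components removes the artificial jump between 359° and 0° that would otherwise affect a linear predictor.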

2.4. Generation of randomized ‘pseudo-absences’

The density at which randomized pseudo-absences were added to survey locations was based on the spatial structure of the most important habitat covariates (distances to oak, mixed oak-pine and pine forests, and elevation). Covariate data were extracted at survey locations and empirical semivariograms were calculated for each habitat layer. Exponential models were found to fit well with all semivariograms and were used to calculate the range at which the semivariance of each habitat covariate reached its asymptote. The semivariogram for elevation did not have a determinate range and was not included in the calculation of the mean. The density at which pseudo-absence points were to be added to the data (as background or pseudo-absence locations) was determined from the mean of the range parameter of exponential models fitted to the four semivariograms (mean = 187.83 ± 64.92 m). Once the density had been determined, we removed all absence data obtained from the survey (n = 113) and generated random locations within the boundary of the sanctuary. All random locations that fell within 188 m of a presence location were removed. In all, 1123 points were added to the dataset as pseudo-absence locations. Finally, covariate information for all surveyed presence and pseudo-absence locations was obtained by intersecting locations with GIS layers.
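The exclusion step described above can be sketched as follows. The sketch is illustrative only: the function name, the rectangular window standing in for the sanctuary boundary, the number of candidate points and the example coordinates are all assumptions, and only the 188 m exclusion distance is taken from the text.

```python
import math
import random


def generate_pseudo_absences(presences, bounds, n_candidates,
                             min_dist=188.0, seed=0):
    """Draw random points and drop those within min_dist of any presence.

    presences: list of (x, y) coordinates in metres.
    bounds: (xmin, ymin, xmax, ymax); a rectangle stands in for the
            sanctuary boundary here, which is a simplifying assumption.
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    kept = []
    for _ in range(n_candidates):
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if all(math.hypot(x - px, y - py) > min_dist for px, py in presences):
            kept.append((x, y))
    return kept


# Hypothetical presence coordinates in a 7 km x 7 km window.
presences = [(1000.0, 1200.0), (3500.0, 4100.0), (5200.0, 2600.0)]
pseudo = generate_pseudo_absences(presences, (0, 0, 7000, 7000), 500)
```

In a GIS workflow the rectangle would be replaced by a point-in-polygon test against the sanctuary boundary, and coordinates would need to be in a projected (metric) system for the distance threshold to be meaningful.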

3. Statistical analysis and spatial mapping

3.1. Statistical modeling strategy

Fig. 8. Scatterplot of locations updated to ‘presences’ and field-verified presence locations overlaid on bivariate density kernels of environmental covariates defined by field-verified presence locations for muntjak (subplots a and b) and goral (subplots c and d). Shaded contours correspond to confidence intervals depicted in legend. Constricted kernel estimates between 0 and 10 m are an artifact of the 10 m cell resolution of spatial data. Note: All axes in log scale.

The modeling strategy was based on an iterative partitioning and model-building approach to identify pseudo-absence locations that best matched the environmental space defined by field-verified presence locations for each species. Each model run consisted of 1000 simulations nested within a model iteration (Fig. 3). At the start of each simulation, the data were partitioned into a test set and a training set according to a predefined test/train fraction (Pt). Each training set was used to build a logistic regression model using all covariates as predictors. Beta coefficients obtained from each logistic regression model were stored in a temporary array. For each simulation, pseudo-absence locations were classified into temporary ‘presences’ or ‘absences’ according to a probability classification threshold (Pc) and stored in a separate array. For every iteration of the model, therefore, 1000 logistic regression models were built. At the end of each iteration, if a pseudo-absence location was classified as a ‘presence’ more than 75% of the time (>750 simulations), the location was classified as a ‘presence’, the response vector updated, and a new iteration initiated with the updated database. At the end of each iteration, the means of the beta coefficients of each covariate were compared with the means of the beta coefficients of the same covariate from the previous iteration using Bonferroni adjusted t-tests. Model iteration terminated when non-significant P values were obtained simultaneously for all covariates. For each logistic regression model built in each simulation, we also stored the area under the receiver operating characteristic curve (AUC; Fielding and Bell, 1997). It should be noted here that Lobo et al. (2008) and Peterson et al. (2008) have shown that the AUC statistic can be a misleading measure of model performance in species distribution modeling contexts. In particular, Peterson et al. (2008) have suggested adjusting the “E” parameter (detailed in Peterson et al., 2008) to modulate the AUC statistic to a range reflective of the confidence one has in the ‘true’ presence rate. For the purposes of this study, however, we use the AUC in the native sense because all locations where the presence of the respective species was questionable had been removed. All statistical analysis was conducted using the R statistical computing environment (R, 2008).
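The partitioning and re-substitution loop described above can be sketched in plain Python as follows. This is a simplified illustration rather than the code used in the study: the function names are hypothetical, logistic regression is fit by basic gradient ascent as a stand-in for a standard GLM routine, and iteration stops when no location is promoted, which replaces the Bonferroni adjusted t-tests on mean beta coefficients with a simpler rule.

```python
import math
import random


def predict(beta, x):
    """Logistic probability for covariate vector x; beta[0] is the intercept."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    eta = max(-30.0, min(30.0, eta))     # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Fit by plain gradient ascent (a stand-in for a standard GLM routine)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - predict(beta, xi)
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def refine_once(X, y, is_pseudo, rng, n_sims=1000, pt=0.75, pc=0.75):
    """One model iteration: n_sims random partitions, then a 75% vote."""
    n = len(X)
    votes = [0] * n
    for _ in range(n_sims):
        idx = list(range(n))
        rng.shuffle(idx)
        train = idx[: int(pt * n)]       # Pt = fraction used for training
        beta = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in range(n):
            # Re-substitution: score every remaining pseudo-absence.
            if is_pseudo[i] and y[i] == 0 and predict(beta, X[i]) > pc:
                votes[i] += 1
    promoted = [i for i in range(n) if votes[i] > 0.75 * n_sims]
    new_y = list(y)
    for i in promoted:
        new_y[i] = 1
    return new_y, promoted


def refine(X, y, is_pseudo, max_iter=20, seed=0, **kwargs):
    """Repeat iterations until no location is promoted (simplified stop rule)."""
    rng = random.Random(seed)
    for _ in range(max_iter):
        y, promoted = refine_once(X, y, is_pseudo, rng, **kwargs)
        if not promoted:
            break
    return y
```

Here `X` is a list of standardized covariate vectors, `y` the 0/1 response and `is_pseudo` a flag marking pseudo-absence rows; calling `refine(X, y, is_pseudo, n_sims=1000, pt=0.75, pc=0.75)` would correspond to the liberal Pt and conservative Pc scenario.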

To test the influence of varying Pc and Pt on model progress, we ran four scenarios using combinations of liberal and conservative test/train fractions (Pt) and probability classification thresholds (Pc). We called Pt ‘liberal’ when only 25% of data were withheld from the training data set. We called the Pc ‘liberal’ when a >0.5 predicted probability was allowed to result in the location being declared a ‘presence’ in any simulation. The conservative strategies correspondingly used 50% of the data as a training set, and required a >0.75 predicted probability for a point to be labeled as a ‘presence’. Note that, under either scenario, a pseudo-absence location still needed to be classified as a ‘presence’ in more than 750 of 1000 simulations for the location to be permanently updated as a ‘presence’ in the dataset.

For the conservative test/train fraction (Pt: 50% of data withheld from training), we expected the model to oscillate longer than for the liberal strategy (Pt: 25% of data withheld from training). We expected this because a larger amount of data excluded from the training fraction would likely increase stochasticity in the simulations. Further, we expected the model to oscillate more for the liberal probability classification threshold (Pc = 0.5) because a lower Pc would enable a relatively larger number of points to be classified as presences. The large number of points added in each iteration would likely produce large oscillations in the definition of suitable environmental space. Overall, we expected the best model to be a trade-off between larger perturbations in the training data set (lower Pt) and a higher selection probability cut-off (higher Pc). As an obvious consequence of the two strategies, we expected a minimal number of points to be added to the dataset under the combination of a liberal test/train fraction strategy and a conservative probability cutoff threshold. Conversely, we expected the maximum number of points to be added under the conservative test/train fraction strategy and a liberal probability cut-off threshold.

Fig. 9. Scatterplot of predictions obtained from the maximum entropy model, genetic algorithm for rule-set prediction and bioclimatic envelope model, all in gray, overlaid on predictions from proposed technique. Legends and axes scales similar to Fig. 8.

To select the best model from the four sets of scenarios, we evaluated the updated presence/absence dataset by following a quasi-heuristic strategy. First, for each dataset, we conducted a linear discriminant analysis (LDA) to test how well the updated presences and absences were discriminated as a function of all covariates. Next, for each updated dataset, we fit a series of logistic regression models using the generalized linear mixed modeling framework using radial basis smoothing functions to account for spatial autocorrelation in predictors. The GLMMs were fit using the GLIMMIX procedure of the statistical computing software SAS® v9.2 (SAS Institute, Cary NC, USA). Models for muntjak were fit using a complementary log–log link because the low number of presence points sometimes caused convergence problems in the GLIMMIX procedure. Each such model was first fit with all variables in the predictor set. We then sequentially dropped each non-significant predictor until only significant variables remained. AUC statistics were calculated for each final model thus generated. We ranked the AUC and LDA Wald statistics and compared the ranks to arrive at the models that were the best in terms of differences in LDA scores and simultaneously had the minimum commission and omission errors. Further, to test whether newly added points were ecologically similar to field-verified presence locations, we calculated bivariate kernel density estimates of covariates based on field-verified presence records and overlaid them with model updated presence locations. Pseudo-absence locations updated to presences were considered realistic if they fell within 95% confidence intervals of the environmental space defined by field-verified presence locations.
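The two quantities compared in the ranking step can be illustrated with a short sketch. The helper names are hypothetical; the AUC is computed here through its rank interpretation (the probability that a randomly chosen presence receives a higher score than a randomly chosen absence), and the example values are the muntjak entries of Table 2. The text does not spell out how the two sets of ranks were combined, so the sketch stops at producing them.

```python
def auc(scores, labels):
    """AUC as the probability that a presence outscores an absence."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def rank_desc(values):
    """Rank 1 for the largest value, 2 for the next, and so on."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


# Muntjak values from Table 2, in the order 0.75/0.75, 0.75/0.50,
# 0.50/0.75, 0.50/0.50 (Pt/Pc).
wald = [0.868, 0.687, 0.828, 0.682]
auc_glmm = [0.986, 0.987, 0.991, 0.994]
wald_ranks = rank_desc(wald)
auc_ranks = rank_desc(auc_glmm)
```

For the muntjak scenarios the two rankings run in nearly opposite directions, which is why a compromise between discrimination (Wald) and classification accuracy (AUC) has to be struck.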

3.2. Spatial mapping

We used the beta coefficients from models selected from the ranking procedure to spatially predict probability of occurrence for both species across the wildlife sanctuary. The probability maps obtained were converted to suitable/unsuitable habitat zones based on a cutoff probability defined by the point where the sensitivity and specificity of the final logistic model were maximized. Further, we used a majority filter to remove random artifacts and selected all patches larger than 0.5 km² as being of management importance. We cross-tabulated the final habitat suitability models with the existing land cover map to assess which habitat types were the most important for the two species and would need the most management attention.
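The prediction and thresholding steps can be sketched as follows. Several assumptions are made and should be read as such: the function names are hypothetical, the spatial smoothing term of the GLMM is omitted, the cutoff is taken as the threshold maximizing the sum of sensitivity and specificity (one possible reading of the criterion stated above), and the covariate values of the example cell are invented. The goral coefficients are the fixed effects reported in Table 3.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_cloglog(eta):
    """Inverse of the complementary log-log link (used for muntjak)."""
    return 1.0 - math.exp(-math.exp(eta))


def predict_cell(intercept, betas, covariates, link=inv_logit):
    """betas and covariates are dicts keyed by covariate name."""
    eta = intercept + sum(b * covariates[name] for name, b in betas.items())
    return link(eta)


def best_cutoff(probs, labels):
    """Cutoff maximizing sensitivity + specificity (an assumed reading)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_score = None, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, l in zip(probs, labels) if p >= t and l == 1)
        tn = sum(1 for p, l in zip(probs, labels) if p < t and l == 0)
        score = tp / n_pos + tn / n_neg
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Goral fixed effects from Table 3; covariates are standardized maps.
goral_betas = {
    "dist_pine": -0.5474, "dist_riparian": -0.6263, "dist_drainage": -0.6768,
    "dist_settlements": -0.9906, "elevation": 3.4756, "northness": 0.7868,
    "dist_road": -0.5706, "dist_vacant": 0.6094,
}
cell = {name: 0.0 for name in goral_betas}   # hypothetical cell at map means
cell["elevation"] = 1.5                      # hypothetical: 1.5 SD above mean
p_goral = predict_cell(-7.4303, goral_betas, cell)
```

For the muntjak model the same function would be called with `link=inv_cloglog`, in keeping with the complementary log-log link reported in Table 3.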

3.3. Comparisons with other techniques

For comparison purposes, we built habitat suitability models using MaxEnt (Phillips et al., 2006), GARP (Stockwell and Peters, 1999; Stockwell, 1992, 1999) and Bioclimatic Envelopes (Nix, 1986). We used OpenModeller (de Souza Munoz et al., 2009) implementations of all aforementioned models using default settings. Because MaxEnt typically outputs probabilities as compared to binary presence/absence predictions from other models, we reclassified results from MaxEnt into presences if the predicted probability was greater than 0.5. To compare predictions of each model with the proposed technique, we plotted predictions from each of the models on bivariate kernel density plots of combinations of covariates. Predictions from the models were considered realistic if they fell at least within 95% probability contours of the environmental space defined by field-verified presences. As the models differ considerably in their respective quantitative frameworks and theoretical basis, we did not attempt a detailed comparison with the proposed technique.

Fig. 10. Final habitat suitability maps for goral and muntjak. Areas are shown as habitat suitable for goral, muntjak and area of overlap between the two.

4. Results

4.1. Satellite data interpretation and field survey

Analysis of satellite data in conjunction with the field survey revealed that oak-pine mixed forests comprise the largest share (40.25%, 19.81 km²) of the study area. Oak forests are the next major forest type (30.27%, 14.89 km²) followed by pine forests (19.72%, 9.70 km², Table 1). Of a total of 210 surveyed locations, 98 and 26 confirmed presence locations were collected for goral and muntjak, respectively. Presence locations for muntjak were mostly found in the higher elevation oak and oak-mixed forests whereas presence locations of goral were distributed throughout high elevation oak and pine dominated forests and ecotones therein.

4.2. Model behavior

All models terminated after different numbers of iterations depending on the combination of liberal and conservative test/train fractions and probability cutoff thresholds. Plots of AUC statistics averaged over each iteration suggested that all models had reached stable asymptotes before terminating (Figs. 4 and 5). Model trajectories and the number of points updated to presences were as expected. For goral, the model terminated with only four points converted to presences for liberal Pt and conservative Pc (Fig. 4a) in contrast to 296 points converted to presences for conservative Pt and liberal Pc (Fig. 4d). Similarly for muntjak, the model terminated with only three points converted to presences for liberal Pt and conservative Pc (Fig. 5a) in contrast to 43 points converted to presences for conservative Pt and liberal Pc (Fig. 5d). As expected, estimates of beta coefficients oscillated the most for liberal Pc in all cases for both species (Figs. 6 and 7). However, when a conservative Pt was used, the effects were only discernible in terms of the number of points converted to presences (18 for goral and 9 for muntjak, Fig. 4c and Fig. 5c).

Comparing models based on the ranks of the LDA Wald statistic and GLMM model AUCs revealed that for muntjak, a combination of conservative Pt and conservative Pc was the best at discriminating between presence and absence records and minimizing commission and omission errors (Table 2). The results for goral were, however, inconclusive. We selected the same combination (conservative Pt and Pc) as the final model for goral based on a visual comparison of the coefficient trajectories of the goral model (Fig. 6) with that of the muntjak model (Fig. 7). For both goral and muntjak, almost all locations converted to presences from absences (or random) fell within 80% confidence intervals of the environmental space defined by field-verified presence locations (Fig. 8).

Comparisons based on overlaying predictions from GARP, MaxEnt and Bioclimatic Envelopes on bivariate kernel density plots of covariates revealed that predictions from the proposed technique were in broad agreement with the results from the three models compared (Fig. 9). Predictions from the proposed technique were, however, consistently more conservative.

4.3. Habitat suitability mapping

Probability-of-presence maps were built for both species using coefficients obtained from iteratively fit GLMMs (Table 3) using updated presence/absence locations as the response variable. Suitable habitat was extracted from the probability maps using probability cut-offs based on the point where the sensitivities and specificities of the respective models were maximized. After filtering for artifacts and rejecting all patches <0.5 km² in size, the areas suitable for muntjak and goral were found to be 178.04 ha (3.62% of the WLS) and 421.50 ha (8.56% of the WLS) respectively, with a 168.55 ha (3.42% of the WLS) overlap (Fig. 10). Oak and oak-pine mixed forests were the most suitable for muntjak (98.78% of area suitable) and oak, oak-pine mixed and pine forests were the most suitable for goral (85.50% of area suitable). The area of overlap between the two comprised mostly oak and oak-pine forest (98.89% of overlap area). Results of cross-tabulation of the areas suitable for the two species and existing land cover are presented in Table 4.

Table 4. Results of final habitat suitability models for muntjak and goral cross-tabulated with land cover of the Binsar WLS. Results are shown as area suitable (in ha) with proportions (in percentages) suitable as a percentage of total area suitable (% of total) and as a percentage of that land cover type available in the WLS (% of available, abbreviated % of avail.).

                      Binsar WLS                     Muntjak                                Goral                              Overlap
Land cover        Area (ha)  % of total   Area (ha)  % of total % of avail.   Area (ha)  % of total % of avail.   Area (ha)  % of total % of avail.
Oak-Pine            1981.33       40.25       15.89        8.92        0.80      109.63       26.01        5.53       13.88        8.23        0.70
Oak                 1489.98       30.27      159.98       89.86       10.74      196.04       46.51       13.16      152.80       90.66       10.26
Pine                 970.75       19.72        0.01        0.01        0.00       54.72       12.98        5.64        0.01        0.01        0.00
Mixed-Riverine       299.99        6.09        0.72        0.40        0.24       44.09       10.46       14.70        0.72        0.43        0.24
Agriculture           87.12        1.77        0.62        0.35        0.71       12.56        2.98       14.42        0.62        0.37        0.71
Vacant                52.74        1.07        0.78        0.44        1.48        2.99        0.71        5.67        0.48        0.28        0.91
Settlement            28.06        0.57        0.04        0.02        0.14        1.47        0.35        5.24        0.04        0.02        0.14
Riverbed              12.00        0.24        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00        0.00
Total               4921.97      100.00      178.04      100.00        3.62      421.50      100.00        8.56      168.55      100.00        3.42
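The percentages quoted for Table 4 are simple ratios of areas, and the short sketch below reproduces them as a worked check. The areas are transcribed from Table 4; the function and variable names are illustrative only.

```python
# Areas in ha from Table 4: (available in the WLS, muntjak, goral, overlap).
TABLE4 = {
    "Oak-Pine": (1981.33, 15.89, 109.63, 13.88),
    "Oak": (1489.98, 159.98, 196.04, 152.80),
    "Pine": (970.75, 0.01, 54.72, 0.01),
    "Mixed-Riverine": (299.99, 0.72, 44.09, 0.72),
    "Agriculture": (87.12, 0.62, 12.56, 0.62),
    "Vacant": (52.74, 0.78, 2.99, 0.48),
    "Settlement": (28.06, 0.04, 1.47, 0.04),
    "Riverbed": (12.00, 0.00, 0.00, 0.00),
}
COLUMN = {"muntjak": 1, "goral": 2, "overlap": 3}


def share_of_suitable(species, classes):
    """Area (ha) in the given land cover classes and its percentage of the
    total area suitable for `species` (the '% of total' column)."""
    col = COLUMN[species]
    total = sum(row[col] for row in TABLE4.values())
    area = sum(TABLE4[name][col] for name in classes)
    return round(area, 2), round(100.0 * area / total, 2)


def share_of_available(species, land_cover):
    """Suitable area as a percentage of that land cover type in the WLS
    (the '% of available' column)."""
    row = TABLE4[land_cover]
    return round(100.0 * row[COLUMN[species]] / row[0], 2)


muntjak_oak = share_of_suitable("muntjak", ["Oak", "Oak-Pine"])
goral_forest = share_of_suitable("goral", ["Oak", "Oak-Pine", "Pine"])
overlap_oak = share_of_suitable("overlap", ["Oak", "Oak-Pine"])
```

For example, `muntjak_oak` gives 175.87 ha and 98.78%, matching the oak and oak-pine share reported for muntjak.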

5. Discussion and conclusions

The primary aim of this study was to develop a habitat suitability modeling approach that was easily interpretable and could leverage easily available covariate data and data collected from structured rapid presence-absence surveys. We proposed a method employing iterative k-fold partitioning (Fielding and Bell, 1997; Stockwell, 1992) and re-substitution (Fielding and Bell, 1997; Osborne and Tigar, 1992; Stockwell, 1992) to evaluate and update random and surveyed ‘absence’ locations for similarity with field-verified presence locations. We used updated presence/absence data to build standard logistic regression models and used the results to generate habitat suitability models for two ungulate species in the Binsar WLS.
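A heavily simplified sketch of such an update loop is given below. It is one possible reading of the procedure, not the authors' R code: it uses a single random train/test split per iteration instead of full k-fold partitioning, a plain gradient-ascent logistic fit instead of a GLMM, and a stopping rule (two consecutive iterations without updates) that is assumed. The parameters `pt` and `pc` stand for the training fraction and the probability cutoff; all data are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, steps=300, rate=0.5):
    """Gradient-ascent logistic regression; returns [intercept, b1, ...]."""
    beta = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, labels):
            err = y - sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
            grad[0] += err
            for k, v in enumerate(x, start=1):
                grad[k] += err * v
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta


def refine_absences(rows, labels, pt=0.75, pc=0.75, patience=2,
                    max_iter=30, seed=1):
    """Repeatedly fit on a random fraction `pt` of the data and relabel
    withheld absences whose predicted probability exceeds `pc`."""
    labels = list(labels)
    rng = random.Random(seed)
    history, quiet = [], 0
    for _ in range(max_iter):
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        n_train = int(round(pt * len(idx)))
        train, test = idx[:n_train], idx[n_train:]
        beta = fit_logistic([rows[i] for i in train], [labels[i] for i in train])
        updated = 0
        for i in test:
            p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], rows[i])))
            if labels[i] == 0 and p > pc:
                labels[i] = 1  # absence re-substituted as a presence
                updated += 1
        history.append(updated)
        quiet = quiet + 1 if updated == 0 else 0
        if quiet >= patience:  # assumed stopping rule
            break
    return labels, history


# Hypothetical standardized covariate: presences high, absences low,
# plus three doubtful 'absences' located among the presences.
rows = ([(1.0 + 0.1 * i,) for i in range(10)]
        + [(-1.0 - 0.1 * i,) for i in range(20)]
        + [(1.2,), (1.5,), (1.8,)])
start = [1] * 10 + [0] * 23
final, history = refine_absences(rows, start)
```

In this sketch a point is never converted back to an absence, so the number of presences can only grow or stay the same between iterations.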

5.1. Model behavior

A visual analysis of model trajectories revealed that model behavior was according to expectations for all combinations of liberal vs. conservative test/train fractions and probability cutoff thresholds. In addition, the trajectories of beta coefficient estimates and model AUCs revealed that models were largely stable when a conservative (Pc = 0.75) probability classification threshold was employed. Although there were no major differences in stability in beta coefficients between liberal and conservative test/train fraction thresholds, it was observed that a liberal test/train threshold (Pt = 0.75, 25% data withheld) caused relatively few new points to be updated to presences for either species. This could be because of the relatively small perturbations to the data set under the liberal data partitioning scheme.

Further, we observed that all models had converged to stable coefficient estimates for at least two iterations before termination. Also, AUC trajectories for all models appeared to have reached asymptotes when the overall model stabilized. In almost all models, parameter estimates stabilized soon after no more new points were updated to presences. This may indicate that most models had attained convergence around a stable solution for the environmental space defined by the updated presence records. Importantly, covariate coefficients for most models stabilized at points different from their starting conditions. It is important to note that model behavior was generally the same for both species although they had widely differing initial numbers of presence locations (26 for muntjak, 98 for goral). This observation may indicate that although logistic regression could be sensitive to the proportion of ‘presence’ points in a database, a randomization approach could feasibly be used for rigorously exploring habitat associations between competing models. Model performance was conservative in general, but qualitatively similar to other widely used species distribution models such as GARP, MaxEnt and Bioclimatic Envelopes.

5.2. Habitat suitability patterns

For both species, models developed from updated presence locations revealed habitat preference patterns similar to previous research in the Himalayan region. For example, both species showed preferences for higher elevations in the sanctuary. As there is relatively little human presence in the higher reaches of the sanctuary, this may indicate avoidance of higher human densities and agricultural pressure in the lower reaches of the sanctuary. However, our model also indicated that goral might prefer to use habitat nearer to human settlements (Table 3). This is at variance with Roy et al. (1995) and Mishra and Johnsingh (1996), who suggested that goral likely avoids human settlements. We speculate that this divergence may be due to intrinsic differences between the two landscapes. Topographically, Binsar WLS in the Central Himalayas is relatively more dissected as compared to Rajaji NP (located in the Himalayan foothills; Roy et al., 1995) and therefore steep slopes and escape routes are probably more accessible in Binsar. Also, the Himalayan foothills have a significantly higher anthropogenic disturbance (Roy et al., 1995) as compared to Binsar, which has a single road terminating at the highest point in the WLS. Therefore, the differences between the habitat preferences likely reflect the differences in escape routes and water availability between the two habitats. However, it is also possible that goral prefer areas abutting agricultural land in Binsar due to good grass growth (sensu Mishra and Johnsingh, 1996). The response of muntjak was similar to that of goral with regard to elevation and distance to settlements. However, muntjak differed from goral with respect to distance to pine forest and aspect of slope. Whereas goral preferred habitat in proximity to pine forest, muntjak preferred areas with low slopes and an abundance of oak and oak-mixed forests. Also, muntjak preferred south-facing slopes whereas goral preferred north-facing slopes. These differences may indicate sympatric division in habitat use between the two species.

Of the area assessed suitable for muntjak, 98.78% (175.87 ha) comprised oak and oak-pine mixed forests. Similarly, of the area assessed suitable for goral, 85.50% (360.39 ha) comprised oak, oak-pine and pine forests. Of the area of overlap between the two, 98.89% (166.68 ha) comprised oak and oak-pine forests (Table 4). The high proportion of oak and oak-pine forests preferred by the two species and the amount represented in the overlap indicate the management importance of this habitat. Unstructured interviews with field personnel suggested that high elevation oak forests are under pressure from an increasing frequency of rogue forest fires. Reportedly, low-intensity fires are lit by village folk in lower elevations of the WLS for better grass growth in the grazing season. These fires frequently escape control zones and burn as low-intensity ground fires for prolonged periods in the higher elevations of the WLS. Although we currently do not have data to support this hypothesis, it has been speculated that these fires confer an advantage to the fire-adapted chir pine, which colonizes freshly burned patches while outcompeting native oak species. Although wildlife poaching in the WLS has been effectively controlled by active patrolling by forest officers, indirect anthropogenic stressors like uncontrolled forest fires may have an adverse impact on native fauna that depend on these oak forests for forage and refuge. We suggest that research on the role of forest fires in influencing habitat preferences of ungulates and forest succession be urgently taken up in the central Himalayan ecosystem.

6. Conclusions

In conclusion, field data updated with the proposed model refining strategy resulted in progressively better fits (in terms of higher AUCs, but see Lobo et al., 2008; Peterson et al., 2008) than models built with only field-collected data. Model behavior was according to expectations and habitat suitability maps developed from model outputs were consistent with generally accepted species-habitat associations of the studied species. We reiterate that the methods outlined above are not suggested as alternatives to current standards in habitat assessment techniques (e.g. patch occupancy models: MacKenzie et al., 2002) but should help supplement the toolbox available to the species distribution modeler. We believe that the proposed method will allow a larger number of habitat assessments to be carried out in situations where adequate funds and statistical expertise are not available in the short term but where semi-structured field surveys can feasibly be conducted. The techniques outlined may also help gap analyses to be conducted where sparse (but reliable) presence data are available from planned field surveys, and may provide a baseline for future detailed wildlife habitat assessments using current standards in habitat assessment methods. R code required to run the models is available from the authors on request.

Acknowledgements

We wish to thank Mr. D.V.S. Khati, Field Director, Corbett National Park for permission and logistics for field work. AS was sponsored by the Centre for Environmental Planning and Technology, Ahmedabad. Travel and equipment support was provided by the Indian Institute of Remote Sensing, Indian Space Research Organisation, Dehradun. We also wish to thank Shawn Serbin and Clayton Kindon, Department of Forest and Wildlife Ecology, Univ. of Wisconsin-Madison, USA for their suggestions for improvements in the manuscript. We extend our thanks to forest officer, Mr. R.S. Dangwal and all forest staff of the Binsar WLS for their dedication and generous cooperation and help during the field survey. We also extend our thanks to three anonymous reviewers for valuable suggestions that resulted in substantial improvements to the manuscript.

References

Balmford, A., Whitten, T., 2003. Who should pay for tropical conservation, and how could the costs be met? Oryx 37, 238–250.

Bruner, A.G., Gullison, R.E., Balmford, A., 2004. Financial costs and shortfalls of managing and expanding protected-area systems in developing countries. Bioscience 54, 1119–1126.

de Souza Munoz, M., De Giovanni, R., de Siqueira, M.F., Sutton, T., Brewer, P., Pereira, R.S., Canhos, D.A.L., Canhos, V.P., 2009. openModeller: a generic approach to species’ potential distribution modelling. GeoInformatica, 1–25.

Faith, D.P., Walker, P.A., 1996. Environmental diversity: on the best-possible use of surrogate data for assessing the relative biodiversity of sets of areas. Biodiversity and Conservation 5, 399–415.

Fielding, A.H., Bell, J.F., 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24, 38–49.

Guisan, A., Thuiller, W., 2005. Predicting species distribution: offering more than simple habitat models. Ecology Letters 8, 993–1009.

Guisan, A., Zimmermann, N.E., 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135, 147–186.

Ilyas, O., Khan, J.A., 2003. Food habits of barking deer (Muntiacus muntiak) and goral (Naemorheadus goral) in Binsar Wildlife Sanctuary, India. Mammalia 67, 521–531.

Imam, E., Kushwaha, S.P.S., Singh, A., 2009. Evaluation of suitable tiger habitat in Chandoli National Park, India, using multiple logistic regression. Ecological Modelling 220, 3621–3629.

Karanth, K.K., Nichols, J.D., Hines, J.E., Karanth, K.U., Christensen, N.L., 2009. Patterns and determinants of mammal species occurrence in India. Journal of Applied Ecology 46, 1189–1200.

Krishna, Y.C., Krishnaswamy, J., Kumar, N.S., 2008. Habitat factors affecting site occupancy and relative abundance of four-horned antelope. Journal of Zoology 276, 63–70.

Kushwaha, S.P.S., Khan, A., Habib, B., Quadri, A., Singh, A., 2004. Evaluation of sambar and muntjak habitats using geostatistical modelling. Current Science 86, 1390–1400.

Lobo, J.M., Jimenez-Valverde, A., Real, R., 2008. AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17, 145–151.

MacKenzie, D.I., Nichols, J.D., Lachman, G.B., Droege, S., Royle, J.A., Langtimm, C.A., 2002. Estimating site occupancy rates when detection probabilities are less than one. Ecology 83, 2248–2255.

Manel, S., Dias, J.M., Buckton, S.T., Ormerod, S.J., 1999a. Alternative methods for predicting species distribution: an illustration with Himalayan river birds. Journal of Applied Ecology 36, 734–747.

Manel, S., Dias, J.M., Ormerod, S.J., 1999b. Comparing discriminant analysis, neural networks and logistic regression for predicting species distributions: a case study with a Himalayan river bird. Ecological Modelling 120, 337–347.

Mishra, C., Johnsingh, A.J.T., 1996. On habitat selection by the goral Nemorhaedus goral bedfordi (Bovidae, Artiodactyla). Journal of Zoology 240, 573–580.

Mittermeier et al., 2004. Hotspots Revisited: Earth’s Biologically Richest and Most Endangered Ecoregions. Mexico City, Mexico.

MoEF, 1972. Indian Wildlife (Protection) Act, 1972. Dehra Dun, India.

MoEF, 2002. National wildlife action plan (2002–2016).

Nix, H.A., 1986. A biogeographic analysis of Australian elapid snakes. In: Longmore, R. (Ed.), Atlas of Elapid Snakes of Australia, Australian Flora and Fauna Series, No. 7, pp. 4–15.

Noss, R.F., 1999. Assessing and monitoring forest biodiversity: a suggested framework and indicators. Forest Ecology and Management 115, 135–146.

Olson, D.M., Dinerstein, E., 1998. The global 200: a representation approach to conserving the Earth’s most biologically valuable ecoregions. Conservation Biology 12, 502–515.

Osborne, P.E., Tigar, B.J., 1992. Interpreting bird atlas data using logistic-models—an example from Lesotho, Southern Africa. Journal of Applied Ecology 29, 55–62.

Pearce, J., Ferrier, S., 2000. Evaluating the predictive performance of habitat models developed using logistic regression. Ecological Modelling 133, 225–245.

Pereira, J.M.C., Itami, R.M., 1991. GIS-based habitat modeling using logistic multiple-regression—a study of the Mt. Graham red squirrel. Photogrammetric Engineering and Remote Sensing 57, 1475–1486.

Peterson, A.T., Papes, M., Soberon, J., 2008. Rethinking receiver operating characteristic analysis applications in ecological niche modeling. Ecological Modelling 213, 63–72.

Phillips, S.J., Anderson, R.P., Schapire, R.E., 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190, 231–259.

R Development Core Team, 2008. R: A language and environment for statistical computing.

Roff, J.C., Taylor, M.E., 2000. National frameworks for marine conservation—a hierarchical geophysical approach. Aquatic Conservation-Marine and Freshwater Ecosystems 10, 209–223.

Roy, P.S., Ravan, S.A., Rajadnya, N., Das, K.K., Jain, A., Singh, S., 1995. Habitat suitability analysis of Nemorhaedus goral—a remote-sensing and geographic information-system approach. Current Science 69, 685–691.

Scott, J.M., Davis, F., Csuti, B., Noss, R., Butterfield, B., Groves, C., Anderson, H., Caicco, S., Derchia, F., Edwards, T.C., Ulliman, J., Wright, R.G., 1993. Gap analysis—a geographic approach to protection of biological diversity. Wildlife Monographs, 1–41.

Srinivas, V., Venugopal, P.D., Ram, S., 2008. Site occupancy of the Indian giant squirrel Ratufa indica (Erxleben) in Kalakad-Mundanthurai Tiger Reserve, Tamil Nadu, India. Current Science 95, 889–894.

Stockwell, D., Peters, D., 1999. The GARP modelling system: problems and solutions to automated spatial prediction. International Journal of Geographical Information Science 13, 143–158.

Stockwell, D.R.B., 1992. Machine Learning and the Problem of Prediction and Explanation in Ecological Modelling. Australian National University.

Stockwell, D.R.B., 1999. Genetic algorithms II. In: Fielding, A.H. (Ed.), Machine learning methods for ecological applications. Kluwer Academic Publishers, Boston, pp. 123–144.

WII, 2009. List of Protected Areas in India. Dehradun, Uttarakhand, India.

Wright, R.G., Maccracken, J.G., Hall, J., 1994. An ecological evaluation of proposed new conservation areas in Idaho—evaluating proposed Idaho national-parks. Conservation Biology 8, 207–216.

Zarri, A.A., Rahmani, A.R., Singh, A., Kushwaha, S.P.S., 2008. Habitat suitability assessment for the endangered Nilgiri Laughingthrush: a multiple logistic regression approach. Current Science 94, 1487–1494.