
Advancing Spatio-temporal Analysis of Ecological Data: Examples in R

Tomislav Hengl 1, Emiel van Loon 1, Henk Sierdsema 2, and Willem Bouten 1

1 Research Group on Computational Geo-Ecology (CGE), University of Amsterdam, Amsterdam, The Netherlands

[email protected], http://www.science.uva.nl/ibed-cge

2 SOVON Dutch Centre for Field Ornithology, Beek-Ubbergen, The Netherlands

Abstract. The article reviews the main principles of running geo-computations in ecology, as illustrated with case studies from the EcoGRID and FlySafe projects, and emphasizes the advantages of R as the most attractive programming/scripting environment. Three case studies (including R code) of interest to ecological applications are described: (a) analysis of GPS trajectory data for two gull species; (b) species distribution mapping in space and time for a bird species (sedge warbler; EcoGRID project); and (c) change detection using time-series of maps. The case studies demonstrate that R, together with its numerous packages for spatial and geostatistical analysis, is a well-suited tool to produce quality outputs (maps, statistical models) of interest in Geo-Ecology. Moreover, due to the recent implementation of the maptools and sp packages, such outputs can be easily exported to popular geographical browsers such as Google Earth. The key computational challenges recognized for Computational Geo-Ecology are: (1) solving the problem of input data quality (filtering techniques), (2) solving the problem of computing with large data sets, (3) improving over-simplistic statistical models, and (4) producing outputs at an increasingly high level of detail.

1 Introduction

Computational Geo-Ecology is an emerging scientific sub-field of Ecology that focuses on the development and testing of computational tools that can be used to extract spatio-temporal information on the dynamics of complex geo-ecosystems. It evolved as a combination of three scientific fields: (a) Ecology, as it focuses on interactions between species and abiotic factors; (b) Statistics, as it implies quantitative analysis of field and remote sensing data; and (c) Geoinformation Science, as all variables are spatially referenced and the outputs of analyses are commonly maps. The importance of this topic has been recognized at the Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, where a research group on Computational Geo-Ecology (CGE) has been established. It comprises about 20 researchers, PhD students and supporting staff, mainly with backgrounds in physical geography, computer science, ecology and geosciences.



[Figure 1 (diagram): NDFF observations of species (dragonflies, plants, fish, fungi, mollusca, mammals, butterflies, moss & lichens, birds), auxiliary data (geographical location, date, landscape, taxonomy, socio-economic data), metadata (taxonomy, lineage, contact information, data quality), base maps of ecological conditions (distance to man-made objects, distance to water and food supplies, land use, hydrology, climate, geology) and query parameters (species, period, area, type of analysis, outputs) feed the spatio-temporal analysis components (data mining, density estimation, geostatistical analysis, trend analysis and change detection, habitat mapping, error propagation, interactive visualization), which deliver summary statistics, distribution maps, change indices, biodiversity indices, home ranges, scenario testing and reports.]

Fig. 1. Workflow scheme and main components of the EcoGRID. See further some concrete case studies from the EcoGRID in Sec. 2.3 and 2.4.

The key objective of this group is to develop and apply computational tools1 that implement theoretical models of complex geo-ecosystems calibrated by field observations and remote sensing data, and that can be used to perform various tasks: from spatio-temporal data mining to analysis and decision making.

CGE is, at the moment, actively involved in two research projects: EcoGRID and ESA-FlySafe. EcoGRID (www.ecogrid.nl) is a national project currently being applied to support the functioning of the growing Dutch Flora and Fauna Database (NDFF), which contains about 20 million field records of more than 3000 species registered in the Dutch Species Catalogue (www.nederlandsesoorten.nl). EcoGRID aims at providing researchers, policy-makers and stake-holders with relevant information, including distribution maps, distribution change indices, biodiversity indices, and estimated outcomes of scenario-testing models [1]. To achieve this, a set of general analysis procedures is being implemented and tested, ranging from spatio-temporal data mining, density estimation, geostatistical analysis, trend analysis and change detection, to habitat mapping, error propagation and interactive visualization techniques (Fig. 1). EcoGRID is the Dutch segment of the recent pan-European initiative called "LifeWatch" (www.lifewatch.eu), which aims at building a very large infrastructure (virtual laboratories) to support sharing of knowledge and tools to monitor biodiversity over Europe.

1 By 'tools' we mainly refer to various software solutions: stand-alone packages, plug-ins/packages and toolboxes, software-based scripts, web-applications and computational schemes.


ESA-FlySafe is a precursor project to the Avian Alert initiative (www.avian-alert.eu), a potential Integrated Application Promotion (IAP) programme of the European Space Agency. CGE has already successfully implemented a national project called BAMBAS (www.bambas.ecogrid.nl), which is now used as a decision support tool by the Royal Netherlands Air Force to reduce the risk of bird-aircraft collisions [2]. The objective of FlySafe is to integrate multi-source data into a virtual laboratory, in order to provide predictions and forecasts of bird migration (bird densities, species structure, altitudes, vectors and velocities) at different scales in space and time [2].

This paper reviews the most recent activities of the CGE group, discusses limitations and opportunities of using various algorithms, and sets a research agenda for the coming years. This is all illustrated with a selection of real case studies, as implemented in the R computing environment. Our idea was not to produce an R tutorial for spatial data analysis, but to demonstrate some common processing steps and then emphasize the advantages of running computations in R.

2 Examples in R

2.1 Why R?

"From a period in which geographic information systems, and later geocomputation and geographical information science, have been agenda setters, there seems to be interest in trying things out, in expressing ideas in code, and in encouraging others to apply the coded functions in teaching and applied research settings."

Roger Bivand [3]

The three most attractive computing environments for developing and implementing the computational schemes used in Computational Geo-Ecology are R (www.r-project.org), MATLAB (www.mathworks.com) and Python (www.python.org). The first offers less support and instruction to beginners, the second has more basic utilities and is easier to use, and the third is the most popular environment for software development. Although all three are high-level languages with extensive user communities that interact and share code willingly, R seems to be the most attractive candidate for implementing the algorithms of interest to CGE. [3] recognizes three main opportunities for using R: (1) the vitality and high speed of development of R, (2) the academic openness of the developers and their willingness to collaborate, and (3) the increasing sympathy for spatial data analysis and visualization. Our main reasons for selecting R for our projects are:

– R supports various GIS formats via the rgdal package, including export of vector layers and plots to Google Earth (maptools package).

– R offers a much larger family of methods for spatio-temporal analysis (point pattern analysis, spatial interpolation and simulation, spatio-temporal trend analysis) than MATLAB.

– Unlike MATLAB, R is open-source software and hence does not require additional investment, and it is easy to install and update.


Several authors have recently drawn attention to new R packages for spatial data analysis. [4, 5] promote the gstat and sp packages, which together offer a variety of geostatistical analyses; [6, 3] review spatial data analysis packages in general, with a special focus on the maptools and GRASS packages, which have been established as the most comprehensive links between statistical and GIS computing; [7] presents the spatstat package for the analysis of point patterns. We should also add to this list: RSAGA, a link to the SAGA GIS; spsurvey, a package for spatial sampling; geoR, for geostatistical analysis; splancs, for spatial point pattern analysis; and the specialized ecological data analysis packages adehabitat [8], GRASP and BIOMOD, which support spatial prediction of point-sampled variables using GLM/GAMs and export to GIS. For an update on the most recent activities connected with the development of spatial analysis tools in R, you can at any time subscribe to the R-sig-Geo mailing list and witness the evolution.

A limitation of R is that it does not provide dynamically linked visualization and user-friendly data exploration. This might frustrate users who wish to zoom into spatial layers, visually explore patterns and overlay multiple layers. However, due to the recent implementation of the maptools, rgdal and sp packages, outputs of spatial/statistical analysis in R can be exported to free geographic browsers such as Google Earth. Google Earth is KML-based freeware that has, with its intelligent indexing of very large datasets combined with an open architecture for integrating and customizing new data, revolutionized the meaning of the word "geoinformation". By combining the computational power of R with the visualization possibilities of Google Earth, one obtains a complete system.

The following sections demonstrate the use of R scripting to perform various analyses, including the export to Google Earth. We are not able to display the complete scripts; instead we zoom into specific processing steps that might be of interest to researchers unfamiliar with R. For a detailed introduction to spatial analysis in R, please refer to the recent books by [9] and [10], and various lecture notes [11, 12].

2.2 Analysis of GPS Trajectory Data

The objective of this exercise is to analyze the movement of two gull species: the lesser black-backed gull (Larus fuscus), further referred to as LBG, and the European herring gull (Larus argentatus Pontoppidan), further referred to as HG. For the analysis, we use the GPS readings of receivers attached to a total of 23 individual birds. The birds were released on 1 June 2007 in the region of Vlieland, the Netherlands, and recordings were then collected until 24 October 2007. A map of the trajectories is shown in Fig. 2a. We are interested in where the gulls forage and rest, whether they follow specific paths, how fast they move over an area, and whether there is a relationship between activity centres and landscape.

We can import the raw table data to R using:

> gulls <- read.delim("gulls.txt")


This shows the following structure:

'data.frame': 13530 obs. of 6 variables:

$ BIRDID : Factor w/ 23 levels "ID41745","ID41747",..: 1 1 1 1 1 1 ...

$ LATITUDE : num 53.2 53.3 53.3 53.2 53.2 ...

$ LONGITUDE: num 4.93 4.95 4.96 5.00 5.04 ...

$ SPECIES : Factor w/ 2 levels "HG","LBG": 2 2 2 2 2 2 2 2 2 2 ...

$ SEX : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...

$ TIME : POSIXct, format: "2007-06-01 07:00:00" "2007-06-01 09:00:00"

We first want to calculate the velocity of the birds moving from one location to another. To do this, we need to reproject the geographical coordinates to a Cartesian system, so that the distances in the x and y directions are equal. We start by attaching the coordinates and the coordinate system (sp package):

> library(sp)
> coordinates(gulls) <- ~LONGITUDE+LATITUDE
> proj4string(gulls) <- CRS("+proj=longlat +datum=WGS84")

which will coerce the table into a SpatialPointsDataFrame, i.e. an R spatial layer. Now we can transform the coordinates to the European Terrestrial Reference System (www.euref.eu) using:

> gulls.laea <- spTransform(gulls, CRS("+init=epsg:3035"))

where spTransform is the rgdal method to reproject the coordinates and epsg:3035 is the European Petroleum Survey Group (EPSG) registry code (see www.epsg-registry.org) of the ETRS coordinate system. Once the coordinates are in a metric system, we can derive distances and velocities (in km/h) from point to point by dividing the distance vector by the time difference. This attaches to each point location an estimate of the velocity of an individual bird.
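For example, per-point velocities can be derived along the lines of the following sketch. This is not the authors' original code; it assumes the table is sorted by BIRDID and TIME, so that consecutive rows of the same bird form the movement steps:

> xy <- coordinates(gulls.laea)
> step.m <- c(NA, sqrt(diff(xy[,1])^2 + diff(xy[,2])^2))   # step length in metres
> step.h <- c(NA, as.numeric(diff(gulls.laea$TIME), units="hours"))   # time step in hours
> new.bird <- c(FALSE, diff(as.integer(gulls.laea$BIRDID)) != 0)   # steps crossing two birds
> step.m[new.bird] <- NA
> gulls.laea$VELOC <- (step.m/1000)/step.h   # velocity in km/h

This yields the VELOC column used in the subsequent steps.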

At this stage, we want to separate the analyses for the two species. This can be done in a few steps, e.g. for the LBG species:

> gulls.laea$LBG <- ifelse(gulls.laea$SPECIES=="LBG",
     ifelse(is.na(gulls.laea@data$VELOC), NA, T), NA)

> LBG <- !is.na(gulls.laea$LBG)

We proceed with interpolating the velocities estimated at the point locations. For this, we can use the gstat package [4]. The variogram can be visualized and fitted using:

> library(gstat)

> LBG.points <- remove.duplicates(gulls.laea[LBG,], zero=0.0, remove.second=TRUE)

> plot(variogram(log1p(log1p(VELOC))~1, LBG.points, cutoff=40000))

> LBG.ovar <- variogram(log1p(log1p(VELOC))~1, LBG.points, cutoff=40000)

> LBG.ovgm <- fit.variogram(LBG.ovar, vgm(0.4, "Exp", range=30000, 0.5))

where remove.duplicates is the function to remove duplicate points and prevent the package from running into computational problems,


Fig. 2. Spatio-temporal analysis of GPS trajectory data for the lesser black-backed gull (Larus fuscus): (a) observed trajectories for a total of 14 individual birds; (b) interpolated velocities; (c) Habitat Suitability Index derived using mean annual EVI, topographic wetness index and a night lights image; (d) 3D kernel density aggregated for the period June–August 2007.

variogram is the gstat command to calculate semivariances for a given target variable (VELOC), log1p is the log transformation (a double log in this case), fit.variogram is the gstat command used to fit the variogram using re-weighted least squares, and cutoff is the maximum distance of interest. Once we have fitted the variogram, we


can interpolate the values over the whole area of interest using, e.g., ordinary kriging:

> LBG.g <- gstat(id=c("VELOC"), formula=log1p(log1p(VELOC))~1,
     data=LBG.points, model=LBG.ovgm)

> LBGspeed.OK <- predict.gstat(object=LBG.g, newdata=maskLBG, nmax=60)

where gstat is the generic function used to define prediction and simulation objects, maskLBG is a SpatialGridDataFrame defining the prediction locations, and nmax=60 is the maximum number of nearest observations used for each prediction. The resulting map can be seen in Fig. 2b.
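Because the velocities were interpolated on a double-log1p scale, a back-transformation is needed before mapping them in km/h. A minimal sketch, not shown in the original text, assuming gstat's default output column name VELOC.pred:

> LBGspeed.OK$VELOC.km_h <- expm1(expm1(LBGspeed.OK$VELOC.pred))   # undo log1p twice
> spplot(LBGspeed.OK["VELOC.km_h"], col.regions=bpy.colors())      # map of velocities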

We are next interested in how the movement of the gulls is connected with environmental conditions, i.e. relief, urban development and the landscape in general. For this, we use three 1 km maps: (1) a mean annual EVI image (pcevi1lbg) representing the mean biomass over a terrain, (2) the topographic wetness index derived from the 1 km DEM (twilbg) representing relief, and (3) a night lights image (nlightslbg) representing urban development [11]. These can be loaded into R via the adehabitat package:

> library(adehabitat)
> demlbg.asc <- import.asc("demlbg.asc")
> pcevi1lbg.asc <- import.asc("pcevi1lbg.asc")
> twilbg.asc <- import.asc("twilbg.asc")
> nlightslbg.asc <- import.asc("nlightslbg.asc")
> LBG.maps <- as.kasc(list(dem=demlbg.asc, pcevi1lbg=pcevi1lbg.asc,
     twilbg=twilbg.asc, nlightslbg=nlightslbg.asc))

The maps and the locations where the LBG species has been observed can be packed together and used to run Ecological Niche Factor Analysis [8]:

> LBG.hab <- data2enfa(LBG.maps, gulls.laea[LBG,]@coords)
> enfa.LBG <- enfa(dudi.pca(LBG.hab$tab, scannf=FALSE), LBG.hab$pr, nf=2)
> LBG.dist <- predict(enfa.LBG, LBG.hab$index, LBG.hab$attr)

where data2enfa combines the grids and observation locations, enfa is the method to run the Ecological Niche Factor Analysis, and LBG.dist contains the output distances from the barycentre of the niche [8]. The final Habitat Suitability Index (0–100%) can be seen in Fig. 2c. It shows that, based on these trajectory data, LBG birds systematically avoid mountain chains and big urban areas.
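One simple way to rescale the niche distances to a 0–100% suitability index is a linear stretch; the following is an illustrative sketch only, and the authors' exact rescaling may differ:

> LBG.hsi <- 100*(1 - (LBG.dist - min(LBG.dist, na.rm=TRUE))/
     (max(LBG.dist, na.rm=TRUE) - min(LBG.dist, na.rm=TRUE)))   # large distance = low suitability
> image(LBG.hsi, col=rev(heat.colors(12)))                      # quick visual check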

Using the package splancs [13, 3], we can also estimate the space-time kernel density for different time intervals. A space-time (3D) kernel filter can be run by defining the coordinates of the points and their time reference (x, y, z), the grid of interest, and the search radius:

> LBG.densnoTime <- kernel3d(pts=LBG.points@coords, times=LBG.points$CTIME,
     xgr=seq(maskLBG@bbox["x","min"], maskLBG@bbox["x","max"],
     maskLBG@grid@cellsize[1]), ygr=seq(maskLBG@bbox["y","min"],
     maskLBG@bbox["y","max"], maskLBG@grid@cellsize[2]),
     zgr=seq(3608,8000,72), hxy=20000, hz=168)

which will produce 61 maps of kernel smoother densities for 3-day periods. The output is basically a space-time cube, i.e. a series of 61 grid maps. Note that such


calculations can be time-consuming; we therefore recommend testing the script on relatively small data sets first, and then proceeding with the real case studies. The summary density map estimated using the kernel3d method, aggregated over the period June–August, is shown in Fig. 2d.
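Such an aggregated map can be derived by summing the space-time cube over its time dimension. A sketch, assuming that the density array is stored in the $v component of the kernel3d output:

> LBG.dens2D <- apply(LBG.densnoTime$v, c(1,2), sum)   # collapse the time dimension
> image(LBG.dens2D, col=rev(heat.colors(20)))          # aggregated density surface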

2.3 Generation of Distribution Maps

The following section demonstrates how to connect to an on-line ecological database (ecogrid.nl), run queries, generate distribution maps and export them to geographical browsers such as Google Earth. We start by setting up a new ODBC connection on our Windows machine2, and then connecting to it from R using the RODBC package:

> library(RODBC)

> ecogrid.conn <- odbcConnect(dsn="ecogrid.nl", connection="sovon-ecogrid",
     case="postgresql")

which will create an R definition of the connection. We can now run a query, e.g. to fetch all observations of breeding pair counts of the bird species sedge warbler (Acrocephalus schoenobaenus):

> Acrocephalus.tbl <- sqlQuery(ecogrid.conn, query=paste("SELECT o.countmin,
     o.countmax, o.counttype, x(centroid(l.the_geom)), y(centroid(l.the_geom)),
     o.timestart, o.timestop, o.timetype FROM survey.observations o,
     survey.locations l, taxonomy.taxa t, taxonomy.taxa p WHERE o.locid=l.locid
     AND o.taxid=t.taxid AND t.parent_id=p.taxid AND p.taxon ILIKE
     'Acrocephalus' AND t.taxon ILIKE 'schoenobaenus';"))

where o.countmin and o.countmax are the observed counts of breeding pairs, x(centroid(l.the_geom)) and y(centroid(l.the_geom)) are the coordinates of the centres of the observation plots, and o.timestart and o.timestop are the start and end times of the observation. As a result of the query, we obtain 16,028 observations (Fig. 3a):

> str(Acrocephalus.tbl)
'data.frame': 16028 obs. of 8 variables:
 $ countmin : int 0 0 0 0 0 0 0 0 5 8 ...
 $ countmax : int 0 0 0 0 0 0 0 0 5 8 ...
 $ counttype: Factor w/ 1 level "n": 1 1 1 1 1 1 1 1 1 1 ...
 $ x        : num 115663 115663 115663 115663 115663 ...
 $ y        : num 425971 425971 425971 425971 425971 ...
 $ timestart: POSIXct, format: "1984-04-20" "1985-04-20" ...
 $ timestop : POSIXct, format: "1984-08-16" "1985-08-16" ...
 $ timetype : Factor w/ 1 level "f": 1 1 1 1 1 1 1 1 1 1 ...

2 Under Control panel → Administrative tools → ODBC Data Source Administration → Add, then enter the server address, port, username and password used to connect to the server.


Fig. 3. Automated generation of distribution maps: (a) observed number of breeding pairs of Acrocephalus schoenobaenus in the year 2000; (b) the same data shown in Google Earth; (c) breeding pair densities for the years 1984, 1992 and 2000 derived using regression-kriging over a 1 km grid.

Before we can proceed with the generation of the distribution maps, we need to convert the time-coordinates to a linear system. For example, we can use the cumulative number of days since 1970-01-01:

> Acrocephalus.tbl$ctime <- floor(unclass(Acrocephalus.tbl$timestart)/86400) +
     ( floor(unclass(Acrocephalus.tbl$timestop)/86400) -
     floor(unclass(Acrocephalus.tbl$timestart)/86400) )

Here we use the command unclass to get the time as a numeric vector3. For example, the date-time value 1987-04-20 (W. Europe Standard Time) corresponds to a value of 6317 (days since 1970-01-01).
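This can be checked directly in R (a quick illustration, not part of the original script):

> floor(as.numeric(as.POSIXct("1987-04-20", tz="CET"))/86400)   # days since 1970-01-01
[1] 6317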

3 unclass converts the POSIXct values to seconds since the beginning of 1970; dividing by 86400 gives days. This way we can run statistical analyses with such data.


Next, we import the predictor maps of the Netherlands that can be used to map the distribution of this bird:

> gridmaps <- readGDAL("dheight.asc")
> names(gridmaps)[[1]] <- "dheight"
> gridmaps$dtm <- readGDAL("dtm.asc")$band1
> gridmaps$freat1 <- readGDAL("freat1.asc")$band1
> gridmaps$lgn3dsee <- readGDAL("lgn3dsee.asc")$band1
> gridmaps$sltdch1 <- readGDAL("sltdch1.asc")$band1
> gridmaps$t10nhuis <- readGDAL("t10nhuis.asc")$band1
> gridmaps$t10ntree <- readGDAL("t10ntree.asc")$band1
> proj4string(gridmaps) <- CRS("+init=epsg:28992")

where dheight is a layer showing the height of the canopy, dtm is the LiDAR-derived Digital Elevation Model, freat1 is a map showing the duration of drainage, lgn3dsee is the distance from the coast line, sltdch1 is the density of the primary water courses, t10nhuis is the density of buildings from the 1:10k topographic maps, t10ntree is the density of trees, and epsg:28992 is the EPSG code of the Dutch coordinate system. Note that this gridmaps data set is fairly large, as each layer consists of 910,000 grid cells.

We can select a specific year/period and subset the original data set, e.g. to the year 2000:

> Acrocephalus.2000 <- subset(Acrocephalus.tbl,
     as.integer(format(Acrocephalus.tbl$timestop, "%Y"))=="2000")

and plot the values of bird counts using (Fig. 3a):

> coordinates(Acrocephalus.2000) <- ~x+y

> bubble(Acrocephalus.2000["count"], scales=list(draw=TRUE),
     main="Acrocephalus schoenobaenus in 2000",
     sp.layout=list("sp.lines", col="black", NLborders))

The point maps (and also lines and polygons) can be exported and viewed in Google Earth using the writeOGR command:

> Acrocephalus.2000.latlong <- spTransform(Acrocephalus.2000, CRS("+proj=longlat"))

> writeOGR(Acrocephalus.2000.latlong["count"], "Acrocephalus_2000.kml",
     "Acrocephalus_schoenobaenus2000", "KML")

which will produce an image as shown in Fig. 3b. A list of distribution maps can be generated using the regression-kriging technique, as implemented in the gstat package [4, 11]. We first need to overlay the rasters and points and attach the values of the predictors to the original data frame. To speed up data processing, we can also loop the operations: linear regression, then step-wise regression, then variogram modelling, and finally interpolation using regression-kriging, as sketched below. Note that, because all operations are automated, you might get some strange results, either in the whole map or at specific grid nodes, so it is a good idea to inspect/visualize all fitted models and output maps before you proceed with the interpretation of the final outputs.
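A minimal sketch of one such regression-kriging step for a single year is given below. This is not the authors' exact script; it assumes a count column as used above and no missing predictor values at the observation points, and the authors' script loops this over all years:

> library(gstat)
> proj4string(Acrocephalus.2000) <- CRS("+init=epsg:28992")
> ov <- over(Acrocephalus.2000, gridmaps)               # predictor values at the points
> Acrocephalus.2000@data <- cbind(Acrocephalus.2000@data, ov)
> lm.step <- step(lm(count ~ dheight + dtm + freat1 + lgn3dsee + sltdch1 +
     t10nhuis + t10ntree, data=Acrocephalus.2000@data)) # step-wise regression
> Acrocephalus.2000$res <- residuals(lm.step)           # regression residuals
> vgm.res <- fit.variogram(variogram(res~1, Acrocephalus.2000),
     vgm(1, "Exp", range=10000, nugget=0.5))            # residual variogram
> rk.Acrocephalus.2000 <- krige(formula(lm.step), Acrocephalus.2000, gridmaps,
     model=vgm.res)                                     # regression-kriging over the 1 km grid

The prediction column (var1.pred) can then be renamed to count, giving the rk.Acrocephalus.<year> layers used for the trend analysis in Sec. 2.4.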

The final results of the interpolation (Fig. 3c) show that the density of Acrocephalus schoenobaenus in the Netherlands has been increasing over the last 10–15 years. How distinct this change in breeding pair counts is will be answered in the following exercise.

2.4 Trend Analysis and Change Detection

In this last example we demonstrate how robust linear regression can be used to map the rate of change using a time-series of distribution maps. For this we use a series of 17 distribution maps derived in the previous exercise: Acrocephalus schoenobaenus in the Netherlands from 1984 through 2000 (Fig. 3c). We run the analysis by selecting only the grid nodes with an attached value:

> count01 <- rk.Acrocephalus.1984[mask,]$count
> count02 <- rk.Acrocephalus.1985[mask,]$count
...
> count17 <- rk.Acrocephalus.2000[mask,]$count

where mask is the selection of grid nodes in the area1km map that are not undefined. The distribution values can be packed together into a new data frame:

> counts <- cbind(count01, count02, ... , count17)
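Instead of typing the 17 assignments by hand, the matrix can also be built programmatically (a sketch, assuming the yearly rk.Acrocephalus.<year> objects are present in the workspace):

> years <- 1984:2000
> counts <- sapply(years, function(yr)
     get(paste("rk.Acrocephalus.", yr, sep=""))[mask,]$count)   # one column per year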

To view how the values change at an individual pixel, we can use:

> plot(counts[1112,])
> abline(coef(line(1:17, counts[1112,])))

which will produce a plot as shown in Fig. 4b. This shows a clear increase in breeding pair counts over the period 1984–2000.

Next we proceed with fitting the trend models for each pixel in the map. First, we make an empty data frame that will later be filled with the fitted values:

> linefits <- as.data.frame(rep(0, length(counts[,1])), optional=TRUE)
> linefits$beta0 <- rep(0, length(counts[,1]))
> linefits$beta1 <- rep(0, length(counts[,1]))
> linefits$residual <- rep(0, length(counts[,1]))
> linefits[1] <- NULL

and then run a loop that fits a robust linear regression model for each pixel and copies each result to the linefits data frame:

> for (i in 1:length(counts[,1])) {
     assign(paste("line",i,sep=""), line(1:17, counts[i,]))
     linefits$beta0[[i]] = get(paste("line",i,sep=""))$coefficients[1]
     linefits$beta1[[i]] = get(paste("line",i,sep=""))$coefficients[2]
     linefits$sumres[[i]] = sum((get(paste("line",i,sep=""))$residuals)^2)
   }

where beta0 is the intercept coefficient, beta1 is the slope coefficient, and sumres is the sum of squared residuals. The last parameter will be used to quantify the


Fig. 4. Trend analysis and change detection: (a) absolute change in the distribution of Acrocephalus schoenobaenus breeding pairs over the period 2000−1984; (b) example of distribution dynamics at a specific (grid node) location; (c) mapped slope (beta1) for a robustly-fitted linear model for the period 1984–2000, where positive values indicate an increase and negative values a decrease in counts; (d) delineated areas where the rate of change is >0.3.

quality of the fit: if sumres tends to zero the trend is obvious, whereas as sumres → ∞ the trend becomes less reliable. Note that, in this case study, there are only 43,207 grid nodes with values (out of 91,000), but the processing can still take up to 30 minutes on a standard PC.

We further convert the raster map to a point map, so that the grid nodes in linefits can be exported to a viewer such as Google Earth:

> counts.ll = spTransform(rk.Acrocephalus.1984, CRS("+proj=longlat"))

and then copy the fitted parameters (beta0, beta1 and sumres) to the counts.ll SpatialPointsDataFrame:


> counts.ll$beta0 = linefits$beta0
> counts.ll$beta1 = linefits$beta1
> counts.ll$sumres = linefits$sumres

Exporting raster maps from R to Google Earth is somewhat more complicated, because we first need to create a grid in the longlat coordinate system. We start by determining the width correction factor4 based on the latitude of the centre of the study area:

> corrf <- (1+cos((counts.ll@bbox["y","max"]+counts.ll@bbox["y","min"])/2*pi/180))/2

and then estimate the grid cell size in arcdegrees:

> geogrd.cell <- corrf*(counts.ll@bbox["x","max"]-counts.ll@bbox["x","min"])/
     counts@grid@cells.dim[1]

which gives a cell size of 0.0113152 arc-degrees. Now we can generate a new grid definition using the spsample method of the sp package:

> geoarc <- spsample(counts.ll, type="regular",
     cellsize=c(geogrd.cell, geogrd.cell))
> gridded(geoarc) <- TRUE
> gridparameters(geoarc)
   cellcentre.offset   cellsize cells.dim
x1          3.316779 0.01131520       347
x2         50.752184 0.01131520       251

which shows that the new grid will have approximately the same number of grid nodes as the original map in the Dutch coordinate system (87,348 compared to 91,000 pixels). The further steps needed to generate a PNG of an R plot and then export it to KML are explained in [11]; a sketch is given below. The fitted values of beta1, visualized in the Google Earth viewer, are shown in Fig. 4c. Fig. 4d shows locations where beta1 > 0.3, which are typical maps of interest for decision making.
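A minimal sketch of that PNG/KML export, using the GE_SpatialGrid and kmlOverlay helpers from maptools (file names and plotting parameters here are illustrative, not the authors' exact settings):

> library(maptools)
> GE.grd <- GE_SpatialGrid(counts.ll)              # PNG frame matching the bounding box
> png(file="beta1.png", width=GE.grd$width, height=GE.grd$height, bg="transparent")
> par(mar=c(0,0,0,0), xaxs="i", yaxs="i")
> plot(counts.ll, pch=15, cex=0.3, col=bpy.colors(5)[cut(counts.ll$beta1, 5)],
     xlim=GE.grd$xlim, ylim=GE.grd$ylim)
> dev.off()
> kmlOverlay(GE.grd, kmlfile="beta1.kml", imagefile="beta1.png", name="beta1")

Opening beta1.kml in Google Earth then drapes the image over the study area.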

3 Discussion and Conclusions

The case studies listed above demonstrate that the R computing environment is a well-suited tool to produce quality outputs (maps, statistical models) of interest in Geo-Ecology. In principle, all operations listed above are completely automated. This allows us to combine various operations, ranging from general point pattern analysis and geostatistics to habitat suitability mapping, via R scripting, and to develop complex automated mapping frameworks. Moreover, due to the

4 For datasets in geographical coordinates, a cell size correction factor can be estimated as a function of the latitude and the spacing at the equator: Δx_metric = F · cos(ϕ) · Δx0_degree, where Δx_metric is the East/West grid spacing estimated for a given latitude (ϕ), Δx0_degree is the grid spacing in degrees at the equator, and F is the empirical constant used to convert from degrees to metres [14].


recent implementation of the maptools and sp packages, such outputs can be easily exported to popular geographical browsers such as Google Earth and shared with the wider community [3, 10].

Automated mapping and interactive data exploration have completely changed the perspective of what is possible in Computational Geo-Ecology. In addition, the outputs of such analyses add significant value to (dynamic) Geographical Information Systems used for the analysis of patterns and processes of (geo-)ecosystems [15]. However, there are also a number of research topics that will need to be tackled in the coming years. These are the key ones:

– Spatio-temporal visualization and data mining: The largest share of the tools developed for CGE applications are basically visualization and data mining tools [16]. When such a tool is being developed, a range of research questions needs to be answered: how does a certain tool help users complete various data mining tasks, e.g. analyzing dependencies, detecting outliers, discovering trends, visualizing uncertainties? How well does it generalize spatio-temporal patterns, and how easy is it to zoom into the data? How accurate are the final outputs?

– Automated mapping and change detection: Because the quantity of both field and remote sensing data in ecology is increasing exponentially, it is also increasingly important to work with algorithms that do not require (much) human labour/intervention. Automation is especially important to be able to generate large numbers of target variables over dense time-intervals, and to rapidly detect changes in ecosystems.

– Multi-scale data integration: The input data that feed the CGE models often come with large differences in temporal and spatial support size and effective scale. On the other hand, there are many benefits to running analyses that take into account all possible correlations and dependencies. Can multi-scale/multi-source data be automatically filtered and integrated, and how can this be achieved?

– Modelling and management of uncertainties: It is increasingly important to accompany a data analysis report with a summary of the uncertainty budget. Such an analysis then allows us to distinguish between conceptual (model) errors, data (survey) errors and natural variation, i.e. between the true spatio-temporal patterns and artefacts/noise. In many cases, information about the inherent uncertainties in the input data can be used to adjust or filter the data accordingly, pick the right effective scale, and generalize/downscale where necessary.

– Implementation of algorithms and software development: The quality of computational frameworks becomes apparent when they achieve implementation in applied fields, especially outside their fields of origin [3]. Here a range of issues needs to be addressed: how many operations does a programming language accommodate? What is the processing speed of the software? How compatible is it with various GIS formats (vector, raster)? How compatible is it with various environmental applications? How easy to use will it be? Who will maintain the software and provide support?


Our special focus in the coming years will be the development of automated spatio-temporal analysis algorithms that can be used to generate interactive (Google Earth-compatible) visualizations from large quantities of field and remote sensing data in near real-time. Although the tooling is already there, our experience is that there are still many challenges to be solved in the coming years:

• Solving the problem of low-quality input data (field observations): This includes low precision of spatial referencing (size of the plots), imprecise quantities/counts, (mis)classification errors, preferential sampling (complete omission of some areas), etc. At this moment it is impossible to foresee how these inherent uncertainties (biased sampling, species classification errors, location errors, poor spatial/temporal coverage, etc.) will affect the final outputs, but it is on our agenda to report on this in the coming years.

• Solving the problem of computing with large data sets: The Dutch National Database of Flora and Fauna contains observations of about 3000 species, collected over 25 years at many thousands of locations. To produce maps using such a large quantity of data, automated mapping tools will need to be developed. In addition, in order to be able to generate maps in near real-time, super-computing will become unavoidable.

• Improving the over-simplistic statistical models: There are still fundamental statistical issues that need to be answered. For example, R currently does not support a combination of non-linear regression models and geostatistics5. This area of geostatistics is still fairly speculative and fresh, so we can expect much development in the coming years [15]. We can only agree with [15]: better predictions of the geographical distributions of organisms and of the effects of impacts on biological communities can emerge only from more robust species distribution models.

• Producing outputs at an increasingly high level of detail: The level of detail required by decision-makers is increasingly high. This again asks for more powerful, faster and more robust statistical models. A question remains whether there are ways to make predictions at fine resolution using more effective computations.

The gull data was provided by Bruno Ens (SOVON, the Netherlands) and Michael Exo (Institute of Avian Research, Germany). This project is made possible in part by the European Space Agency FlySafe initiative. The EcoGRID project is carried out in the context of the Virtual Laboratory for e-Science project (www.vl-e.nl). This project is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science and the Dutch Ministry of Agriculture, Nature and Food Quality.

5 Fitting a GLGM (generalized linear geostatistical model) is possible in the geoRglm package, but it requires two steps: fitting a model without correlation and then modelling the residuals.


References

[1] Shamoun, J.Z., Sierdsema, H., van Loon, E.E., van Gasteren, H., Bouten, W., Sluiter, F.: Linking Horizontal and Vertical Models to Predict 3D + time Distributions of Bird Densities. In: International Bird Strike Committee, Athens, p. 12 (2005)

[2] Van Belle, J., Bouten, W., Shamoun-Baranes, J., van Loon, E.E.: An operational model predicting autumn bird migration intensities for flight safety. Journal of Applied Ecology 11, 864–874 (2007)

[3] Bivand, R.: Implementing Spatial Data Analysis Software Tools in R. Geographical Analysis 38, 23–40 (2006)

[4] Pebesma, E.J.: Multivariable geostatistics in S: the gstat package. Computers & Geosciences 30(7), 683–691 (2004)

[5] Pebesma, E.J., Bivand, R.S.: Classes and methods for spatial data in R. R News 5(2), 9–13 (2005)

[6] Bivand, R.S.: Interfacing GRASS 6 and R. Status and development directions. GRASS Newsletter 3, 11–16 (2005)

[7] Baddeley, A., Turner, R.: Spatstat: an R package for analyzing spatial point patterns. Journal of Statistical Software 12(6), 1–42 (2005)

[8] Calenge, C.: The package "adehabitat" for the R software: A tool for the analysis of space and habitat use by animals. Ecological Modelling 197(3–4), 516–519 (2006)

[9] Waller, L.A., Gotway, C.A.: Applied Spatial Statistics for Public Health Data, p. 520. Wiley, Hoboken (2004)

[10] Bivand, R., Pebesma, E., Rubio, V.: Applied Spatial Data Analysis with R. Use R Series, p. 400. Springer, Heidelberg (2008)

[11] Hengl, T.: A Practical Guide to Geostatistical Mapping of Environmental Variables. EUR 22904 EN. Office for Official Publications of the European Communities, Luxembourg, p. 143 (2007)

[12] Rossiter, D.G.: Introduction to the R Project for Statistical Computing for use at ITC. International Institute for Geo-information Science & Earth Observation (ITC), Enschede, Netherlands, p. 136 (2007)

[13] Rowlingson, B., Diggle, P.: Splancs: spatial point pattern analysis code in S-Plus. Computers & Geosciences 19, 627–655 (1993)

[14] Guth, P.L.: Slope and aspect calculations on gridded digital elevation models: Examples from a geomorphometric toolbox for personal computers. Zeitschrift für Geomorphologie 101, 31–52 (1995)

[15] Scott, J.M., Heglund, P.J., Morrison, M.L.: Predicting Species Occurrences: Issues of Accuracy and Scale. Habitat (Ecology), p. 840. Island Press, Washington, DC (2002)

[16] Compieta, P., Di Martino, S., Bertolotto, M., Ferrucci, F., Kechadi, T.: Exploratory spatio-temporal data mining and visualization. Journal of Visual Languages and Computing 18(3), 255–279 (2007)