Situating Ecology as a Big-Data Science - Oxford Academic

Overview Articles

https://academic.oup.com/bioscience August 2018 / Vol. 68 No. 8 • BioScience 563

Situating Ecology as a Big-Data Science: Current Advances, Challenges, and Solutions

SCOTT S. FARLEY, ANDRIA DAWSON, SIMON J. GORING AND JOHN W. WILLIAMS

Ecology has joined a world of big data. Two complementary frameworks define big data: data that exceed the analytical capacities of individuals or disciplines or the “Four Vs” axes of volume, variety, veracity, and velocity. Variety predominates in ecoinformatics and limits the scalability of ecological science. Volume varies widely. Ecological velocity is low but growing as data throughput and societal needs increase. Ecological big-data systems include in situ and remote sensors, community data resources, biodiversity databases, citizen science, and permanent stations. Technological solutions include the development of open code- and data-sharing platforms, flexible statistical models that can handle heterogeneous data and sources of uncertainty, and cloud-computing delivery of high-velocity computing to large-volume analytics. Cultural solutions include training targeted to early and current scientific workforce and strengthening collaborations among ecologists and data scientists. The broader goal is to maximize the power, scalability, and timeliness of ecological insights and forecasting.

Keywords: big data, cloud computing, ecoinformatics, open data, scalability

Ecology, like other branches of the biological and Earth sciences, has entered a world of big data, and

is confronting the attendant opportunities and challenges. Rates of ecological data generation, aggregation, and inter-pretation are increasing on many fronts, with rapid growth in data volumes, methods of data collection, and new analytical and computational approaches. Areas of large and growing ecological data streams include (a) the continual delivery of petabytes of data by remote sensors on Earth observing systems; (b) the aggregation of individual scientific observa-tions and experiments into larger curated community data resources (e.g., AmeriFlux, Global Biodiversity Information Facility, NutNet, and Neotoma); (c) investment in long-term ecological monitoring networks at national to continental scales (e.g., LTER, NEON, and CZO); (d) the deployment of automated and inexpensive sensor networks (e.g., phe-nology cameras, wildlife camera traps, and temperature loggers); and (e) citizen-science initiatives. The growth and aggregation of these data streams are exciting, because they create new opportunities to study ecological systems at high resolution and broad scales, better understand underlying processes, and improve ecological forecasting (Dietze 2017).

Ecological and ecoinformatic trends are situated within similar, larger societal trends. Worldwide data volume doubled nine times between 2006 and 2011, with exponen-tial growth continuing this decade (Chen et al. 2014); this

growth has outpaced the annual doubling in computing power predicted by Moore’s law (Olofson and Eastwood 2014). Genomics data volumes are doubling every 18 months (GenBank; www.ncbi.nlm.nih.gov/genbank/statistics). Big data have been embraced across the sciences, such as cli-mate modeling and analytics (Schnase et al. 2017), genomics (Howe et al. 2008), fluid dynamics (Pollard et al. 2016), and high energy and particle physics (Britton and Lloyd 2014). The data sciences are growing rapidly, with crosscutting opportunities in the private sector, government, and aca-demia (Blei and Smyth 2017). Therefore, this data growth and innovation, both inside ecology and beyond, creates a fertile ground for ecology to move toward its fundamental mission of understanding the interactions among all organ-isms and their environments across all scales and using this information in service of society.

However, big data present an array of challenges related to large data volumes, high data heterogeneity (variety), varying quality and uncertainty (veracity), and a need for timely information (velocity; Yang and Huang 2013, Chang 2015, LaDeau et al. 2017). Variety and veracity are the defin-ing challenges in ecoinformatics, given the heterogeneity of ecological data, practitioners, and ecosystems, but volume and velocity are growing rapidly. None of these challenges are conceptually new to ecologists, but the growth of data in all four dimensions challenges traditional approaches to

BioScience 68: 563–576. © The Author(s) 2018. Published by Oxford University Press on behalf of the American Institute of Biological Sciences. All rights reserved. For Permissions, please e-mail: [email protected]. doi:10.1093/biosci/biy068

Dow

nloaded from https://academ

ic.oup.com/bioscience/article/68/8/563/5049569 by guest on 11 August 2022

Overview Articles

564 BioScience • August 2018 / Vol. 68 No. 8 https://academic.oup.com/bioscience

data management and analysis. Solutions that work well at small scales (e.g., sharing spreadsheets by email, run-ning analyses on local computers, manually entering data, and basing veracity on personal reputation) may not scale up. Other challenges include complex relationships among highly varied ecological datasets that require sophisticated data models (Williams et al. 2018), computational costs of macroscale ecological forecasting (Dietze 2017), network bandwidth limits, and the need for flexible systems that support multiple and novel data uses (Lynch 2008, Schnase et al. 2017).

New solutions are arising to these challenges, both home-grown within ecology and drawn from other disciplines. Centers such as NCEAS, NESCent, or SESYNC have fos-tered a culture of data sharing and synthesis by showing how these collaborations can enable answers to previously inaccessible research questions and training a generation of intersectional ecologists and data scientists (Hampton and Parker 2011, Baron et al. 2017). Ecoinformatics (Michener and Jones 2012) is an established, growing, and active sub-discipline in ecology, with subfields including biodiversity informatics (Guralnick et al. 2007), paleoecoinformatics (Brewer et al. 2012), and open software initiatives (Boettiger et al. 2015). Systems are being actively developed to openly share data, methods, standards, and software, including col-laborative development environments such as GitHub and the Open Science Framework (https://osf.io), application programming interfaces (APIs) that expose scientific data resources to all, documented open scientific workflows, and linked data systems founded on unique resource identifiers. Advances in machine learning enable analysis of large data volumes, whereas Bayesian hierarchical models provide flexible analytical frameworks for integrating high-variety ecological data (Clark et al. 2016) and assessing multiple sources of uncertainty (Dietze 2017). Cloud computing has emerged as a flexible and configurable supply of on-demand computing services for high-velocity applications (Armbrust et al. 2009, Foster et al. 2016), complementing traditional institutional computing resources.

Several recent papers have called for ecology to become a big-data science and for ecologists to improve data-sharing practices for both pragmatic and ethical reasons (Hampton et al. 2013, Soranno et al. 2014); others have laid the con-ceptual and methodological foundations for ecoinformatics (Michener et al. 1997, Jones et al. 2006). Here, we posit that much of ecology is already a big-data science and review recent developments, challenges, and solutions. Because the phrase big data is often invoked but not always defined (but see LaDeau et al. 2017), we first review two frameworks for conceptualizing big data and its various dimensions (Manyika et al. 2011, Chang 2015). We then identify six areas where ecological data are big and growing, show how each maps onto different dimensions of big data, and note associated challenges. We review progress in developing solutions to these big-data challenges, focusing on (a) the growth of open data, software, and the cyberinfrastructure

to support them; (b) advances in quantitative approaches, especially in Bayesian statistics and machine learning; and (c) the advent of flexible cloud-computing platforms that allow on-demand provisioning. Finally, we call for the adop-tion of open-science approaches and other best practices, new training initiatives for both the next-generation and current scientific workforce, and building closer collabora-tions among ecological researchers, ecoinformaticists, and computer scientists. The ultimate goal is to develop, share, and adopt flexible and scalable systems for maximizing the ecological insights possible from big ecological data in their many varieties.

Defining big data: Two frameworksTwo prominent frameworks exist to characterize the “big-ness” of data. The first assesses data relative to the ability of an individual or discipline to manage and analyze the data, whereas the second establishes four dimensions of data bigness.

The first framework, developed by the National Institute of Standards and Technology, defines big data as data that either “exceeds the capacity or capability of current or con-ventional methods and systems” (Ward and Barker 2013) or consist of multiple datasets so large and complex that they become awkward to work with using existing methods (Snijders et al. 2012, Chen et al. 2014, Chang 2015). Similarly, Apache Hadoop, a popular distributed computing platform, describes big data in computational terms as “datasets which could not be captured, managed, and processed by general computers within an acceptable scope” (Chen et al. 2014). Under this framework, a dataset’s bigness is situational and depends on the capacities of the entity attempting to analyze it (Manyika et al. 2011). This framework well describes many ecologists’ experiences with big data as each of us adjusts to growing data volumes and variety. Therefore, absolute levels of bigness matter less than rates of data growth and scal-ability of solutions; as data grow, some analytical solutions scale up well, but others do not. In this framework, the scope of big data varies among disciplines and institutions: Data volumes and varieties that are unmanageably big in one dis-cipline may be routine to another. Therefore, many solutions to big-data challenges will come from lateral knowledge transfers within and outside ecology as best practices and big-data solutions are shared and adopted. Many of ecology’s solutions to its big-data challenges will be as much cultural as technological.

A second framework used to assess big data is the Four Vs Framework. This popular and flexible framework was first introduced by IBM in the early 2000s and is often used by technology companies to characterize data. In this framework, a dataset is described according to four charac-teristics: volume, variety, veracity, and velocity. Specifically, “volume refers to the size of the data, velocity indicates that big data are sensitive to time, variety means big data com-prise various types of data with complicated relationships, and veracity indicates the trustworthiness of the data” (Yang

Dow



Overview Articles


and Huang 2013). LaDeau and colleagues (2017) recently mapped these concepts onto data and practices in ecosystem ecology, such as extending variety to encompass variations in spatiotemporal scale and taxonomic level among eco-system datasets. Here, we build on this and other previous works (Michener and Jones 2012, Hampton et al. 2013) to identify examples of big-data systems in ecology (figure 1), explore how each maps onto these four axes, and describe some emerging solutions.

Volume. No fixed criterion defines the volume at which data become big, but the term typically refers to datasets on the scale of terabytes to exabytes (240 to 260 bytes). However, the size of big data varies among research domains and the datasets, software, and hardware used in those domains (Manyika et al. 2011). Within ecology, dataset volumes vary by orders of magnitude but are growing rapidly (figure 2). The largest data volumes traditionally have been gener-ated by remote sensors, which produce petabytes of data, from a variety of spaceborne and airborne platforms (Yang et al. 2010, Skytland 2012). Other biological and ecological data resources are also rapidly increasing in volume. The Fluxnet network of eddy flux covariance towers has over 500 sites, 200 recorded variables, and 1500 site-years of

data at subhourly resolution, with a raw data volume on the petabyte scale (http://fluxnet.fluxdata.org/data; Novick et al. 2018). Individual genomic repositories such as GenBank hold tens of millions of sequences (figure 2) and billions of bases (www.ncbi.nlm.nih.gov/genbank/statistics) and likely will exceed two exabytes by the middle of the next decade (Stephens et al. 2015), while microbial ecology has been transformed by rapidly decreasing sequencing costs and increasing rates of throughput (Xu 2006, Teitzel 2014). Among citizen-science efforts, eBird has over 21 million unique bird observations from over 180,000 locations since it began in 2002 (Sullivan et al. 2009). Camera trapping and automated sensor network datasets are growing, too. For example, the Tropical Ecology, Assessment, and Monitoring Network recorded over 3 million images of tropical mam-mals through 2017, with each field site capable of producing more than 8000 images per year (Ahumada et al. 2011).

Among biodiversity databases, the Global Biodiversity Information Facility (GBIF; www.gbif.org) houses well over 700 million digital records of field observations, living and fossil specimens, and reports from the scientific literature (figure 2a). GBIF’s holdings have grown nearly 300%, from about 180 million to 700 million records, between 2001 and 2016. Other examples include the Neotoma Paleoecology

Figure 1. Common types of big data in ecology, each illustrating one of the four Vs in big data: (1) Remote sensors of the earth system, mounted on a variety of platforms, which generate large data volumes. (2) Citizen-science efforts that collect data at volumes far beyond the capacity of scientific experts, by individuals with varying degrees of expertise. (3) Near real-time sensor networks that can deliver on-going data feeds at low latency and high velocity. (4) Field observations and experiments by scientists, across a wide variety of measurements, systems, and scales. This mapping of the four Vs to data types is illustrative; all four dimensions are present in all data types.

Dow



Overview Articles


Database, which holds over 3.5 million individual occur-rence records from over 14,000 datasets and associated spatial, temporal, and taxonomic attributes. Neotoma data volumes are growing at an average rate of 130,000 occur-rences per year (Williams et al. 2018), with rates of data addition accelerating (figure 2b). Similarly, the Paleobiology Database (PBDB) houses over 1.3 million fossil occurrences and is growing at a similar rate (figure 2c).

Variety. Variety, which describes the degree of data hetero-geneity and the complexity of relationships among data, is a defining feature of ecological data (Jones et al. 2006, Hampton et al. 2013, LaDeau et al. 2017). Many ecological variables can be measured—at many scales and for many organisms and ecosystems—by dispersed networks of many ecological researchers. The single greatest and ongoing chal-lenge in ecoinformatics is to provide structure and order to this welter of information (Jones et al. 2006, Hampton et al. 2013, LaDeau et al. 2017). Variety is closely linked to scal-ability. In big data, scalable solutions must be able to work across a wide range of data volumes and data types. High variety in data and methods limits the scalability of ecologi-cal science.

Multiple initiatives have been launched to tame eco-logical data variety. Controlled vocabularies and defined data structures help to efficiently assemble and

standardize ecological and biodiversity data (e.g., Ecological Markup Language, Michener et al. 1997; Darwin Core, http://rs.tdwg.org/dwc). The Long-Term Research Networks (LTERs), first formed as long-term monitoring and research stations, each with different protocols, have developed common data standards and shared data repositories that can house a wide variety of ecological data types (Servilla et al. 2016, Firbank et al. 2017). Individual research com-munities have banded together to share data, define com-mon experimental and observational protocols, and build platforms and governance systems for sharing data (e.g., NutNet, FLUXNET, Neotoma, and BIEN). Among biodi-versity resources, GBIF takes a seemingly simple observa-tion—the presence of a species at a location—and delineates nine separate ways of observing presences, including human observations, living and fossil specimens, literature reviews, and automated measurements, each with its own protocols and research communities. Similarly, Neotoma features 23 dataset categories, separated by taxonomic groups (e.g., plants, vertebrates, diatoms, and ostracodes) and measure-ment types (e.g., stable isotopes and organic biomarkers). The Data Observation Network for Earth (DataONE) oper-ates at a higher level, supporting a federated network of eco-logical data resources, called Member Nodes, and facilitates sharing of data search, data retrieval, computing resources, and best practices across nodes (www.dataone.org; Michener

1500 1600 1700 1800 1900 2000 1990 2000 2010

(a) Global Biodiversity Information Facility (b) (c) (d) Neotoma Paleoecolog y Database The Paleobiolog y Database GenBank

1x108

2x108

3x108

4x108

5x108

Cum

ulat

ive

num

ber

of o

ccu

rren

ces

1x106

2x106

3x106

1995 2000 2005 2010 2015

1.2x106

0.8x106

0.4x106

Date of Record Date of Submission Date of Submission Date of Submission

0.5x108

1x108

1.5x108

2x108

1990 2000 2010

Figure 2. Data volumes and rates of growth for four representative biodiversity and genomic databases: (a) the Global Biodiversity Information Facility, (b) the Neotoma Paleoecology Database, (c) the Paleobiology Database, and (d) GenBank.

Dow



Overview Articles


and Jones 2012). The World Data System International Council’s World Data Service (ICSU-WDS) also acts as a system of systems, accrediting scientific data resources with commitments to data quality and archival.

Ultimately, however, most individual ecological datasets are still produced by the work of dispersed individual researchers and research teams—the “long tail” of ecologi-cal data (Heidorn 2008, Hampton et al. 2013). Innovations in ecological methods also produce data variety. Ecological systems are complex, with many kinds of measurements necessary to elucidate underlying processes. Therefore, vari-ety can be reduced, but it cannot be eliminated; there will always be a dynamic tension between (a) standardizing and structuring existing data and (b) creating new and better ways of observing ecological systems.

Veracity. Veracity is a broad term that encompasses obser-vation error, expertise, and reliability of the primary data collectors, possible data corruption during secondary data management and analysis, and any other factor that might increase uncertainty. Some sources of uncertainty can be estimated or modeled, including spatial or temporal posi-tional uncertainty (Wing et al. 2005, Blaauw et al. 2010). New kinds of observations raise new questions about veracity, require checking against established methods, and lead to new approaches for minimizing uncertainty and maximizing signal. For example, in citizen-science efforts, protocols focus on observations that can be readily collected by citizens of varying expertise, with different protocols developed for dif-ferent populations of citizen scientists that vary by age, inter-est, and expertise (Tulloch et al. 2013, Birkin and Goulson 2015). Standardization and careful project design can miti-gate the veracity issues involved in these projects (Kosmala et al. 2016). Whenever possible, quantifiable measures of observation error (e.g., sampling error or calibration error) need to be included with data as a precondition for their use in data assimilation and forecasting (Dietze et al. 2018).

Many aspects of veracity, however, are challenging to quantify but are influential to conclusions (Meyer et al. 2016), such as rates of taxonomic misidentification (Stribling et al. 2008) and lab-scale quality-control practices. Often, ecologists address these issues by building small networks of trust, in which a scientist’s reputation is a guarantor for data quality. This solution to veracity works well at small to mesoscales but does not scale up well. Therefore, one big-data need is to build systems that can track data provenance and support expert data annotation.

Velocity. Velocity encompasses rates of data generation, rates of data processing and inference, and, most critically, mis-sion sensitivity to time. High-velocity data must be analyzed in real time to produce timely information and meaningful insights. For example, early warning systems for natural haz-ards such as tornadoes, tsunamis, and earthquakes require high-velocity analytical solutions (Blewitt et al. 2009). Twitter tweets can be analyzed for trends as they are posted,

enabling up-to-the-minute online conversations about real-time phenomena (Sakaki et al. 2013, Walther and Kaisser 2013, Smith et al. 2016). High-frequency financial trans-actions measure competitive advantage in microseconds (Aldridge 2013). In human–computer interactions, informa-tion delivery times must be less than 0.1 seconds for users to perceive instantaneous responses (Miller 1968). When velocity is the goal, reducing latency (the time lag between data need and delivery) becomes a research objective.

Velocity is the weakest fit between the Four Vs criteria for big-data science and current ecological scientific practice. Few ecological researchers seek to achieve real-time data analyses. Moreover, automated analyses of ecological data are warned against because of issues of variety and veracity (Soberón and Peterson 2004, Porter et al. 2005) and incom-plete error reporting (Dietze 2017).

However, ecology can and must increase its velocity, driven by growing capacity and societal need. Ecological data sources and analytical systems are increasingly able to support high-velocity analysis through automated sensor systems such as biotelemetry studies, camera traps and phenology cameras, social media, and eddy–flux towers. For example, animal biotelemetry produces continuous data streams that eliminate periods of unknown animal behavior and improve trend detection (Bouten et al. 2013, Dressler et al. 2016). Automated camera networks can quickly provide scientists and the public access to thou-sands of recorded images (Ahumada et al. 2011, Gomez Villa et al. 2017). Projects such as HealthMap can capture real-time trends in disease ecology through the automated processing of social media and news sources (Freifeld et al. 2008).

The societal need for high-velocity ecological data analyt-ics and forecasting (Joppa 2017, Dietze et al. 2018) grows as the pace of environmental change accelerates and the severity and synchrony of disturbance events increases (Betancourt 2012). In meteorology, attribution analyses of extreme weather events such as Hurricane Harvey to climate change can now be completed within weeks and published within a few months (Emmanuel 2017, van Oldenborgh et al. 2017). In MODIS data, there is a 3- to 5-hour latency between data capture and availability of raw imagery, whereas derived products that require multiple observations, such as land cover, are updated annually (NASA 2017). The Global Carbon Project (GCP) provides policymakers with annual updates on the state of the global carbon cycle (www.global-carbonproject.org).

These are promising developments, but overall, more work is needed to improve the velocity of ecological data analytics. Improved ecological forecasting requires tightly coupled and low-latency model-data loops that allow forecasting models to be continually and iteratively checked against data as new observations emerge (Dietze 2017, Dietze et al. 2018). Given the economic value of ecological services and their potential disruption by, for example, enhanced climate variability, there is a strong economic and societal incentive to increase

Dow



Overview Articles


ecological analytical velocity. For ecological knowledge to be useful, it must be timely.

Big-data systems in ecologyEcology is a heterogeneous discipline, so a first-order big-data challenge is simply understanding and encom-passing the variety of ecological data. Ecological data can be organized into data systems, defined here as a set of measurements collected by a research community and the accompanying analytical and conceptual frameworks for interpreting these measurements; a data system usually comprises many data types. Different research communities work with different data systems, and each community has different levels of expertise and historical investments in ecoinformatics. The degree and kind of big-data challenges and solutions vary among communities and data systems.

Here, we identify and summarize six emergent big-data systems in ecology: (1) remote sensing; (2) distributed networks of low-cost automated sensors; (3) biodiversity, genomics, and inventory databases; (4) community-curated data resources; (5) citizen science; and (6) national and international long-term ecological research and monitoring stations. We map each data system onto the four big-data dimensions (figure 3) and indicate growth dimensions that pose emergent challenges (figure 3 arrows).

Earth observing remote sensors mounted on airplane, sat-ellite, and drone platforms (figure 3a) are a long-standing source of high-volume ecological data, providing continuous spatial coverage of multiple ecosystem variables, including tree cover, greenness and ocean chlorophyll content, net pri-mary productivity, and habitat (Hansen et al. 2013, Schimel et al. 2013). With backing from NASA and other agencies, most remote sensing data have a history of good data man-agement and ready data availability. Data volumes are large, with datasets often including hundreds of gigabytes for sin-gle captures of some high-resolution sensors. Data volumes are increasing as, for example, hyperspectral sensors are deployed to capture more detailed information about foliar nutrients, canopy structure, and traits (Asner et al. 2016). Data velocities are limited by rates of postprocessing and need for repeat imagery (e.g., to remove clouds or capture phenological trends). Cost per sensor is decreasing as drones and cubesats proliferate, which will further increase data volume and variety (Anderson and Gaston 2013, Bryson et al. 2014). Integrating these spatially continuous data resources with other kinds of ecological data is a big-data frontier and opportunity to advance global-scale carbon cycle and biodiversity science (Schimel et al. 2013, 2015).

As sensor technology improves and costs decrease, dis-tributed networks of in situ automated monitoring sensors (figure 3b) provide an increasingly large portion of eco-logical data volume and support high-velocity scientific applications. Examples include eddy flux towers that moni-tor ecosystem–atmosphere exchanges of water, trace gases, and energy (Novick et al. 2018), camera traps to monitor animal behavior and movement (Ripperger et al. 2016,

Schadhauser et al. 2016), water chemistry and temperature loggers (Porter et al. 2005, Porter et al. 2009), and sound-scape microphones (Farina 2014). Costs vary widely, from precisely instrumented sensor systems such as eddy flux towers (Schimel et al. 2015, Novick et al. 2018) to camera traps, biotelemetry, microphones, temperature loggers, and other low-cost sensors (Hamel et al. 2013). These sensor networks can be high velocity, with, for example, phenologi-cal sensor networks (Richardson et al. 2007, Nasahara and Nagai 2015, Brown et al. 2016) capturing data every 30–60 minutes or instrumented monitoring systems, such as the USGS National Streamflow sensor network (USGS 2017), aiming to provide real-time data at macroscales. These sensors are subject to sensor drift, calibration errors, and problems with automated processing, creating challenges of veracity. Standardization can reduce variety, but the general trend is toward an increasing number and variety of sensors.

Biodiversity and inventory databases (figure 3c) aggregate many kinds of ecological measurements with the research objective of understanding the patterns and processes that govern biodiversity in space and time. Some databases store spatiotemporally explicit records of presence or abundance of individuals of a species or higher taxonomic group. Data volumes are in the order of millions to hundreds of millions of occurrences for species-level biodiversity data resources such as GBIF, Neotoma, and the Paleobiology Database (figure 2). The metanomics revolution, which extends far beyond biodiversity science, has opened new frontiers in species detection and biodiversity research, such as for microbial communities and other cryptic species (Hahn et al. 2016). Vegetation plot and inventory databases (e.g., the European Vegetation Archive and the USDA Forest Inventory Analysis) hold millions of observations from hundreds of thousands of plots (Dengler et al. 2011, Gray et al. 2012, Chytrý et al. 2016). Herbarium and museum col-lections provide both physical specimens and digital data on species distributions, past and present (Moritz and Agudo 2013). These data form the backbone of contemporary biodiversity analyses, tests of ecological and biogeographic theory, and assessments of species resilience and response to global environmental change (Moritz and Agudo 2013, Meyer et al. 2016). Data veracity and variety pose the largest challenges to these data resources (figure 3a), because their holdings depend on precise taxonomic and spatiotemporal identifications that are gathered by dispersed networks of ecological researchers, many no longer active. In the future, these data resources will face enhanced challenges of data volume (as rates of data upload increase; figure 2) and verac-ity (as more scientists contribute data; figure 3a).

Community-curated data resources (CCDRs; figure 3d) gather and curate data generated by communities of experts for the purpose of studying broader-scale phenomena. CCDRs add value and address challenges of variety and veracity by establishing common data models, nomencla-tures, and platforms for archiving and distributing data (Guralnick et al. 2007, Lehnert and Hsu 2015, Williams et al.

Dow



Overview Articles


2018). CCDRs usually center on research communities of practice who have banded together to pool their site-level or organismal-level data into larger resources. Examples include ecological trait databases such as TRY (Kattge et al. 2011); neo- or geo-biodiversity databases such as BIEN, Neotoma, or PBDB (figure 3c; Engemann et al. 2016); coor-dinated experimental networks (e.g., DroughtNet, NutNet, or the International Tundra Experiment; Oberbauer et al. 2013, Harpole et al. 2016, Felton and Smith 2017); or obser-vational monitoring networks such as AmeriFlux or TEAM (Ahumada et al. 2011, Novick et al. 2018). CCDRs are rarely big with respect to data volume but have a high cost per “sensor” (representing the training needed to achieve exper-tise) and high value per data point.

Citizen-science efforts (figure 3e) enlist the general public to participate in research (Sullivan et al. 2009, Hochachka et al. 2012, Chandler et al. 2017b). Citizen scientists are

able to intensively sample and provide fine-resolution data at lower costs than traditional expert-sourced methods while promoting public engagement (Hochachka et al. 2012). Citizen-science projects range in scope from local-scale, high-resolution projects, such as monitoring the impacts of small-scale pollution events (Hyder et al. 2017), to large-scale biodiversity projects involving thousands of volunteers for monitoring, such as butterfly or bird populations (Chandler et al. 2017a, Dennis et al. 2017, Sauer et al. 2017). Often, citizen science can validate exist-ing data, such as ground-truthing of remote sensing data (e.g., Gomez Villa et al. 2017) or calibrating and refining machine-learning systems for automated species identi-fication (e.g., iNaturalist; www.inaturalist.org). Because citizen-science efforts tap into even larger networks of observers of varying expertise, veracity is a central issue in designing citizen-science data resources (figure 3d).

(a) (b) (c)

(d) (e) (f)

Figure 3. Different kinds of big data in ecology, qualitatively scored according to the Four Vs framework: (a) remote sensors, (b) in-situ sensor networks, (c) biodiversity databases and inventories, (d) community-curated data resources, (e) citizen science, and (f) long-term ecological stations. Each panel presents the four axes of big data—velocity, variety, volume, and veracity—and a qualitative scoring of the degree of challenge faced by ecologists using each type of data. The arrows show the dimensions in which growth is particularly rapid for each type.

Dow



Overview Articles


Large-scale citizen-science platforms, including eBird, are growing rapidly, creating nontrivial problems of data stor-age and volume (Chi et al. 2017). The number of obser-vations collected by citizen scientists tends to be fewer, reducing ecological data variety (figure 3d).

Long-term networks of ecological monitoring stations (fig-ure 3f) provide long-term, co-located, multidimensional observations and experiments of environmental and eco-logical variables at targeted sites. Like biodiversity databases (figure 3c), permanent stations comprise many different kinds of sensors and measurements, so data variety is acute. Site networks promote continental-scale understand-ing of climate change, land use and habitat change, and invasive species impact on ecosystems. Examples include the National Ecological Observatory Network (NEON), the US and International Long Term Ecological Research networks (LTER), Critical Zone Observatories (CZO), the Terrestrial Ecosystem Research Network (TERN) and the South African Ecological Observatory Network (SAEON). In principle, because each site measures similar ecologi-cal variables, macroscale ecological analyses are possible. However, these networks face persistent challenges of data variety, within and across sites, which hinders high-vol-ume and high-velocity data analytics. For the LTERs, the Environmental Data Initiative gathers and archive many kinds of ecological data (https://environmentaldatainitiative.org/about/edi). More recent networks such as NEON have conducted extensive preplanning to design and implement common protocols, to reduce the confounding sources of variety in macroscale studies of ecological dynamics in space and time, and to better characterize sources of observational error. However, even here, data velocity remains a challenge.

Technological solutions to big dataMultiple solutions are being actively developed to tackle big data in ecoinformatics (Jones et al. 2006) and in the broader data sciences (Blei and Smyth 2017). Here, we highlight three areas that address various big-data challenges: (1) the devel-opment of open data and open cyberinfrastructure designed to support the curation, discovery, linking, and reusability of data and software among dispersed networks of scientists; (2) the growing adoption of flexible statistical approaches such as Bayesian hierarchical models and machine-learning techniques that integrate a wide variety of ecological data and processes; and (3) building increased research comput-ing capacity through cloud-computing resources that allow flexible and dynamic provisioning of computing resources for high-volume, high-velocity applications.

Building cyberinfrastructure for open data and shared solutions. For ecology to move forward and fully take advantage of the big-data revolution in an inclusive manner will require a concerted effort that includes faster uptake of new technolo-gies, improved data and workflow sharing, and enhanced translation and documentation of big-data services. No single individual or institution can house, curate, and

effectively analyze all forms of ecological data. Therefore, to best handle ecological variety and leverage the distributed expertise of ecological researchers, the development of eco-logical cyberinfrastructure must prioritize the openness of data, methods, standards, and code. The depth and rigor of ecological data analytics can be maximized by ensuring that solutions developed in one domain of ecological big data can be quickly adopted and vetted by others.

Multiple organizations and efforts are working to build an open architecture for scientific data. Nonprofit groups such as DataONE, the World Wide Web Consortium, Earth Science Information Partners, the Open Geospatial Consortium, and the International Council for Science are setting common metadata, semantic, and ontology standards (Madin et al. 2008), such as the FAIR Principles (Wilkinson et al. 2016), and holding open forums, such as the rOpenSci Unconference (http://unconf17.ropensci.org) and the Cyber4Paleo Community Development Workshop (http://cyber4pale0.github.io). rOpenSci and other groups are developing open software, using platforms such as GitHub (Boettiger et al. 2015). Ecological networks such as NEON and LTER are committing resources to building platforms to make all data available. Consortia of individual scientists are banding together to create community-curated data resources (Cutcher-Gershenfeld et al. 2017, Novick et al. 2018, Williams et al. 2018) and use these to understand macroscale ecological processes, such as those governing species distributions and diversity (Lamanna et al. 2014).

A variety of programmatic solutions are emerging to better link data resources to each other and to scientists, including (a) application programming interfaces (APIs), (b) documented scientific workflows and version-control software platforms such as GitHub, and (c) linked data systems for linking networks of data across the internet (Boettiger et al. 2015). Data resources are increasingly relying on APIs to make their data more open and avail-able to other scientists, without users needing to fully duplicate the data resource on their local computer. APIs can be packaged into libraries or code packages for various programming languages such as R (e.g., Goring et al. 2015, LeBauer et al. 2017). APIs also allow the democratization of complex data resources by providing users with access and analytical tools that do not require the installation of specialized software.

Version control systems and open-source platforms such as GitHub or BitBucket for sharing code make it increasingly easy to collaborate and share well-documented analytical workflows to help build scientific capacity for working with big data. Well-documented and well-engineered software is essential to good scientific practice (Borregaard and Hart 2016, Mislan et al. 2016), with an increasing variety of tools for sharing and porting code (Schmidt et al. 2017). For example, Jupyter and RMarkdown enable the creation and sharing of live documents with embedded code and visual-izations (Baumer et al. 2014, Allaire et al. 2015, Kluyver et al. 2016) and Docker containers have increased the portability

Dow



Overview Articles


of complex platform-dependent workflows (Boettiger et al. 2015, Devisetty et al. 2016). These open-science solutions are leading to new metrics of academic impact and produc-tivity (e.g., size and variety of researcher networks or usage rates for data and software).

Distributed ecological data resources can be connected through linked data approaches, unique resource indicators, and adoption of common standards and schema. Linked data are a cornerstone of the semantic web (Berners-Lee 2006), and linked data can be readily discovered and returned by webcrawlers and other algorithms. Linked data rely on unique resource identifiers (URIs) such as HTTP addresses or digital object identifiers (DOIs), which can be used to build links among data resources, such as linking datasets at a common spatial location, primary datasets to derived data-sets, datasets to publications, or individual data objects to the databases that contain them (Pampel et al. 2013). Schema describe the relationships among data within structured data resources. For example, the Friend of a Friend (foaf) standard provides a set of terms and naming conventions for people (http://xmlns.com/foaf/spec). The proliferation of schema and data standards is another informatics challenge, and various groups exist to manage and share schema, such as schema.org or the Open Geospatial Consortium (http://opengeospatialconsortium.org; Ma 2017). Schema can be given URIs, and semantic wikis such as schema.org provide community portals for creating, maintaining, crosswalking, and sharing schema. Whenever possible, existing schema should be adopted rather than contributing to the prolifera-tion of schema.

Advances in statistical modeling and ecological forecasting. Making inferences from big data requires analytical approaches that are ideally flexible enough to handle high-variety and mixed-veracity data and scalable enough to han-dle high-volume and high-velocity analytics. All approaches, of course, should ultimately allow the underlying scientific questions and hypotheses to be addressed. Here, we focus on two areas of rapid advances in big-data analytics: Bayesian statistics, which are generally flexible but often computa-tionally intensive and so may not scale well, and machine learning (ML), which is generally scalable but often excels at identifying pattern rather than making inferences about the underlying mechanism. The two approaches are not com-pletely distinct; for instance, some work has recharacterized ML in a Bayesian context (Theodoridis 2015).

Advances in Bayesian statistical methods are providing ecologists with solutions to ecological variety and veracity via approaches that can handle multiple layers of complex-ity in ecological processes, observations, and uncertainty (Clark 2005). Bayesian hierarchical models can flexibly integrate multiple data types that may vary in scale and res-olution, and they also allow the characterization of uncer-tainty associated with the data and underlying processes (Clark 2005, Cressie et al. 2009). In addition, Bayesian modeling is useful for representing and understanding the

structure of spatiotemporal dependencies in ecological data networks

Bayesian models are well suited for ecological forecasting, which describes the process of predicting future ecosystem states and their uncertainties (Dietze 2017). Timely and reliable ecological forecasts are critical for decision-making (Ascough et al. 2008, Dietze 2017). Ecological forecasting requires a close coordination between models and data, with iterative model-data loops employed to continually improve ecological forecasts as models are repeatedly tested and updated against new incoming data (Dietze 2017). As cyberinfrastructure advances improve access to increas-ingly large and varied data resources, Bayesian hierarchical models can combine models and data to account for and test hypotheses about system complexity and stochasticity. Using information about process estimates and their uncertainty can then highlight where to focus management interven-tion (Williams and Hooten 2016) or new data collection efforts (Hooten et al. 2012) to further decrease prediction uncertainty.

Although Bayesian methods provide the theoretical and methodological framework to connect processes and data across scales, they pose their own big-data challenges. Parameter estimation may be computationally intensive, or for models with many parameters, posterior data volumes may be large. New initiatives have focused on developing computationally efficient solutions for solving complex large-scale problems (Beaumont 2010, Hoffman and Gelman 2014, Rue et al. 2017).

Machine-learning approaches can make inferences based on data without imposing assumptions about the structure of the process or data (Olden et al. 2008). ML methods are typically iterative and decision based, in which decisions may depend on some optimization function (possibly with a nonlinear loss or gain function). Common examples in ecology include boosted regression trees, generalized linear models, and random forest models (e.g., Elith et al. 2006), but ML approaches also include more complex methods such as knowledge learning and analysis systems (Peters et al. 2014). Artificial neural networks, organized into layers of “neurons” or nodes, are the basis of “deep” ML and have become increasingly powerful as the number of nodes and layers increases.

Machine-learning methods are well suited for handling high-volume and high-variety data that are high dimen-sional, nonlinear, and exhibiting complex interactions. However, because ML algorithms are often highly complex, the ability to make inferences about underlying processes is often sacrificed. Machine-learning methods are well suited for parallel computing and so can often make high-velocity predictions from large-volume datasets, such as image rec-ognition in camera traps (Norouzzadeh et al. 2017). Some suggest that the entire process of scientific inquiry using big data may eventually be carried out using ML (Peters et al. 2014). Regardless, ML methods continue to rapidly advance and remain a frontier of big-data science.

Dow



Overview Articles


Cloud computing. Cloud computing helps address the chal-lenges of volume and velocity. Cloud computing provides on-demand access to pools of flexible computing resources that can be quickly launched and released with minimal effort to provide computational resources beyond the capac-ity of any individual scientist or institution (Mell and Grance 2011, Michener and Jones 2012, Hampton et al. 2013, Barone et al. 2017). Several resources have been built specifi-cally for scientific researchers via collaborative partnerships between federal agencies and universities. For example, XSEDE (www.xsede.org) provides no-cost access to high-performance computing resources for US scholars (Towns et al. 2014). Cyverse provides a cloud-based ecosystem of commuting tools, including the Atmosphere platform for accessing computing resources, for use in plant genetics and transcriptomics (Merchant et al. 2016). Jetstream extends these tools to researchers from other fields by providing a preconfigured library of domain specific virtual machine images and reproducible scientific workflows with unique DOIs (https://jetstream-cloud.org/tech-specs/index.php; Fischer et al. 2015). Similarly, the Department of Energy’s Systems Biology Knowledgebase (KBASE; http://kbase.us/new-to-kbase) provides computational resources and other cyberinfrastructure, accessed through a browser-based user interface, for the fields of plant and microbial physiology and community ecology (Arkin et al. 2016).

Furthermore, the rapid commercialization of cloud com-puting and the widespread availability of public cloud services through large providers such as Amazon (AWS), Google, and Microsoft have put an effectively unlimited supply of computing resources at scientists’ disposal. Cloud computing changes the cost model from users paying fixed costs associated with running and maintaining their own server networks to costs scaled to unit of usage (e.g., memory allocation and CPU). Therefore, the cloud lets users quickly scale resources according to current demand, allow-ing scientists to easily procure additional storage space or computing power as needed (Armbrust et al. 2009, Hassan 2011).

Cloud computing is increasingly used in biodiversity modeling and molecular and microbial modeling to handle high-volume computing applications (Hahn et al. 2016). For example, cloud-based computing was used to repeatedly model the ranges of over 11,000 marine spe-cies, with high efficiency (Candela et al. 2016); Google has launched the Google Genomics cloud-based platform; and Nephele is a cloud-computing pilot project at the National Institute of Health (Hahn et al. 2016). Other fields employing cloud computing include bioinformatics (Schatz et al. 2010), genomics (Stein 2010), and climate analytics (Schnase et al. 2017) to, for example, quickly provision the computing resources needed for short-duration but real-time dust-storm forecasting (Yang et al. 2010, Huang 2012). Contemporary climate analyt-ics often work with datasets too large to be transferred across networks, promoting the development of Climate

Analytics-as-a-Service, which integrate data storage and high-performance computing to perform data-proximal analytics (Schnase et al. 2017).

Cultural solutions to big data: Building the next generation of big-data ecologists and ecoinformaticistsUltimately, the broadest challenges to scaling up in ecology are cultural rather than strictly technological. Science is a human enterprise: Most ecological datasets are collected by individual practitioners, and scientific insight remains strongly rooted within the human capacity for creativity, synthesis, and inference. There is a growing disconnect between the practice of ecology as a big-data science and the scarcity of ecologists trained in big-data analysis approaches. Too often, big-data solutions are developed piecemeal, with custom implementations designed for highly specific applications, and little adoption beyond the originators. Therefore, many big-data solutions in ecology will be pri-marily social and cultural.

First, we must train ourselves: Ecologists must learn and develop new approaches to data storage, organization, distri-bution, and analysis, both within individual research groups and across our discipline. New tools and approaches must be effectively communicated, documented, and distributed to the community if all ecologists are to benefit. Efforts such as the Data and Software Carpentry workshops (Wilson 2014, Teal et al. 2015), which develop and share best practices, are vital for training the next (and current) generation of ecological researchers. Data sharing and the open develop-ment of software and workflows in the ecological sciences (using formats and platforms such as Jupyter notebooks, RMarkdown, GitHub, and BitBucket) help researchers adapt best practices to their applications, thereby speeding up the rate of scientific progress (Lowndes et al. 2017). Training also helps ecologists engage with informaticists so that application development is collaborative and accounts for the human dimensions of scientific practice (Downey and Pennington 2009).

Second, ecologists must better build partnerships with ecoinformaticists and welcome ecoinformatics as a core discipline within ecology. Ecoinformatics enables better science: larger data volumes brought to bear on the ques-tion of interest; capacity to gather and synthesize mul-tiple heterogeneous data streams; enabling specialists to quickly add, annotate, and improve existing data resources; start-to-finish workflows that maximize transparency and reproducibility; and better access to more sophisticated eco-logical models and advanced computing capabilities (Britton and Lloyd 2014). Ecoinformatics operates at the boundary between traditional ecology and the data sciences and helps development priorities stay closely attuned to research prior-ities (Michener and Jones 2012). Ecoinformatics has histori-cally operated at the margins of mainstream ecology—the domain of a few specialists and advocates (Jones et al. 2006, Michener and Jones 2012, Hampton et al. 2013, Soranno

Dow



Overview Articles


et al. 2014)—but the time is ripe to recognize ecoinformatics as an essential and growing subdiscipline of ecology.

One promising area of joint collaboration is the develop-ment of increasingly complex “science as service” platforms that allow standard workflows, analyses, and datasets to be shared among researchers, promoting reproducibility, improving scalability, and ensuring that these tools are available to a broad range of researchers, not simply those with high technical competence (Foster 2005, Foster et al. 2016, Grossman et al. 2016, Lenhardt et al. 2016, Schnase et al. 2017). As low-cost, cloud-based computing resources become increasingly available to researchers, analyses and models can be run on high-performance virtual hardware, alleviating some financial and computational limitations to scaling up. Creating this ecosystem of open, consistent, and cutting-edge scientific services running in the cloud will require new collaborations among domain ecologists, ecoin-formaticists, and computer scientists. These joint efforts are essential to winning the race between two ongoing and accel-erating trends: the scale and pace of environmental change and the generation of scientific data and knowledge that can help us understand, predict, and shape these changes.

AcknowledgmentsThis work was supported by the National Science Foundation (nos. 1241868, 1550707, and 1541002) and is a contribu-tion to the PalEON, Earth-Life Consortium, and Neotoma Paleoecology Database projects.

References citedAhumada JA, et al. 2011. Community structure and diversity of tropical

forest mammals: Data from a global camera trap network. Philosophical Transactions of the Royal Society B 366: 2703–2711.

Aldridge I. 2013. High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems, 2nd ed. Wiley.

Allaire J, Cheng J, Xie Y, McPherson J, Chang W, Allen J, Wickham H, Hyndman R. 2015. rmarkdown: Dynamic Documents for R. R pack-age version 1.6. R Foundation for Statistical Computing. (31 May 2018; https://cran.r-project.org/web/packages/rmarkdown/index.html)

Anderson K, Gaston KJ. 2013. Lightweight unmanned aerial vehicles will revolutionize spatial ecology. Frontiers in Ecology and the Environment 11: 138–146.

Arkin AP, et al. 2016. The DOE Systems Biology Knowledgebase (KBase). bioRxiv. (31 May 2018; https://doi.org/10.1101/096354)

Armbrust M, et al. 2009. Above the Clouds: A Berkeley View of Cloud Computing. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Report no. UCB/EECS-2009-28.

Ascough JC II, Maier HR, Ravalico JK, Strudley MW. 2008. Future research challenges for incorporation of uncertainty in environmental and eco-logical decision-making. Ecological Modelling 219: 383–399.

Asner GP, Knapp DE, Anderson CB, Martin RE, Vaughn N. 2016. Large-scale climatic and geophysical controls on the leaf economics spectrum. Proceedings of the National Academy of Sciences 113: E4043–E4051.

Baron JS, et al. 2017. Synthesis centers as critical research infrastructure. BioScience 67: 750–759.

Barone L, Williams J, Micklos D. 2017. Unmet needs for analyzing bio-logical big data: A survey of 704 NSF principal investigators. PLOS Computational Biology 13 (art. e1005755).

Baumer B, Cetinkaya-Rundel M, Bray A, Loi L, Horton NJ. 2014. R Markdown: Integrating a reproducible analysis tool into introductory

statistics. arXiv, Cornell University Library. (31 May 2018; https://arxiv.org/abs/1402.1894)

Beaumont MA. 2010. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics 41: 379–406.

Berners-Lee T. 2006. Linked data. World Wide Web Consortium. (29 August 2017; www.w3.org/DesignIssues/LinkedData.html)

Betancourt JL. 2012. Reflections on the relevance of history in a nonsta-tionary world. Pages 307–318 in Wiens JA, Hayward GD, Safford HD, Giffen CM, eds. Historical Environmental Variation in Conservation and Natural Resource Management. Wiley.

Birkin L, Goulson D. 2015. Using citizen science to monitor pollination services. Ecological Entomology 40: 3–11.

Blaauw M, Bennet KD, Christen JA. 2010. Random walk simulations of fos-sil proxy data. Holocene 20: 645–649.

Blei DM, Smyth P. 2017. Science and data science. Proceedings of the National Academy of Sciences 114: 8689–8692.

Blewitt G, Hammond WC, Kreemer C, Plag H-P, Stein S, Okal E. 2009. GPS for real-time earthquake source determination, tsunami warning systems. Journal of Geodesy 83: 335–343.

Boettiger C, Chamberlain S, Hart E, Ram K. 2015. Building software, build-ing community: Lessons from the rOpenSci project. Journal of Open Research Software 3 (art. e8).

Borregaard MK, Hart EM. 2016. Towards a more reproducible ecology. Ecography 39: 349–353.

Bouten W, Baaij EW, Shamoun-Baranes J, Camphuysen KCJ. 2013. A flex-ible GPS tracking system for studying bird behaviour at multiple scales. Journal of Ornithology 154: 571–580.

Brewer S, Jackson ST, Williams JW. 2012. Paleoecoinformatics: Applying geohistorical data to ecological questions. Trends in Ecology and Evolution 27: 104–112.

Britton D, Lloyd S. 2014. How to deal with petabytes of data: The LHC Grid project. Reports on Progress in Physics 77 (art. 065902).

Brown TB, et al. 2016. Using phenocams to monitor our changing Earth: toward a global phenocam network. Frontiers in Ecology and the Environment 14: 84–93.

Bryson M, Reid A, Hung C, Ramos FT, Sukkarieh S. 2014. Cost-effective mapping using unmanned aerial vehicles in ecology monitoring applications. Pages 509–523 in Khatib O, Kumar V, Sukhatme G, eds. Experimental Robotics: The 12th International Symposium on Experimental Robotics. Springer.

Candela L, Castelli D, Coro G, Pagano P, Sinibaldi F. 2016. Species distribu-tion modeling in the cloud. Concurrency and Computation: Practice and Experience 28: 1056–1079

Chandler M, See L, Buesching CD, Cousins JA, Gillies C, Kays RW, Newman C, Pereira HM, Tiago P. 2017a. Involving citizen scientists in biodiversity observation. Pages 211–237 in Walters M, Scholes RJ, eds. The GEO Handbook on Biodiversity Observation Networks. Springer.

Chandler M, et al. 2017b. Contribution of citizen science towards interna-tional biodiversity monitoring. Biological Conservation 213: 280–294.

Chang WL. 2015. NIST Big Data Interoperability Framework: Volume 1, Definitions. National Institute of Standards and Technology, US Department of Commerce.

Chen M, Mao S, Liu Y. 2014. Big data: A survey. Mobile Networks and Applications 19: 171–209.

Chi H, Pitter S, Li N, Tian H. 2017. Big data solutions to interpreting com-plex systems in the environment. Pages 107–124 in Srinivasan S, ed. Guide to Big Data Applications. Springer.

Chytrý M, et al. 2016. European Vegetation Archive (EVA): An integrated database of European vegetation plots. Applied Vegetation Science 19: 173–180.

Clark JS. 2005. Why environmental scientists are becoming Bayesians. Ecology Letters 8: 2–14.

Clark JS, Nemergut D, Seyednasrollah B, Turner PJ, Zhang S. 2016. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data. Ecological Monographs 87: 34–56. doi:10.1002/ecm.1241

Dow



Overview Articles


Cressie N, Calder CA, Clark JS, Hoef JMV, Wikle CK. 2009. Accounting for uncertainty in ecological analysis: The strengths and limitations of hierarchical statistical modeling. Ecological Applications 19: 553–570.

Cutcher-Gershenfeld J, Baker KS, Berente N, Flint C, Gershenfeld G, Grant B, Haberman M, King JL, Kirkpatrick C, Lawrence B. 2017. Five ways consortia can catalyse open science. Nature 543: 615.

Dengler J, et al. 2011. The global index of vegetation-plot databases (GIVD): A new resource for vegetation science. Journal of Vegetation Science 22: 582–597.

Dennis EB, Morgan BJT, Brereton TM, Roy DB, Fox R. 2017. Using citizen science butterfly counts to predict species population trends. Conservation Biology 31: 1350–1361. (31 May 2018; http://dx.doi.org/10.1111/cobi.12956)

Devisetty UK, Kennedy K, Sarando P, Merchant N, Lyons E. 2016. Bringing your tools to Cyverse Discovery Environment using Docker. F1000Research 5 (art. 1442).

Dietze MC. 2017. Ecological Forecasting. Princeton University Press.Dietze MC, et al. 2018. Iterative near-term ecological forecasting: Needs,

opportunities, and challenges. Proceedings of the National Academy of Sciences 115 (art. 201710231). (31 May 2018; https://doi.org/10.1073/pnas.1710231115)

Downey L, Pennington D. 2009. Bridging the gap between technology and science with examples from ecology and biodiversity. Biodiversity Informatics 6: 18–27.

Dressler F, Ripperger S, Hierold M, Nowak T, Eibel C, Cassens B, Mayer F, Meyer-Wegener K, Kolpin A. 2016. From radio telemetry to ultra-low-power sensor networks: Tracking bats in the wild. IEEE Communications Magazine 54: 129–135.

Elith J, et al. 2006. Novel methods improve prediction of species’ distribu-tions from occurrence data. Ecography 29: 129–151.

Emmanuel K. 2017. Assessing the present and future probability of Hurricane Harvey’s rainfall. Proceedings of the National Academy of Sciences 114: 12681–612684.

Engemann K, et al. 2016. Patterns and drivers of plant functional group dominance across the Western Hemisphere: A macroecological re-assessment based on a massive botanical dataset. Botanical Journal of the Linnean Society 180: 141–160.

Farina A. 2014. Soundscape Ecology: Principles, Patterns, Methods, and Applications. Springer.

Felton AJ, Smith MD. 2017. Plant ecological responses to climate extremes from the individual to the ecosystem level. Proceedings of the Royal Society B 372 (art. 20160142).

Firbank LG, et al. 2017. Towards the co-ordination of terrestrial ecosys-tem protocols across European research infrastructures. Ecology and Evolution 7: 3967–3975.

Fischer J, Tuecke S, Foster I, Stewart CA. 2015. Jetstream: A distributed cloud infrastructure for underresourced higher education communities. Pages 23–34 in Jha S, Katz DS, Weissman J, eds. Proceedings of the First Workshop on the Science of Cyberinfrastructure: Research, Experience, Applications, and Models. Association for Computing Machinery.

Foster I. 2005. Service-oriented science. Science 308: 814–817.Foster I, Chard K, Tuecke S. 2016. The Discovery Cloud: Accelerating and

democratizing research on a global scale. Pages 68–77. Institute of Electrical and Electronics Engineers (IEEE) International Conference on Cloud Engineering (IC2E). IEEE.

Freifeld CC, Mandl KD, Reis BY, Brownstein JS. 2008. HealthMap: Global infectious disease monitoring through automated classification and visualization of Internet media reports. Journal of the American Medical Informatics Association 15: 150–157.

Gomez Villa A, Salazar A, Vargas F. 2017. Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. Ecological Informatics 41: 24–32.

Goring S, Dawson A, Simpson G, Ram K, Graham RW, Grimm EC, Williams JW. 2015. neotoma: A programmatic interface to the Neotoma Paleoecological Database. Open Quaternary 1: 1–17.

Gray AN, Brandeis TJ, Shaw JD, McWilliams WH, Miles PD. 2012. Forest inventory and analysis database of the United States of America (FIA). Biodiversity and Ecology 4: 225–231.

Grossman RL, Heath A, Murphy M, Patterson M, Wells W. 2016. A case for data commons: Toward data science as a service. Computing in Science and Engineering 18: 10–20.

Guralnick RP, Hill AW, Lane M. 2007. Towards a collaborative, global infrastructure for biodiversity assessment. Ecology Letters 10: 663–672.

Hahn AS, Konwar KM, Louca S, Hanson NW, Hallam SJ. 2016. The infor-mation science of microbial ecology. Current Opinion in Microbiology 31: 209–216.

Hamel S, Killengreen ST, Henden J-A, Eide NE, Roed-Eriksen L, Ims RA, Yoccoz NG. 2013. Towards good practice guidance in using camera-traps in ecology: Influence of sampling design on validity of ecological inferences. Methods in Ecology and Evolution 4: 105–113.

Hampton SE, Parker JN. 2011. Collaboration and productivity in scientific synthesis. BioScience 61: 900–910.

Hampton SE, Strasser CA, Tewksbury JJ, Gram WK, Budden AE, Batcheller AL, Duke CS, Porter JH. 2013. Big data and the future of ecology. Frontiers in Ecology and the Environment 11: 156–162.

Hansen MC, et al. 2013. High-resolution global maps of 21st-century forest cover change. Science 342: 850–853.

Harpole WS, et al. 2016. Addition of multiple limiting resources reduces grassland diversity. Nature 537: 93.

Hassan Q. 2011. Demystifying cloud computing. Journal of Defense Software Engineering 1: 16–21.

Heidorn PB. 2008. Shedding light on the dark data in the long tail of sci-ence. Library Trends 57: 280–299.

Hochachka WM, Fink D, Hutchinson RA, Sheldon D, Wong W-K, Kelling S. 2012. Data-intensive science applied to broad-scale citizen science. Trends in Ecology and Evolution 27: 130–137.

Hoffman MD, Gelman A. 2014. The No-U-Turn Sampler: Adaptively set-ting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15: 1593–1623.

Hooten MB, Ross BE, Wikle CK. 2012. Optimal spatio-temporal monitoring designs for characterizing population trends. Pages 443–459 in Gitzen RA, Millspaugh JJ, Cooper AB, Licht DS, eds. Design and Analysis of Long-term Ecological Monitoring Studies. Cambridge University Press.

Howe D, et al. 2008. Big data: The future of biocuration. Nature 455: 47–50.Huang Q. 2012. Adaptive Nested Models and Cloud Computing for

Scientific Simulation: A Case Study Using Dust Storm Forecasting. Lambert Academic.

Hyder K, Wright S, Kirby M, Brant J. 2017. The role of citizen science in monitoring small-scale pollution events. Marine Pollution Bulletin 120: 51–57.

Jones MB, Schildhauer MP, Reichman OJ, Bowers S. 2006. The new bioin-formatics: Integrating ecological data from the gene to the biosphere. Annual Review of Ecology, Evolution, and Systematics 37: 519–544.

Joppa LN. 2017. AI for Earth. Nature 552: 325–328.Kattge J, et al. 2011. TRY: A global database of plant traits. Global Change

Biology 17: 2905–2935.Kluyver T, et al. 2016. Jupyter Notebooks: A publishing format for repro-

ducible computational workflows. Pages 87–90 in Loizides F, Schmidt B, eds. Positioning and Power in Academic Publishing: Players, Agents, and Agendas: Proceedings of the 20th International Conference on Electronic Publishing. IOS Press.

Kosmala M, Wiggins A, Swanson A, Simmons B. 2016. Assessing data quality in citizen science. Frontiers in Ecology and the Environment 14: 551–560.

LaDeau SL, Han BA, Rosi-Marshall EJ, Weathers KC. 2017. The next decade of big data in ecosystem science. Ecosystems 20: 274–283.

Lamanna C, et al. 2014. Functional trait space and the latitudinal diver-sity gradient. Proceedings of the National Academy of Sciences 111: 13745–13750.

LeBauer D, Kooper R, Mulrooney P, Rohde S, Wang D, Long SP, Dietze MC. 2017. BETYdb: A yield, trait, and ecosystem service database applied

Dow



Overview Articles


to second-generation bioenergy feedstock production. Global Change Biology Bioenergy 10: 61–71. doi:10.1111/gcbb.12420

Lehnert K, Hsu L. 2015. The new paradigm of data publication. Elements 11: 368–369.

Lenhardt WC, Conway M, Scott E, Blanton B, Krishnamurthy A, Hadzikadic M, Vouk M, Wilson A. 2016. Cross-institutional research cyberinfrastructure for data intensive science. Pages 1–6 in Institute of Electrical and Electronics Engineers (IEEE). High Performance Extreme Computing Conference (HPEC). IEEE.

Lowndes JSS, Best BD, Scarborough C, Afflerbach JC, Frazier MR, O’Hara CC, Jiang N, Halpern BS. 2017. Our path to better science in less time using open data science tools. Nature Ecology and Evolution 1 (art. 0160).

Lynch C. 2008. Big data: How do your data grow? Nature 455: 28–29.Ma X. 2017. Linked geoscience data in practice: Where W3C standards

meet domain knowledge, data visualization, OGC standards. Earth Science Informatics 10: 1–13.

Madin JS, Bowers S, Schildhauer MP, Jones MB. 2008. Advancing ecological research with ontologies. Trends in Ecology and Evolution 23: 159–168.

Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. 2011. Big data: The next frontier for innovation, competition, and produc-tivity. McKinsey. (3 May 2018; www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation)

Mell P, Grance T. 2011. The NIST Definition of Cloud Computing. National Institute of Standards and Technology. Report no. 800-145.

Merchant N, Lyons E, Goff S, Vaughn M, Ware D, Micklos D, Antin P. 2016. The iPlant collaborative: Cyberinfrastructure for enabling data to dis-covery for the life sciences. PLOS Biology 14 (art. e1002342).

Meyer C, Weigelt P, Kreft H. 2016. Multidimensional biases, gaps, uncertain-ties in global plant occurrence information. Ecology Letters 19: 992–1006.

Michener WK, Jones MB. 2012. Ecoinformatics: Supporting ecology as a data-intensive science. Trends in Ecology and Evolution 27: 85–92.

Michener WK, Brunt JW, Helly JJ, Kirchner TB, Stafford SG. 1997. Nongeospatial metadata for the ecological sciences. Ecological Applications 7: 330–342.

Miller RB. 1968. Response time in man–computer conversational transactions. Proceedings of the AFIPS Fall Joint Computer Conference 33: 267–277.

Mislan KAS, Heer JM, White EP. 2016. Elevating the status of code in ecol-ogy. Trends in Ecology and Evolution 31: 4–7.

Moritz C, Agudo R. 2013. The future of species under climate change: Resilience or decline? Science 341: 504–508.

[NASA] National Aeronautics and Space Administration. 2017. MCD12Q2 Product Information. NASA. (28 August 2017; https://ladsweb.modaps.eosdis.nasa.gov/missions-and-measurements/products/land-cover-and-phenology/MCD12Q2)

Nasahara KN, Nagai S. 2015. Development of an in situ observation net-work for terrestrial ecological remote sensing: The Phenological Eyes Network (PEN). Ecological Research 30: 211–223.

Norouzzadeh MS, Nguyen A, Kosmala M, Swanson A, Packer C, Clune J. 2017. Automatically identifying wild animals in camera trap images with deep learning. arXiv, Cornell University Library. (31 May 2018; https://arxiv.org/abs/1703.05830)

Novick KA, Biederman JA, Desai AR, Litvak ME, Moore DJP, Scott RL, Torn MS. 2018. The AmeriFlux network: A coalition of the willing. Agricultural and Forest Meteorology 249: 444–456.

Oberbauer SF, et al. 2013. Phenological response of tundra plants to background climate variation tested using the International Tundra Experiment. Philosophical Transactions of the Royal Society B: 368 (art. 20120481).

Olden JD, Lawler JJ, Poff NL. 2008. Machine learning methods without tears: A primer for ecologists. Quarterly Review of Biology 83: 171–193.

Olofson V, Eastwood M. 2014. Big Data: What It Is, and Why You Should Care. International Data Corporation. White Paper no. 12.

Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, Klump J, Goebelbecker H-J, Gundlach J, Schirmbacher P, Dierolf U. 2013.

Making research data repositories visible: The re3data.org registry. PLOS ONE 8 (art. e78080).

Peters DP, Haystad KM, Cushing J, Tweedie C, Fuentes O, Villanueva-Rosales N. 2014. Harnessing the power of big data: Infusing the scien-tific method with machine learning to transform ecology. Ecosphere 5: 1–15.

Pollard A, Castillo L, Danalia L, Glauser M, eds. 2016. Whither Turbulence and Big Data in the 21st Century? Springer.

Porter J, et al. 2005. Wireless sensor networks for ecology. BioScience 55: 561–572.

Porter JH, Nagy E, Kratz TK, Hanson P, Collins SL, Arzberger P. 2009. New eyes on the world: Advanced sensors for ecology. BioScience 59: 385–397.

Richardson AD, Jenkins JP, Braswell BH, Hollinger DY, Ollinger SV, Smith M-L. 2007. Use of digital webcam images to track spring green-up in a deciduous broadleaf forest. Oecologia 152: 323–334.

Ripperger S, Josic D, Hierold M, Koelpin A, Weigel R, Hartmann M, Page R, Mayer F. 2016. Automated proximity sensing in small vertebrates: Design of miniaturized sensor nodes and first field tests in bats. Ecology and Evolution 6: 2179–2189.

Rue H, Riebler A, Sørbye SH, Illian JB, Simpson DP, Lindgren FK. 2017. Bayesian computing with INLA: A review. Annual Review of Statistics and Its Application 4: 395–421.

Sakaki T, Okazaki M, Matsuo Y. 2013. Tweet analysis for real-time event detection and earthquake reporting system development. IEEE Transactions on Knowledge and Data Engineering 25: 919–931.

Sauer JR, Pardieck KL, Ziolkowski DJ Jr, Smith AC, Hudson M-AR, Rodriguez V, Berlanga H, Niven DK, Link WA. 2017. The first 50 years of the North American Breeding Bird Survey. Condor 119: 576–593.

Schadhauser M, Robert J, Heuberger A. 2016. Concept for an adaptive low power wide area (LPWA) bat communication network. Pages 1–9 in Proceedings of Smart SysTech 2016, the European Conference on Smart Objects, Systems, and Technologies. VDE.

Schatz MC, Langmead B, Salzberg SL. 2010. Cloud computing and the DNA data race. Nature Biotechnology 28: 691–693.

Schimel DS, Asner GP, Moorcroft P. 2013. Observing changing eco-logical diversity in the Anthropocene. Frontiers in Ecology and the Environment 11: 129–137.

Schimel D, Pavlick R, Fisher JB, Asner GP, Saatchi S, Townsend P, Miller C, Frankenberg C, Hibbard K, Cox P. 2015. Observing terrestrial eco-systems and the carbon cycle from space. Global Change Biology 21: 1762–1776.

Schmidt D, Chen W-C, Matheson MA, Ostrouchov G. 2017. Programming with BIG data in R: Scaling analytics from one to thousands of nodes. Big Data Research 8: 1–11.

Schnase JL, Duffy DQ, Tamkin GS, Nadeau D, Thompson JH, Grieg CM, McInerney MA, Webster WP. 2017. MERRA analytic services: Meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service. Computers, Environment, Urban Systems 61: 198–211.

Servilla M, Brunt J, Costa D, McGann J, Waide R. 2016. The contribu-tion, reuse of LTER data in the Provenance Aware Synthesis Tracking Architecture (PASTA) data repository. Ecological Informatics 36: 247–258.

Skytland N. 2012. What is NASA doing with big data today? NASA. (31 May 2018; https://open.nasa.gov/blog/what-is-nasa-doing-with-big-data-today)

Smith M, Broniatowski DA, Paul MJ, Dredze M. 2016. Towards real-time measurement of public epidemic awareness: Monitoring influenza awareness through Twitter. Paper presented at the Association for the Advancement of Artificial Intelligence (AAAI) Spring Symposium on Observational Studies through Social Media and Other Human-Generated Content; 21–23 March 2016, Stanford, California.

Snijders C, Matzat U, Reips U-D. 2012. “Big Data”: Big gaps of knowledge in the field of Internet science. International Journal of Internet Science 7: 1–5.

Dow



Overview Articles


Soberón J, Peterson T. 2004. Biodiversity informatics: Managing, apply-ing primary biodiversity data. Philosophical Transactions of the Royal Society B 359: 689–698.

Soranno PA, Cheruvelil KS, Elliott KC, Montgomery GM. 2014. It’s good to share: Why environmental scientists’ ethics are out of date. BioScience 65: 69–73.

Stein LD. 2010. The case for cloud computing in genome informatics. Genome Biology 11 (art. 207).

Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. 2015. Big data: Astronomical or genomical? PLOS Biology 13 (art. e1002195).

Stribling JB, Pavlik KL, Holdsworth SM, Leppo EW. 2008. Data quality, performance, and uncertainty in taxonomic identification for biological assessments. Journal of the North American Benthological Society 27: 906–919.

Sullivan BL, Wood CL, Iliff MJ, Bonney RE, Fink D, Kelling S. 2009. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation 142: 2282–2292.

Teal TK, Cranston KA, Lapp H, White E, Wilson G, Ram K, Pawlik A. 2015. Data carpentry: Workshops to increase data literacy for researchers. International Journal of Data Curation 10: 135–143.

Teitzel G. 2014. Harnessing the power of omics in microbiology. Trends in Microbiology 22: 227–228.

Theodoridis S. 2015. Machine Learning: A Bayesian and Optimization Perspective. Academic Press.

Towns J, et al. 2014. XSEDE: Accelerating scientific discovery. Computing in Science and Engineering 16: 62–74.

Tulloch AIT, Possingham HP, Joseph LN, Szabo J, Martin TG. 2013. Realising the full potential of citizen science monitoring programs. Biological Conservation 165: 128–138.

[USGS] US Geological Service. 2017. National Water Information System Data Available on the World Wide Web (USGS Water Data for the Nation). USGS. (28 August 2017; https://waterdata.usgs.gov/nwis)

Van Oldenborgh GJ, van der Wiel K, Sebastian A, Singh R, Arrighi J, Otto F, Haustein K, Li S, Vecchi G, Cullen H. 2017. Attribution of extreme rainfall from Hurricane Harvey, August 2017. Environmental Research Letters 12 (art. 124009).

Walther M, Kaisser M. 2013. Geo-spatial event detection in the Twitter stream. Pages 356–367 in Serdyukov P, Braslavski P, Kuznetsov SO, Kamps J, Rüger S, Agichtein E, Segalovich I, Yilmaz E, eds. Advances in Information Retrieval. Springer.

Ward JS, Barker A. 2013. Undefined by data: A survey of big data defini-tions. arXiv, Cornell University Library. (31 May 2018; https://arxiv.org/abs/1309.5821)

Wilkinson MD, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3 (art. 160018).

Williams JW, et al. 2018. The Neotoma Paleoecology Database: A multi-proxy, international community-curated data resource. Quaternary Research 89: 156–177.

Williams PJ, Hooten MB. 2016. Combining statistical inference and deci-sions in ecology. Ecological Applications 26: 1930–1942.

Wilson G. 2014. Software carpentry: Lessons learned. F1000Research 3 (art. 64).

Wing MG, Eklund A, Kellogg LD. 2005. Consumer-grade global positioning system (GPS) accuracy and reliability. Journal of Forestry 103: 169–173.

Xu J. 2006. Microbial ecology in the age of genomics and metagenomics: Concepts, tools, and recent advances. Molecular Ecology 15: 1713–1731.

Yang C, Huang Q. 2013. Spatial Cloud Computing: A Practical Approach. CRC Press.

Yang C, Raskin R, Goodchild M, Gahegan M. 2010. Geospatial cyberin-frastructure: Past, present, future. Computers, Environment and Urban Systems 34: 264–277.

Scott Farley recently completed his MSc in Geography at the University of Wisconsin-Madison and specializes in geovisualization, scientific data ser-vices, and cloud computing. Dr. Andria Dawson is a mathematical and statis-tical ecologist at Mount Royal University interested in developing and applying statistical methods to ecological data to infer ecosystem change. Simon Goring (http://goring.org) is a data scientist and paleoecologist at the University of Wisconsin-Madison serving as the IT lead for the Neotoma Paleoecology Database and on the EarthCube (http://earthcube.org) Leadership Council. John (Jack) Williams is a paleoecologist, biogeographer, and earth-system sci-entist at the University of Wisconsin-Madison studying the responses of species and communities to past and present environmental change.

Dow



Situating Ecology as a Big-Data Science - Oxford Academic

Documents

Transcript of Situating Ecology as a Big-Data Science - Oxford Academic