Accelerating wildfire susceptibility mapping through GPGPU


J. Parallel Distrib. Comput. 73 (2013) 1183–1194


Salvatore Di Gregorio a, Giuseppe Filippone a, William Spataro a, Giuseppe A. Trunfio b,∗

a Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
b Department of Architecture, Planning and Design, University of Sassari, P.zza Duomo, 6, 07041 Alghero (SS), Italy

Highlights

• We illustrate several GPGPU strategies for building Burn Probability Maps.
• Using three recent GPUs, we discuss an application example on a real area.
• We show how GPGPU allows mapping the wildfire hazard level in low computing time.
• Some of the proposed approaches achieved a significant parallel speedup.

Article info

Article history:
Received 22 September 2012
Received in revised form 29 January 2013
Accepted 22 March 2013
Available online 29 March 2013

Keywords:
GPGPU
Cellular automata
Wildfire simulation
Wildfire susceptibility
Hazard maps

Abstract

In the field of wildfire risk management, the so-called burn probability maps (BPMs) are increasingly used with the aim of estimating the probability of each point of a landscape being burned under certain environmental conditions. Such BPMs are usually computed through the explicit simulation of thousands of fires using fast and accurate models. However, even adopting the most optimized algorithms, the building of simulation-based BPMs for large areas results in a highly intensive computational process that makes the use of high-performance computing mandatory. In this paper, General-Purpose Computation with Graphics Processing Units (GPGPU) is applied, in conjunction with a wildfire simulation model based on the Cellular Automata approach, to the process of BPM building. Using three different GPGPU devices, the paper illustrates several implementation strategies to speed up the overall mapping process and discusses some numerical results obtained on a real landscape.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Among the most recent and useful tools to support wildfire hazard management are the so-called burn probability maps (BPMs), which provide an estimate of the probability of a point in a landscape being burned under certain environmental conditions. Since the many factors that influence fire behaviour interact nonlinearly to determine the hazard level, models for simulating wildfire spread are increasingly being used to build BPMs [24,14,1]. In particular, the typical approach is based on a Monte Carlo procedure in which a high number of simulations is carried out, under different weather scenarios and ignition locations [14]. In order to obtain reliable results in reasonable time, such an approach must be based on fast and accurate simulation models operating on high-quality, high-resolution remote sensing data (e.g. Digital Elevation Models, vegetation description, etc.). Among the different wildfire simulation techniques [42], those based on Cellular Automata (CA) [31,32,44,46,35] represent an ideal approach to build a BPM. This is because they provide accurate results and can often perform the same simulations in a fraction of the run time taken by different methods [35].

∗ Corresponding author.
E-mail addresses: [email protected] (S. Di Gregorio), [email protected] (G. Filippone), [email protected] (W. Spataro), [email protected] (G.A. Trunfio).

0743-7315/$ – see front matter © 2013 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.jpdc.2013.03.014

However, because of the required high number of explicit fire propagations, even using the most optimized algorithm, the building of simulation-based BPMs often results in a highly intensive computational process. This is particularly true when the BPM involves a very large area, for which determining a high-resolution BPM can often be infeasible using standard computing facilities.

In recent years, while the computational needs of such sophisticated risk-assessment procedures have been increasing, so has the availability of high-performance computers. In particular, General-Purpose computing on Graphics Processing Units (GPGPU) has recently emerged, in which multicore Graphics Processing Units (GPUs) perform computations traditionally carried out by the CPU.

In this paper, GPGPU is applied to the process of BPM computation in conjunction with a CA-based wildfire simulation model. The main contribution of this study consists of discussing several parallelization strategies for building BPMs, which correspond to different performances in terms of computing time. In particular, the proposed approaches are empirically investigated using three GPGPU devices and a test case concerning a real Mediterranean landscape. More generally, the article provides some useful guidelines for the simultaneous simulation of a large number of wildfires using GPGPU. Besides the construction of BPMs, the results of this study may also be useful for carrying out extensive sensitivity analyses of model parameters or for the fast optimization of risk-mitigation interventions [2] and fire-fighting activities [29].

Fig. 1. Growth of the ellipse γ locally representing the fire front. The symbol ρ denotes the forward spread, which is incremented by ∆ρ at the ith time step.

The paper is organized as follows. The next section outlines the main characteristics of the adopted CA simulation model and illustrates some details of the typical approach for BPM computation. Then, in Section 3 some relevant elements of the adopted GPGPU approach are presented. In Section 4 the proposed parallel approaches are described and some of their computational characteristics empirically investigated. The paper ends with Section 5, in which the main findings are summarized and future work is outlined.

2. A CA for wildfire simulation and risk assessment

A classical homogeneous CA can be viewed as a uniform lattice of cells representing identical finite automata, hence endowed with a state and a transition function. Each cell takes as input its state and the states of a set of neighbouring cells defined by a geometrical pattern which is invariant in time and space. Starting from an initial condition, the CA evolves by activating, on a step-by-step basis, all the identical transition functions simultaneously.

In order to effectively use the CA dynamics for the simulation of spatially complex systems, many extensions of its original definition have been proposed in the literature. Typically, the classical CA paradigm was modified to overcome some implicit limits, such as having few states, look-up table transition functions and invariant neighbourhoods [21,11,43,12,13]. The model described in the following, as well as the formalism used, are based on one of such extended CA notions, namely on the Macroscopic Cellular Automata approach introduced in [21] and already exploited for the simulation of many macroscopic phenomena [22,44,7,38,41,20].

As in most wildfire spread simulators [45,32,25,29,35], the approach adopted in this study is based on Rothermel's fire model [39,40], which provides the heading rate and direction of spread given the local landscape and wind characteristics. An additional constituent is the commonly assumed elliptical description of the spread under homogeneous conditions (i.e. spatially and temporally constant fuels, wind and topography) [4].

In order to mitigate the accuracy problems that affect most raster-based wildfire simulators [27,25,30,35], the CA adopted in this paper extends the size of the commonly used Moore's neighbourhood and adopts an initial randomization of the spread directions. As shown in [8], such an approach, which does not alter the deterministic nature of the model, provides relevant beneficial effects on the overall accuracy.

In more detail, the two-dimensional fire propagation is locally obtained by a growing ellipse having the semi-major axis along the direction of maximum spread, the eccentricity related to the intensity of the so-called effective wind, and one focus acting as a 'fire source' [44,35]. At each CA step, the ellipse's size is increased according to both the duration of the time step and the maximum rate of spread (see Fig. 1). Afterwards, a neighbouring cell invaded by the growing ellipse is considered as a candidate to be ignited by the spreading fire. In case of ignition, a new ellipse is generated according to the amount of overlapping between the invading ellipse and the ignited cell. Formally, the model is a two-dimensional CA with square cells defined as:

CA = ⟨K, Q, N, S, P, η, ψ, φ⟩ (1)

where:

– K is the set of points in the finite region where the phenomenon evolves. Each point represents the centre of a square cell;
– Q is a set of randomized local sources (RLSs) [8], one point for each cell; they are randomly generated at the beginning of the simulation within an assigned small radius from each of the centres in K. As detailed later, a new ignition in a cell consists of a new ellipse having its rear focus on the local source q ∈ Q;
– N is the set that identifies the pattern of cells influencing the cell state change (i.e. the neighbourhood);
– S is the finite set of the states of the cell, defined as the Cartesian product of the sets of all the cell's substates;
– P is the finite set of global parameters, including those that define the fuel bed characteristics according to the standard fuel models used in BEHAVE [6];
– η : S|N| → S is the transition function accounting for the fire ignition, spread and extinction mechanisms;
– ψ : S|K| → R is a function that determines the size ∆t of each time step according to both the digital terrain model and the maximum spread rate among the cells on the current fire front. The value of ∆t is then used by the function φ : R → R for keeping the current time t up to date.
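As an illustration of how the set Q of randomized local sources can be generated once before the simulation starts (keeping the model deterministic under a fixed seed), the following Python sketch may help; the function name, parameters and rejection-sampling choice are mine, not the authors':

```python
import random

def generate_rls(centres, radius, seed=42):
    """Generate one randomized local source (RLS) per cell, uniformly
    distributed within a disc of the given radius around each cell
    centre. A fixed seed keeps the CA deterministic across runs."""
    rng = random.Random(seed)
    sources = []
    for (cx, cy) in centres:
        # rejection sampling of a uniform point inside the disc
        while True:
            dx = rng.uniform(-radius, radius)
            dy = rng.uniform(-radius, radius)
            if dx * dx + dy * dy <= radius * radius:
                break
        sources.append((cx + dx, cy + dy))
    return sources
```

For example, with 40 m cells one could use a radius of a few metres so that each source stays well inside its cell.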

The cell's substates include all the local quantities used by the transition function for modelling the local interactions between the cells (i.e. the fire propagation to neighbouring cells) as well as its internal dynamics (i.e. the fire ignition and growth). In particular, among the substates that define the state of each cell, there are:

– the altitude z ∈ R of the cell;
– the fuel model µ ∈ N, which is an index referring to one of the mentioned standard models that specify the characteristics of vegetation relevant to Rothermel's equations;
– the combustion state σ ∈ Sσ, which takes one of the values 'unburnable', 'burnable', 'ignited' and 'burnt';
– the accumulated forward spread ρ ∈ R≥0, that is, the current distance between the focus f of the local ellipse and the farthest point on the semi-major axis (see Fig. 1);
– the angle θ ∈ R (see Fig. 1), giving the direction of the maximum rate of spread, obtained through the composition of two vectors, namely the so-called wind effect and slope effect [39];
– the maximum rate of spread r ∈ R≥0, also provided by Rothermel's equations on the basis of the relevant local characteristics [39];
– the eccentricity ε ∈ [0, 1] of the ellipse γ representing the local fire front, which is obtained as a function of both the wind and terrain slope through the empirical relation proposed in [5,25].

In brief, the scheduling of each CA step is organized as follows:

1. first, the global function ψ computes the current duration of the time step ∆t and the function φ updates the current time t;
2. afterwards, the transition function η is executed for each cell of the automaton. This involves spreading the fire to the neighbouring cells, during the time interval ∆t, according to the algorithm described below;
3. finally, if t is less than a final time tf a new step is executed, otherwise the simulation ends.
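The three-phase scheduling above can be sketched as a simple driver loop. This is an illustrative Python skeleton, not the authors' implementation: `psi` and `eta` are stand-ins for the model's ψ and η, and the in-loop clock update plays the role of φ:

```python
def run_simulation(cells, psi, eta, t_final):
    """Driver loop for the CA: compute the adaptive time step (psi),
    advance the clock (the role of phi), apply the transition
    function (eta) to every cell, until the final time is reached."""
    t = 0.0
    steps = 0
    while t < t_final:
        dt = psi(cells)        # step 1: global, non-local time step
        t += dt                # phi keeps the current time up to date
        for c in cells:        # step 2: transition function per cell
            eta(c, dt)
        steps += 1             # step 3: loop until t >= t_final
    return t, steps
```

In the real model ψ inspects the fire front and the terrain, and η implements Algorithm 1; here both are arbitrary callables so the control flow can be tested in isolation.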


Fig. 2. The ith neighbouring cell intersected by the ellipse γ locally representing the fire front.

Fig. 3. The adopted extended neighbourhood N composed of 25 cells, together with an example of RLSs inside each cell.

According to the transition function η outlined in Algorithm 1, if the cell is not in the 'burning' state no further calculation is performed. Otherwise, in the case of a burning cell, the first step of η consists of checking the condition that triggers the transition to the 'burnt' state. The latter is verified when none of the neighbouring cells are in the 'burnable' state, that is, when the cell's contribution is no longer necessary to the fire spread mechanism. Then, if the cell still belongs to the fire front, η updates the size of the local ellipse γ (line 5).

The next statement of η consists of testing whether the fire is spreading towards other cells ci of the neighbourhood that are in the 'burnable' state (lines 6–8). Such a spread test is carried out by checking whether γ includes the RLS qi of the cell ci (see Fig. 2). If qi is inside γ, then a new ellipse γi is generated for the cell ci, having the RLS qi assumed as rear focus and ri, θi and εi computed through the proper model equations [39,25,34]. Further details of the model, including some of the relevant model equations, can be found in [8].

Algorithm 1: Cell's transition function η.

 1  if σ = 'burning' then
 2      if none of the neighbours are in the 'burnable' state then
 3          σ ← 'burned';
 4          return;
 5      ρ ← ρ + r ∆t;
 6      foreach cell ci in the neighbourhood do
 7          if the substate σi of ci is 'burnable' then
 8              if the cell ci is reached by γ then
 9                  Compute ri and θi;
10                  if ri > 0 then
11                      σi ← 'ignited';
12                      Compute εi;
13                      Compute the current local spread ρi;
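The geometric core of the spread test (line 8 of Algorithm 1) is checking whether a neighbour's RLS falls inside the growing ellipse γ. Under the paper's convention (rear focus at the local source, forward spread ρ measured from the rear focus to the far vertex, so ρ = a + c = a(1 + ε)), the test can be sketched as follows; the function name and argument layout are mine:

```python
import math

def inside_ellipse(p, q, theta, rho, eps):
    """True if point p lies inside the ellipse with rear focus q,
    major-axis direction theta, forward spread rho (rear focus to
    far vertex) and eccentricity eps (0 <= eps < 1)."""
    a = rho / (1.0 + eps)               # semi-major axis: rho = a(1 + eps)
    c = eps * a                         # focus-to-centre distance
    b = a * math.sqrt(1.0 - eps * eps)  # semi-minor axis
    ux, uy = math.cos(theta), math.sin(theta)
    # the centre lies ahead of the rear focus along theta
    cx, cy = q[0] + c * ux, q[1] + c * uy
    dx, dy = p[0] - cx, p[1] - cy
    # coordinates of p in the ellipse's own frame
    x = dx * ux + dy * uy
    y = -dx * uy + dy * ux
    return (x / a) ** 2 + (y / b) ** 2 <= 1.0
```

With ε = 0 the test degenerates to a circle of radius ρ around q, which matches the windless, flat-terrain case.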

As mentioned above, the CA model is based on the extended Moore's neighbourhood composed of 25 cells represented in Fig. 3. Also, the use of the RLSs inside each cell allows for obtaining a high number of different spread directions during the fire propagation in a landscape, thus significantly improving the accuracy of the results [8]. However, it is important to note that since the RLSs are generated only once before the beginning of the simulation, the adopted CA model is deterministic.

Clearly, a higher number of neighbouring cells corresponds to a higher computational cost of the CA step. Nevertheless, the model can also be used with the standard Moore's neighbourhood composed of 9 cells, still obtaining acceptable accuracies thanks to the adopted RLSs [8].

As stated before, the size ∆t of each time step is determined on a non-local basis by the global function ψ. In general, large values of ∆t speed up the simulation because a lower number of steps is required for reaching the final time. However, small values of ∆t help to decrease the accumulation of errors during the spread mechanism. Therefore, the main aims of ψ are: (i) to ensure that at least one new ignition will take place during the next step; (ii) to guarantee that the ellipse from the cell having the highest rate of spread does not go beyond any of its neighbouring cells. In practice, at each CA step the function ψ operates as follows. For each burning cell c (i.e. belonging to the current fire front) and for each of its neighbours ci in the 'burnable' state, the time required by the fire to reach the RLS qi of ci is computed as:

∆ti = (|qi − q| − ρi) / r (2)

where q and r are the RLS and the rate of spread of c, respectively, and ρi is the current spread along the vector qi − q. Then, to generate at least one ignition at the next step, the value of ∆t can be defined as the minimum among all the ∆ti defined by Eq. (2) [32,35]. However, the model also allows slightly higher values so as to speed up the simulations with a low impact on the overall accuracy.
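The minimum rule over Eq. (2) can be sketched as follows; the front representation (a list of burning cells, each carrying its RLS, its spread rate and its burnable neighbours) is an illustrative choice of mine, not the authors' data structure:

```python
import math

def time_step(front):
    """Global function psi, sketched: for each burning cell (RLS q,
    rate of spread r) and each of its burnable neighbours (RLS q_i,
    current spread rho_i along q_i - q), compute Eq. (2),
    dt_i = (|q_i - q| - rho_i) / r, and return the minimum positive
    value so that at least one ignition occurs during the next step."""
    best = math.inf
    for q, r, neighbours in front:
        for q_i, rho_i in neighbours:
            dist = math.dist(q, q_i)  # 3D points if altitude is included
            dt_i = (dist - rho_i) / r
            if 0.0 < dt_i < best:
                best = dt_i
    return best
```

As the paper notes, on a sloping terrain q and q_i should be three-dimensional points, which `math.dist` handles unchanged.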

Note that, to account for a sloping terrain, in Eq. (2) as well as in Algorithm 1 the altitude of each cell must be considered (i.e. when computing distances, all the involved vectors must be viewed in the three-dimensional Cartesian space).

Many important optimizations of the procedure outlined above can be implemented. The most typical consists of using a suitable dynamic data structure that allows operating only on the cells belonging to the current fire front (i.e. for which σ = 'burning'). Moreover, in the case of repeated simulations with stationary weather conditions and different points of ignition, many substates could be pre-computed once (e.g. lines 9 and 12 of Algorithm 1). Other optimizations could regard memory usage: for example, it is easy to combine σ and ρ into a single substate. Nevertheless, as shown later in the paper, the same enhancements can be effectively adopted also in the parallel implementation.

2.1. A simulation-based approach for building BPMs

In recent years, the use of hazard maps based on the explicit simulation of natural phenomena has been increasingly investigated as an effective and reliable tool for supporting risk management [24,38,14,1,37].

In the case of wildfire, the most general approach for computing a BPM on a landscape [24,14,33,1,3] consists of a Monte Carlo method in which a high number of different fire spread simulations are carried out, sampling from suitable statistical distributions the random variables relevant to the fire behaviour. In the spirit of the method, the simulated fires represent independent events that may occur in the future. At the end of the process, the local risk is computed on the basis of the frequency of burning.


BPMs usually focus on extreme weather conditions (i.e. wind and fuel moisture content), since these can lead to large fire events that are harder to suppress and correspond to the highest risk. Also, the typical duration of the simulated fires is relatively small (e.g. a few hours), since under extreme conditions even small durations can produce large burned areas. Then, another common assumption is that of temporally constant weather conditions, which leads to acceptable results in case of short burning times [24,33,1,3].

The technique for computing the BPMs adopted in this study is based on a prefixed number nf of simulation runs, where each run represents a single simulated fire. A regular grid of ignition locations is adopted, which corresponds to the assumption of a uniform ignition probability for each point of the landscape. The adopted temporally-constant weather scenario reflects severe conditions for the area under study. Also, all the simulated fires have the same duration, which is selected considering that of historical fires.
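Building the regular grid of ignition locations, with unburnable cells filtered out as in the case study of Section 4.1, can be sketched as follows (the helper and its parameters are hypothetical):

```python
def ignition_grid(rows, cols, stride, burnable):
    """Return the (i, j) indices of a regular grid of ignition points
    with the given stride over a rows x cols landscape, keeping only
    the locations for which the burnable predicate is true."""
    return [(i, j)
            for i in range(0, rows, stride)
            for j in range(0, cols, stride)
            if burnable(i, j)]
```

Each returned index would then seed one of the nf independent fire simulations.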

Once the simulations have been carried out, the resulting nf maps of burned areas are overlaid and the cells' fire frequencies are used for the computation of the fire risk. In particular, a burn probability pb(c) for each cell c is computed as:

pb(c) = f(c) / nf (3)

where f(c) is the number of times the cell c is ignited during the nf simulated fires. The burn probability for a given cell is an estimate of the likelihood that the cell will burn given a single random ignition within the study area and given the assumed conditions in terms of fire duration, fuel moisture and weather.
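Eq. (3) amounts to overlaying the nf binary burned-area maps and normalizing the per-cell frequency; a minimal sketch using plain nested lists (the representation is mine, chosen for clarity):

```python
def burn_probability_map(burn_maps):
    """Each burn map is a 2D list of 0/1 flags (1 = cell burned in that
    simulated fire). Returns p_b(c) = f(c) / n_f for every cell c,
    where f(c) counts the fires that burned the cell, as in Eq. (3)."""
    n_f = len(burn_maps)
    rows, cols = len(burn_maps[0]), len(burn_maps[0][0])
    freq = [[0.0] * cols for _ in range(rows)]
    for bm in burn_maps:                 # overlay the n_f burned areas
        for i in range(rows):
            for j in range(cols):
                freq[i][j] += bm[i][j]
    return [[f / n_f for f in row] for row in freq]
```

In production one would of course accumulate the frequencies incrementally as each simulation finishes rather than keep all nf maps in memory.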

According to the procedure described above, the number nf of simulation runs depends on the resolution of the grid of ignition points. In general, it is not necessary to simulate a wildfire for each of the cells in the automaton. In fact, considering the usual resolution of landscape data, ignitions on adjacent cells produce very similar fire shapes. Nevertheless, as shown in the application example discussed later, the number of fire simulations needed for achieving a good BPM accuracy can be considerably high in the case of study areas with great extensions.

3. Parallel Computing with CUDA

A natural approach to dealing with the high computational effort related to the construction of BPMs is the use of parallel computing. Among the different parallel architectures and computational paradigms, in recent years GPGPU has attracted the interest of many researchers.

The particular GPGPU platform used in this paper is the one provided by nVidia, which consists of a group of Streaming Multiprocessors (SMs) in a single device. Each SM can support a limited number of co-resident concurrent threads, which share the SM's memory resources. Furthermore, each SM consists of multiple Scalar Processor (SP) cores.

In order to program the GPU, this study adopts the C-language Compute Unified Device Architecture (CUDA), a programming model introduced in 2006 by nVidia Corporation for their GPUs [17]. In a typical CUDA program, sequential host instructions are combined with parallel GPGPU code. The idea underlying this approach is that the CPU organizes the computation (e.g. in terms of data pre-processing), sends the data from the computer main memory to the GPU global memory and invokes the parallel computation on the GPU device. After, and/or during, the computation, the computed results are copied back into the main memory for post-processing and output purposes. In some cases, the computing scheme outlined above can be enhanced by overlapping the CPU and GPU computation, as well as overlapping memory copying with computation [17,18].

In CUDA, the GPU activation is obtained by writing device functions in C language, which are called kernels. When a kernel is issued by the CPU, a number of threads (e.g. typically several thousands) execute the kernel code in parallel on different data.

According to the nVidia approach to GPGPU, threads are grouped into blocks, each executed on a single SM. The latter can handle multiple blocks in a time-sliced fashion. In particular, threads in a block are scheduled in groups called warps. All threads in a warp simultaneously execute the same instruction. However, in the case of conditional branches, threads can take different paths. In such a scenario of divergent threads, the instructions on the taken branches are executed serially. Hence, to improve the computational performance, one of the key factors in CUDA consists of minimizing the occurrence of branch divergence [17].

From a programmer's point of view, it is of a certain relevance to know that the GPU can access different types of memory. For example, a certain amount of fast shared memory (which can be used for some limited intra-block communication between threads) can be assigned to each thread block. Also, all threads can access a slower but larger global memory which is on the device board but outside the computing chip. The device global memory is slower compared with the shared memory, but it can deliver significantly (e.g. one order of magnitude) higher memory bandwidth than traditional host memory (i.e. the main computer memory). The latter is typically linked to the GPU card through a relatively slow bus. As a result, the parallel computation should be organized in such a way as to minimize data transfers between the host and the device.

Even though the GPU's global memory offers a considerably higher memory bandwidth compared to the CPU's host memory, it still requires hundreds of clock cycles to start the fetch of a single element. However, the massively threaded architecture of the GPU is used to hide such memory latencies by rapid switching between threads: when a thread stalls on a memory fetch, the GPU switches to a different thread. To fully exploit such a latency-hiding strategy, it is beneficial to use a large number of blocks [17].

In the following, the concepts outlined above are applied to different GPGPU-based parallelizations of the procedure for building BPMs described in Section 2.1.

4. GPGPU-based wildfire risk mapping

The CA approach is known as one of the most typical parallel computing paradigms. In fact, the whole system is composed of a set of independent cells, which are influenced only by their neighbours. This allows for: (i) computing the next state of all the cells in parallel; (ii) accessing only the current neighbours' states during each cell's update, thereby giving the chance to increase the efficiency of memory accesses.

As in the sequential case, the typical CA parallel implementation involves two memory regions, which will be called CAcur and CAnext, representing the current and next states of the cells respectively. For each CA step, the neighbouring values from CAcur are read by the local transition function, which performs its computation and writes the new state value into the appropriate element of CAnext.

According to the recent literature in the field [23,9,20], in the GPGPU parallel implementation of the CA-based procedure illustrated above, most of the automaton data (i.e. both the CAcur and CAnext memory areas) was stored in the GPU global memory. This involves: (i) initializing the current state through a CPU–GPU copy operation (i.e. from host to device global memory) before the beginning of the simulation and (ii) retrieving the final state of the automaton at the end of the simulation through a GPU–CPU copy (i.e. from device global memory to host memory). Also, at the end of each CA step a device-to-device memory copy operation is used to re-initialize the CAcur values with the CAnext values.
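The CAcur/CAnext double-buffering scheme can be sketched in a few lines; this is a sequential Python illustration of the data flow only (on the device the per-cell loop runs as parallel threads and the final copy is the device-to-device operation mentioned above):

```python
def ca_run(state, transition, neighbours, steps):
    """Double-buffered CA update: read neighbour states from cur,
    write each cell's new state into nxt, then copy nxt back into
    cur (the role of the device-to-device copy on the GPU)."""
    cur = list(state)
    nxt = list(state)
    for _ in range(steps):
        for i in range(len(cur)):
            # every cell reads only the current (old) states, so the
            # update order does not matter and cells are independent
            nxt[i] = transition(cur, i, neighbours(i, len(cur)))
        cur[:] = nxt  # re-initialize CAcur with CAnext
    return cur
```

A toy one-dimensional "fire" in which a cell ignites when any neighbour is burning shows the expected front propagation of one cell per step in each direction.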

Fig. 4. The landscape under study: an 18 km × 18 km area in Ligury, Italy. Land-cover classes were derived from the CORINE data.

In order to speed up access to memory, the automaton data in device global memory should be organized in such a way as to allow coalesced access. To this purpose, a best practice recommended by nVidia is to use structures of arrays rather than arrays of structures in organizing the memory storage of cells' properties [18]. Thus, an array with the size corresponding to the total number of cells was allocated in the CPU memory for each of the CA substates. All such arrays were then mirrored in the GPU together with some additional auxiliary arrays (e.g. for storing the neighbourhood structure and the model parameters).
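The contrast between the two layouts can be made concrete with a small sketch (plain Python stand-ins for the C arrays; names are illustrative). With the structure-of-arrays layout, the values of one substate for consecutive cells are contiguous, which is what lets consecutive GPU threads coalesce their loads:

```python
# Array of structures: one record per cell. A warp reading the
# substate 'rho' for consecutive cells touches strided locations.
cells_aos = [{"z": 100.0, "rho": 0.0, "state": "burnable"},
             {"z": 101.0, "rho": 0.5, "state": "burning"}]

# Structure of arrays: one array per substate, mirroring the layout
# adopted for the GPU global memory; all 'rho' values are contiguous.
cells_soa = {
    "z":     [100.0, 101.0],
    "rho":   [0.0, 0.5],
    "state": ["burnable", "burning"],
}

def rho_of(soa, i):
    """Reading substate 'rho' of cell i is one contiguous-array access."""
    return soa["rho"][i]
```

In the CUDA code each of these per-substate arrays would be a separate device allocation, so thread i reading `rho[i]` and thread i+1 reading `rho[i+1]` hit adjacent addresses.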

According to the CA transition function described in Section 2, all of the neighbouring cells have to access each and every other substate. As a result, a CA step involves repeated accesses to the same portions of the global memory. In order to lower the number of such costly operations, the faster shared memory could be exploited to cache the CA substates needed by all the threads of the same block [20]. However, as shown in [9], for recent nVidia hardware (i.e. since the advent of the Fermi architecture) a reasonable option is the automatic exploitation of the caching levels for global memory access (L1 and L2). For this reason, instead of using the shared memory, in this study the parallel CA transition function exploited the specific CUDA option for increasing the L1 cache size.

A key step in the parallelization of a sequential code for the GPU architecture according to the CUDA approach consists of identifying all the sets of instructions (e.g. the CA transition function) that can be executed independently of each other on different elements of a dataset (e.g. on the different cells of the automaton). As mentioned in Section 3, such sequences of instructions are grouped in CUDA kernels, each transparently executed in parallel by GPU threads operating on different data (e.g. the cells of the automaton).

However, one of the issues raised by the GPGPU parallelization that is the object of this study is related to the fact that, in the whole automaton, only the cells belonging to the current fire front perform actual computation. Hence, launching one thread for each of the automaton cells would result in a certain waste of the GPU computational power.

To cope with this and other critical aspects of the computational problem under study, several parallel approaches have been developed. In the following, the different alternative strategies are described in detail. Also, using a significant test case and three recent GPU devices, their computational performances are empirically investigated.

4.1. Case study and GPU devices

The parallel algorithms for building BPMs were applied to an area of the Liguria region (Italy) that is historically characterized by a high frequency of serious wildfires. The landscape, shown in

Fig. 4, was modelled through a Digital Elevation Model composed of 461 × 445 square cells with a side of 40 m. In the area, the terrain is relatively complex, with an altitude above sea level ranging from 0 to 250 m. The heterogeneous fuel bed data were based on the 1:25,000 land cover map from the CORINE EU project [15]. In particular, the land-cover codes were mapped on the standard fuel models used by the CA (i.e. the substate µ). Plausible values of fuel moisture content were obtained from literature data. Also, a domain-averaged open-wind vector from the North direction, having an intensity of 20 km h−1, was used for producing time-constant gridded winds through WindNinja [26], a computer program that simulates the effect of terrain on the wind flow. A duration of 10 h was adopted for all simulated fires. Over the area, a regular grid of 92 × 89 ignition points was superimposed, leading to 7391 fires to simulate (ignition points on unburnable areas were automatically excluded).

The BPM obtained at the end of the procedure described above, using a CA with a neighbourhood size |N| = 25, is shown in Fig. 5. It is worth noting that the use of a standard Moore neighbourhood (i.e. |N| = 9) still provides acceptable results, allowing us to save some computing time, as shown later in the paper. In particular, a pixel-by-pixel comparison between the maps obtained with the two different CA neighbourhoods revealed an average relative difference of about 12%, with a very similar distribution of the burn probabilities. This is mainly due to the RLS strategy described in Section 2, which increases the variety of spread directions compared to that imposed by the raster, while still leading to a deterministic model.

The next section describes the different parallelization strategies adopted for computing the BPM represented in Fig. 5. In particular, three CUDA devices were used in the experiments: a nVidia Tesla C2075 and two nVidia GeForce graphics cards, namely the GTX 580 and the GTX 680. The latter belongs to the newer nVidia Kepler GPU architecture, while the former two are endowed with the older Fermi GPUs. The Tesla C2075 was used in single-precision floating point and with ECC disabled. In Table 1 some relevant characteristics of the adopted devices are reported. Also, in order to quantify the achieved parallel speedup, sequential versions of the same algorithms parallelized for the GPU were run on a workstation equipped with an Intel Xeon X5660 (2.80 GHz) 6-core CPU.

4.2. The naïve implementation

As discussed above, building the BPM implies the computational problem of simulating nf fires, which represent independent random events within a future temporal horizon.


Fig. 5. The BPM obtained for the landscape under study using a CA with neighbourhood size |N| = 25. The map was computed through the simulation of 7391 independent fires.

Table 1
Some relevant characteristics of the adopted GPGPU hardware for all carried-out experiments. Note that GFLOPs refers to the theoretical peak performance in single-precision floating-point operations.

                     GTX 580   GTX 680   Tesla C2075
SM count             16        8         14
CUDA cores           512       1536      448
Clock rate [MHz]     1544      1006      1150
Bandwidth [GB/s]     192.4     192.3     144.0
GFLOPs               1581.1    3090.4    1030.4

In the most straightforward parallel implementation, labelled as WCAM in the following, the CUDA kernels operate on the whole automaton. However, during a wildfire simulation only the transition function of the currently burning cells performs significant computation. Therefore, simulating only one fire at a time would imply a high percentage of uselessly scheduled threads. In addition, given the small size of most fires, the number of active threads would be too low to allow the GPU to effectively activate the latency-hiding mechanism.

For these reasons, in the WCAM approach many fires are simultaneously simulated, until all the nf fires included in the uniform grid of ignitions have been simulated (see Section 2.1). In other words, the main CUDA kernel iterates over a number of independent fires which are propagated at the same CA step. Obviously, such an approach requires the use of an additional array for storing the combustion state σ of each cell of the automaton for each simultaneous fire.

Another computational advantage given by the simultaneous simulation of many fires is related to the smaller number of kernel invocations. In fact, since launching a CUDA kernel implies a certain overhead, increasing the amount of work performed in each kernel call is one of the typical CUDA optimization strategies.

Before starting the BPM construction, a pre-processing sequential phase takes place in which, for each cell ci, the maximum rate of spread ri, its direction θi and the local ellipse eccentricity εi (see Fig. 1) are computed using the proper model equations [39,5]. According to the algorithm outlined in Section 2, such pre-computed quantities determine, together with the landscape topography, the wildfire spread at the cell level. Moreover, during the pre-processing phase the maximum time-step size for each cell is computed and stored in an array in order to speed up the

Algorithm 2: The GPGPU procedure for building BPMs.
 1  nc ← 0;
 2  fn ← 0;
 3  while nc < nf do
 4      resetCA(CAcur);
 5      nc ← nc + setNewIgnitions(CAcur);
 6      t ← 0;
 7      while notTerminated(t, tf) do
 8          ΔT ← findStepSizesInParallel(CAcur);
 9          transitionFunctionInParallel(CAcur, CAnext, t, ΔT);
10          copyNextToCurrent(CAcur, CAnext);
11          t ← t + ΔT;
12      updateIgnitionFrequencyInParallel(CAcur, fn);
13  fn ← fn/nf;

time-step adaptation during the CA iterations. It is worth noting that such pre-processing makes sense because the weather conditions are considered stationary, which is a common assumption for computing BPMs, as discussed in Section 2.1.
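The control flow of Algorithm 2 can be sketched as follows (a sequential Python emulation, not the authors' code: the *InParallel functions stand for CUDA kernel launches, and the fire-spread model is stubbed so that each fire burns only its ignition cell):

```python
# Sequential sketch of Algorithm 2; in the real program each kernel-like
# step below is a CUDA launch operating on all cells (or fires) at once.

def build_bpm(ignitions, n_cells, t_final, batch=2):
    """ignitions: list of ignition cell indices; returns burn probabilities."""
    fn = [0] * n_cells                         # burn counts per cell (array fn)
    n_done = 0                                 # nc
    while n_done < len(ignitions):             # while nc < nf
        # setNewIgnitions: next cluster of spatially contiguous ignitions
        cluster = ignitions[n_done:n_done + batch]
        n_done += len(cluster)
        burned = [{cell} for cell in cluster]  # combustion substate, per fire
        t = [0.0] * len(cluster)               # one clock per simultaneous fire
        while any(ti < t_final for ti in t):   # notTerminated(t, tf)
            dt = [1.0] * len(cluster)          # findStepSizesInParallel (stub)
            for k in range(len(cluster)):      # transitionFunctionInParallel (stub)
                t[k] += dt[k]
        for k in range(len(cluster)):          # updateIgnitionFrequencyInParallel
            for cell in burned[k]:
                fn[cell] += 1
    return [c / len(ignitions) for c in fn]    # fn <- fn / nf
```

The stubbed transition function keeps the sketch short; the real kernel propagates each front according to the CA rules of Section 2.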

In the BPM computation phase, which is outlined in Algorithm 2, clusters of simultaneous fires are simulated until the entire area under study is covered (i.e. until all the nf fires are simulated). In particular, each cluster is composed of a block of fires originated by spatially contiguous ignition points. The latter are taken by the function setNewIgnitions from the regular grid of fire ignitions (line 5 of the algorithm). Note that the simultaneous simulation of fires originating from contiguous points favours exploiting the cache memory that equips modern GPUs. For each cluster of simultaneous fires, the CA steps are iterated until the current time for each fire (i.e. the vector t) reaches the final time tf (lines 7–11).

At the beginning of each CA step, the function findStepSizesInParallel (line 8) issues the CUDA kernel for dynamically adapting the time-step durations ΔT. Since this consists of finding the minimum of all allowed time-step sizes among the cells on the current fire fronts, such a kernel simply implements a standard parallel reduction (PR) algorithm. The latter exploits the fast GPU shared memory and was implemented according to the optimized examples included in the CUDA SDK. However, since the shared memory is a limited resource, in general it is not possible to perform the PR simultaneously (i.e. in the same kernel) on a high number of fires. For this reason, the time-step durations are computed separately, invoking for each fire the PR kernel, which operates on the whole automaton.
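The min-reduction performed by the PR kernel can be emulated sequentially as follows (a sketch of the tree-shaped reduction pattern, not the CUDA SDK code):

```python
import math

# Sequential emulation of the tree-shaped parallel reduction used to find
# the minimum allowed time step over the cells of the current fire front.
# On the GPU, each "round" below is what the threads of a block perform in
# lock-step through shared memory, halving the active range every time.

def parallel_min_reduction(step_sizes):
    vals = list(step_sizes)
    # pad to a power of two with +inf so the reduction tree is complete
    size = 1 << max(1, math.ceil(math.log2(len(vals))))
    vals += [math.inf] * (size - len(vals))
    stride = size // 2
    while stride > 0:
        for i in range(stride):          # done concurrently by the threads
            vals[i] = min(vals[i], vals[i + stride])
        stride //= 2
    return vals[0]
```

The padding value +inf is the identity of min, so it never affects the result.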

According to Algorithm 2, once the vector ΔT has been computed, the kernel implementing the fire propagation mechanism is activated by transitionFunctionInParallel on a grid corresponding to the entire CA (line 9). Such a kernel iterates over the fires that are simultaneously propagated (see Fig. 6 for the case of two simultaneous fires).

Subsequently, a device-to-device memory copy operation is used to re-initialize the CAcur values of the substates with the values of CAnext (line 10). It is worth noting that such a copy is limited to the only substate that is modified during the simulation, that is, the combustion state σ of each fire. At the end of the CA step, the current times t are updated according to the current time-step durations ΔT (line 11).

After each group of simulations, a further CUDA kernel is launched on the whole automaton to update an array fn in which each element represents the number of times that a cell has been burned since the beginning of the process (line 12). As soon as all the scheduled simulations have been carried out, each element of fn, divided by the total number of simulations, gives an estimate of the burn probability for one of the cells (line 13).

The sequential version of the procedure described above was used for building the BPM represented in Fig. 5 with the


Fig. 6. Three different ways (right) of mapping the CA transition function into a CUDA grid of threads in the case of the two simultaneous fires shown on the left. In the WCAM case, a CUDA kernel iterating over the two fires is associated to each cell of the automaton. In the CRBBM approach, the same type of kernel operates on each cell belonging to the current CRBB. According to the CRBB3D approach, a lighter kernel separately manages each fire for each cell belonging to the current CRBB.

Fig. 7. Time taken by the WCAM approach for simulating the 7391 independent fires required by the BPM shown in Fig. 5. The elapsed time is plotted as a function of the number of fires that are simultaneously simulated by the GPU (see Algorithm 2). The adopted CA neighbourhood size was |N| = 25.

workstation described in Section 4.1. Clearly, in the sequential case only one fire at a time was propagated, since the advantages of simulating multiple fires are not significant. The CPU took 14,167.4 s to simulate the 7391 independent fires required by the case-study BPM.

Using the adopted GPU devices, the same problem was solved with the WCAM approach and a variable number of simultaneous fires. According to the results shown in Fig. 7, the GTX 680 achieved the lowest elapsed time of 443.4 s, simultaneously simulating 169 fires. The gain provided by the parallelization in terms of computing time was significant and corresponded to a parallel speedup of 32 over the used CPU. As can be seen, the simultaneous simulation of more fires was beneficial, since it allowed a computing-time decrement ranging from 19% on the GTX 580 to about 31% on the GTX 680 and Tesla C2075. However, for all the GPUs, simulating more than about 50 fires simultaneously did not lead to significant advantages.

To understand this behaviour, it should be noted that the increase in the number of simultaneous fires causes conflicting effects on the efficiency. In fact, more simultaneous fires give the opportunity to activate the latency-hiding mechanism and also allow reducing the number of kernel calls, with the related overhead. However, in the WCAM approach a problem originates from the manner in which the simultaneous fires are handled in the kernel implementing the transition function. In fact, at the same time step a cell is burned by only some of the many fires. Therefore, iterating on the different fires generates a frequent divergence between the threads of the same warp. As mentioned in Section 3, this can be associated with a significant decline in efficiency, because the processing of different fires inside a warp is serialized [17].

For the reasons above, once the number of computationally relevant threads is sufficient for enabling latency hiding, a further increase of simultaneous fires leads to a substantial balance between the remaining effects, with no additional computational advantage.

Interestingly, in spite of their very different peak performance in single-precision floating-point operations, the best elapsed times of the GTX 580 and GTX 680 were quite similar. This is due to the fact that the used kernels are definitely memory bound and that the two GPUs have the same memory bandwidth. In fact, especially considering that most substates are pre-computed once in the pre-processing phase, the CA transition function is characterized by a relatively low computational load. In addition, to apply the fire spread mechanism each cell must access its own substates and most of the substates of its neighbouring cells.

However, it is very likely that by also computing with the GPU the fire characteristics at the cell level [39,5] at each CA step, and thus abandoning the mentioned pre-processing, the memory-bound condition of the kernel may change. This would allow the simulation of non-stationary weather conditions at the price of an increased computing time, though with higher speedups.

4.3. The rectangular bounding box strategy

4.3.1. Dynamic grid of threads

One of the weaknesses of the WCAM approach lies in the difficulty of having a high percentage of computationally active threads in the CUDA grid. Some improvement might be obtained by distributing in an appropriate way over the domain a very large number of simultaneous fires. However, this is made difficult by the fact that each fire evolves in a way that is not predictable except through the simulation itself. Furthermore, since each independent fire requires at least one additional CA substate, the number of simultaneous fires is limited by the available memory.

For these reasons, an alternative approach was developed in which the grid of threads is dynamically computed during the simulation in order to keep low the number of computationally irrelevant threads. In such an approach, labelled as CRBBM, a number of independent fires are simultaneously simulated as in


the WCAM procedure. In addition, at each CA step the procedure involves the computation of the smallest common rectangular bounding box (CRBB) that includes all the cells on the current fire fronts. Then, all the kernels required by the CA step in Algorithm 2 (i.e. the PR for the time-step adaptation and the transition function) are mapped on such a CRBB, in this way reducing the number of useless threads and significantly improving the computational performance of the algorithm.

It is worth noting that the efficiency of such an approach depends on the actual distribution of the simultaneous fires over the automaton. However, in Algorithm 2 the fires that are simulated together originate from contiguous ignition points. This usually gives rise to CRBBs significantly smaller than the whole landscape.

The CRBBM approach presented above is outlined in Fig. 6 for the case of two simultaneous fires.

Since the present study exploits relatively recent GPU devices, the CRBB computation was efficiently carried out using the atomicMin and atomicMax CUDA primitives in the same kernel implementing the transition function. As an alternative, the CRBB can be computed through a PR procedure [19]. However, as shown later, for such a purpose the PR approach is less convenient on GPUs that offer rather efficient atomic operations.
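The CRBB computation folded into the transition-function kernel can be sketched as follows (a sequential emulation, not the authors' kernel; the comments indicate the atomicMin/atomicMax calls each GPU thread would issue):

```python
# Sketch of the common rectangular bounding box (CRBB) update: every thread
# that finds its cell burning atomically extends the shared box. The
# "atomics" are plain min/max here because this emulation is sequential.

def common_bounding_box(burning_cells):
    """burning_cells: iterable of (row, col) over all simultaneous fires."""
    i_min = j_min = float("inf")
    i_max = j_max = float("-inf")
    for i, j in burning_cells:       # each iteration = one GPU thread
        i_min = min(i_min, i)        # atomicMin(&box.i_min, i)
        j_min = min(j_min, j)        # atomicMin(&box.j_min, j)
        i_max = max(i_max, i)        # atomicMax(&box.i_max, i)
        j_max = max(j_max, j)        # atomicMax(&box.j_max, j)
    return (i_min, j_min, i_max, j_max)
```

Because min and max are associative and commutative, the concurrent atomic updates on the GPU yield the same box regardless of thread ordering.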

A further enhancement introduced in the CRBBM approach concerns the device-to-device memory copy operation used to re-initialize CAcur with CAnext (line 10 of Algorithm 2). In particular, instead of involving the entire automaton, a specific kernel was developed to re-initialize the substates σ of each simultaneous fire only for the cells within the CRBB.

In order to perform a fair comparison, an analogous strategy based on the bounding box was developed for the sequential version of the program. Using the reference CPU, such a sequential procedure required 2082.6 s for simulating the 7391 fires required by the BPM adopted as case study in Section 4.1.

Fig. 8 shows the corresponding times taken by the parallel CRBBM approach as a function of the number of simultaneous fires. According to the graph, the lowest computing times provided by the GTX 680 and GTX 580 were very similar. In particular, the GTX 680 achieved the lowest elapsed time of 99.4 s, corresponding to a parallel speedup of 21. As expected, the simultaneous simulation of many fires was much more beneficial than in the WCAM case. For example, with the GTX cards the best elapsed time is less than half of that obtained by simulating only one fire at a time. The reason is that, having available a relatively limited number of simultaneous fires, the percentage of active cells is higher in a small CRBB than in the entire automaton. However, also in this case it is not advantageous to simulate more than a certain number (i.e. about 100) of simultaneous fires.

An additional investigation was aimed at ascertaining whether, for the CRBB computation, the use of atomic operations would be more suitable than the use of a PR. To this end, the two alternative implementations were compared using the same BPM adopted as case study and represented in Fig. 5. As shown in Table 2, with the GPUs used here the approach based on atomic operations provides a certain advantage in terms of computing time, which ranges from 5% to 8%. However, it is also necessary to consider that with some older GPUs the PR approach can be necessary [19].

4.3.2. Mitigating thread divergence

The use of a CRBB allows a considerable gain of computational efficiency in both the sequential and parallel versions of the BPM building procedure. However, as explained in Section 4.2, iterating on the different simultaneous fires generates a frequent divergence between the threads of the same warp, with a consequent decline of efficiency.

Fig. 8. Time taken by the CRBBM approach for simulating the 7391 independent fires required by the BPM shown in Fig. 5. The elapsed time is plotted as a function of the number of fires that are simultaneously simulated by the GPU (see Algorithm 2). The adopted CA neighbourhood size was |N| = 25.

Table 2
Best elapsed times (in seconds) using two different approaches for computing the RBB during the simulations. The results were obtained using a CA neighbourhood size |N| = 25.

              Atomic   Parallel reduction
GTX 580       99.8     108.8
GTX 680       99.4     105.1
Tesla C2075   138.8    152.2

Fig. 9. Time taken by the CRBB3D approach for simulating the 7391 independent fires required by the BPM shown in Fig. 5. The elapsed time is plotted as a function of the number of fires that are simultaneously simulated by the GPU (see Algorithm 2). The adopted CA neighbourhood size was |N| = 25.

To cope with this problem, a new transition-function kernel was implemented, which operates on a single fire. Then, such a kernel was mapped on a grid in which a separate copy of the CRBB is associated with each of the different fires. In other words, the grid of threads used for the CA transition function was three-dimensional, with the base represented by the CRBB and the vertical dimension corresponding to the fires. This is represented in Fig. 6, where the approach based on such a launch grid is labelled as CRBB3D.
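The CRBB3D launch geometry can be sketched as an enumeration of the ⟨cell, fire⟩ pairs covered by the three-dimensional grid (a hypothetical helper, shown only to make the mapping explicit):

```python
# Sketch of the CRBB3D launch geometry: a 3-D grid whose base covers the
# CRBB and whose depth indexes the simultaneous fires. Each thread handles
# one <cell, fire> pair, so the kernel no longer loops over fires and the
# warps diverge less than in the CRBBM mapping.

def threads_of_crbb3d(i_min, j_min, i_max, j_max, n_fires):
    """Enumerate the <cell, fire> pairs scheduled by the 3-D grid."""
    for k in range(n_fires):                  # vertical dimension: the fire
        for i in range(i_min, i_max + 1):     # rows within the CRBB
            for j in range(j_min, j_max + 1): # columns within the CRBB
                yield (i, j, k)
```

On the GPU the three loops collapse into one grid launch; the nested form above merely shows which pairs get a thread.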

The CRBB3D approach was then used for building the BPM in Fig. 5 with the different adopted GPUs. Fig. 9 shows the achieved elapsed times as a function of the number of simultaneous fires.


Fig. 10. The RBBMST approach for mapping the CA transition function into a CUDA grid of threads in the case of two simultaneous fires. Each kernel operates on a different couple ⟨fire, RBB⟩. Kernels are executed on different CUDA streams.

Again, the GTX 680 and GTX 580 performed in a very similar way. In particular, the GTX 580 took the lowest time of 59.4 s, simultaneously simulating 50 fires, which corresponds to a parallel speedup of 35.1. As shown in the graph, the improvements over the CRBBM approach are significant. In particular, the computing times are better by a percentage which varies between 55% and 68%, depending on the GPU.

However, in this case the existence of an optimal value of the number of simultaneous fires is particularly evident. This makes the computational performance of the CRBB3D approach quite dependent on the choice of that parameter. At the basis of such behaviour is the fact that, since only the cells on the fire front are computationally active, at each CA step and for each fire the CRBB still includes many inactive cells. Thus, when the number of computationally active threads is high enough for the latency-hiding mechanism mentioned in Section 3, any further increment of the simultaneous fires (i.e. starting from about 50) leads to the unnecessary scheduling of inactive threads, with a consequent decline of the overall computational performance. This is despite the fact that more simultaneous fires correspond to fewer kernel invocations, that is, to less overhead related to the launches.

4.3.3. Concurrent kernels

An alternative approach for simulating more than a single fire at a time is offered by the capability of the recent CUDA platforms to simultaneously activate different kernels. In particular, starting from the Fermi architecture, up to 16 concurrent kernels are supported on different streams [17,18].

To assess whether kernel concurrency could be effectively exploited for the BPM computational problem, the specific implementation of Algorithm 2 named RBBMST was developed. In the latter, the function transitionFunctionInParallel exploits, in a round-robin fashion, the available streams for separately issuing a CA transition-function kernel for each fire. As shown in Fig. 10, according to the RBBMST approach an independent RBB is associated to each fire instead of a common RBB. This is possible because the concurrent kernels operating on the different fires do not have to share a common launch grid. Clearly, such fire-optimized RBBs allow for further lowering the number of computationally irrelevant threads that are scheduled for execution.

The results obtained on the case study under investigation using the RBBMST approach are shown in Fig. 11. The lowest elapsed time of 77.8 s (a parallel speedup of 26.8) was achieved by the GTX 580 card, which performed better than the other devices, especially for the lowest numbers of simultaneous fires. For the same reasons already highlighted for the WCAM and CRBBM cases, for all the considered GPUs concurrently simulating more than a certain number of fires did not lead to significant advantages.

Fig. 11. Time taken by the RBBMST approach for simulating the 7391 independent fires required by the BPM shown in Fig. 5. The elapsed time is plotted as a function of the number of fires that are simultaneously simulated by the GPU (see Algorithm 2). The adopted CA neighbourhood size was |N| = 25.

Compared with the CRBB3D, the RBBMST approach exhibited a worse performance. Some instrumented runs showed that this was due to the higher overhead related to the kernel launches. In fact, regardless of the number of simultaneous fires, in the RBBMST scheme a kernel implementing the CA transition function is issued for each fire at each step of the simulation. Instead, in the CRBB3D case a single kernel invocation concerns all the simultaneous fires.

4.4. Linear kernel mapping

As pointed out above, during the simulation of a wildfire the CRBB still includes some inactive cells. For example, all the burned cells inside the current fire front are typically within the CRBB, though they are not computationally significant. Therefore, a more sophisticated strategy was developed, in which the relevant kernels are mapped only on the actually burning cells.

Also in this approach, labelled as LISTM, many fires are simulated at the same time in order to provide the GPU with enough threads to fully enable latency hiding. However, instead of computing a CRBB, during the CA simulations a list L of the currently burning cells is maintained in the GPU global memory.

More in particular, as depicted in Fig. 12, two copies of the list are used, namely Lt and Lt+1. Both are implemented through an array allocated in the device global memory. At the beginning of the tth CA step, Lt contains the three-dimensional indices of the cells belonging to the current fire fronts, defined as:

ξ_{i,j}^{(k)} = mi + j + nmk    (4)


Fig. 12. The LISTM approach for mapping the transition function of an n × m automaton into a one-dimensional CUDA grid of threads. The array Lt contains the three-dimensional indices ξ_{i,j}^{(k)} of the cells belonging to the current fire fronts. At the step t, the size of Lt+1 is set to |Lt| |N| and all the entries are initialized to a negative value. Then, a thread implementing the CA transition function is issued for each element of Lt. In particular, each thread updates a portion of Lt+1 having the size of the cell neighbourhood |N|. At the end of the step, a parallel stream compaction procedure removes the unused entries from Lt+1.

where i and j are the row and column index of the cell, k is the index of the fire, and n and m are the number of rows and columns of the automaton, respectively.
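Eq. (4) and its inverse can be written as follows (a minimal sketch, using the 461 × 445 grid of the case study only as an example):

```python
# Eq. (4) as code: the flattened index of cell (i, j) in fire k, for an
# n x m automaton, together with the inverse mapping used when an entry
# of L_t is read back into cell coordinates.

def encode(i, j, k, n, m):
    return m * i + j + n * m * k     # xi^(k)_{i,j} = m*i + j + n*m*k

def decode(xi, n, m):
    k, rest = divmod(xi, n * m)      # peel off the fire index first
    i, j = divmod(rest, m)           # then row and column
    return i, j, k
```

Since each fire occupies a disjoint block of nm indices, a single integer unambiguously identifies both the cell and the fire.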

Then, a PR kernel issued by findStepSizesInParallel is mapped on the elements of Lt to find the current step sizes for each fire.

Subsequently, the size of Lt+1 is set to |Lt| |N| and all the entries are initialized to a value which does not represent a valid cell index according to Eq. (4) (e.g. a negative integer).

Once Lt+1 has been initialized, a thread implementing the CA transition function is issued for each element of Lt. In particular, each thread is responsible for updating a specific portion of Lt+1 having the size of the cell neighbourhood |N|. In the fire spread mechanism, if a burning cell is still burning at the end of the step, then its index ξ is read from Lt and inserted in the proper position of Lt+1. Also, when a cell propagates the fire to a neighbouring cell, the index of the newly ignited cell, given by Eq. (4), is inserted in Lt+1.

At the end of the step, a parallel stream compaction procedure [28,36] removes the unused entries from Lt+1 (i.e. those that still have the initial value). In particular, the efficient algorithm presented in [10] was included in the program for such a purpose.
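The stream compaction step can be sketched as follows (a sequential emulation of the standard scan-and-scatter scheme, not the specific algorithm of [10]):

```python
# Stream compaction sketch: an exclusive prefix sum over validity flags
# gives each surviving entry its output slot, so the scatter can then be
# performed by independent threads (here, a plain loop). Negative entries
# are the "unused" initial values removed from L_{t+1}.

def compact(entries):
    flags = [1 if e >= 0 else 0 for e in entries]
    # exclusive prefix scan of the flags (a parallel scan on the GPU)
    offsets, total = [], 0
    for f in flags:
        offsets.append(total)
        total += f
    out = [None] * total
    for idx, e in enumerate(entries):        # scatter phase
        if flags[idx]:
            out[offsets[idx]] = e
    return out
```

The scan decouples the threads: each one knows its destination slot without coordinating with the others, which is what makes the operation parallelizable.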

Solving the test case of Section 4.1 with the LISTM approach led to the results shown in Fig. 13. The lowest elapsed time was achieved with the GTX 580 card, which provided the BPM in 45.8 s using 81 simultaneous fires. Overall, the GTX 580 performed slightly better than the GTX 680.

An analogous strategy, in which the burning cells are tracked by a list, was developed for the sequential platform. In particular, using the reference CPU, such a sequential procedure required 2005.9 s for computing the BPM of Section 4.1. Therefore, the corresponding parallel speedup achieved by the GTX 580 using the LISTM approach was 43.8.

4.5. Summary of results and further discussion

For clarity, a summary of the elapsed times achieved above for solving the test problem is shown in Table 3. Moreover, the latter shows the computing times taken by the same algorithms with the use of a standard Moore neighbourhood (i.e. |N| = 9). Since, as mentioned in Section 4.1, a smaller neighbourhood still provides an acceptable BPM, it is of some interest to investigate the corresponding computational gain and parallel performance. As can be seen, the GTX 580 card took only 29 s to build the slightly less accurate BPM, with a parallel speedup of 32.9. Overall, Table 3

Fig. 13. Time taken by the LISTM approach for simulating the 7391 independent fires required by the BPM shown in Fig. 5. The elapsed time is plotted as a function of the number of fires that are simultaneously simulated by the GPU (see Algorithm 2). The adopted CA neighbourhood size was |N| = 25.

Table 3
Elapsed times (in seconds) for simulating the 7391 independent fires required by the BPM shown in Fig. 5.

              WCAM      CRBBM    CRBB3D   RBBMST   LISTM

Neighbourhood size = 9
CPU           9893.9    1049.8   –        –        954.5
GTX 580       443.3     53.6     40.1     52.3     29.0
GTX 680       390.1     54.0     43.4     74.8     35.0
Tesla C2075   672.1     76.8     59.4     80.1     46.4

Neighbourhood size = 25
CPU           14,167.4  2082.6   –        –        2005.9
GTX 580       515.8     99.8     59.4     77.8     45.8
GTX 680       443.4     99.4     61.1     86.5     52.5
Tesla C2075   765.2     138.8    89.8     115.0    76.0

shows that the results in terms of time savings are significant, especially using the LISTM approach.

To get a deeper insight into the functioning of the proposed approaches, a further investigation was aimed at assessing the composition of the minimum achieved computing time. In particular, the different algorithms were instrumented to measure


Fig. 14. Composition of the minimum computing time taken by the Tesla C2075 with the adopted approaches. In the WCAM case, most of the computing time is taken by the parallel reduction used for adapting the time step during the simulation. The most efficient CUDA kernel implementing the CA transition function was obtained following the LISTM approach.

the time taken by their different components. The results obtained with the Tesla C2075 are depicted in Fig. 14.

According to the latter, in the inefficient WCAM case, 62% of the computing time was taken by the PR used for adapting the time step during the simulation.

Also, the results show that using the LISTM approach does not provide any computational advantage in the PR used for adapting the time step. This is related to the fact that finding the maximum allowed steps for a high number of simultaneous fires prevents the PR from exploiting the limited GPU shared memory. Arguably, there is still some room for improvement of this component of the algorithm, which will be the object of future work.

As expected, the most efficient CUDA kernel implementing the CA transition function was obtained following the LISTM parallelization. Interestingly, the stream compaction procedure required by such an approach took only 8% of the total computing time.

A final remark concerns the Remaining operations in Fig. 14. As can be seen, a significant gain was found for such an entry in all the approaches more advanced than the WCAM. This is mainly due to the mentioned substitution of the simple device-to-device memory copy operation, used to re-initialize the CAcur values of the substates with CAnext at the end of each CA step, with a kernel that avoids copying most of the memory areas that have not changed.

5. Conclusions

Starting from the problem of BPM computation, this study discussed several approaches for the simultaneous simulation of a large number of wildfires using GPGPU.

One of the advantages of the presented solutions lies in enabling the building of BPMs for large areas (e.g. at a regional level), which otherwise may not be possible by adopting ordinary sequential computation.

The parallel speedups attained through the proposed approaches were indeed satisfactory. Even if the first, straightforward parallelization of the BPM computation led to positive results, significant improvements were then achieved by taking into account the relevant characteristics of both the computational problem and the adopted GPGPU platform. This, while confirming on one hand the effectiveness of GPGPU as an alternative to traditional parallel paradigms, shows, on the other, how the adoption of ad hoc strategies can be crucial to the success of the parallelization.

In addition to the construction of BPMs, the presented strategies may also be adopted for carrying out extensive sensitivity analyses of model parameters or for the automatic planning of risk-mitigation interventions on the landscape (i.e. the so-called fuel treatments) [2]. Moreover, the fast simulation of a number of wildfires gives the opportunity to develop a support tool for the real-time optimization of fire-fighting activities [29].

Future work will focus on applying, with the required adjustments, the GPGPU parallelization strategies investigated in this paper to other research and application areas. In fact, the algorithm for building BPMs is somewhat similar to those adopted for dealing with other natural hazards, such as the risk induced by debris or lava flows (e.g. [38,16]).

Acknowledgments

This work was partially funded by the European Commission - European Social Fund (ESF) and by the Regione Calabria (Italy).

References

[1] A. Ager, M. Finney, Application of wildfire simulation models for risk analysis, in: Geophysical Research Abstracts, Vol. 11, EGU General Assembly, 2009, EGU2009-5489.

[2] A.A. Ager, N.M. Vaillant, M.A. Finney, A comparison of landscape fuel treatment strategies to mitigate wildland fire risk in the urban interface and preserve old forest structure, Forest Ecology and Management 259 (8) (2010) 1556–1570.

[3] A.A. Ager, N.M. Vaillant, M.A. Finney, H.K. Preisler, Analyzing wildfire exposure and source-sink relationships on a fire prone forest landscape, Forest Ecology and Management 267 (0) (2012) 271–283.

[4] M. Alexander, Estimating the length-to-breadth ratio of elliptical forest fire patterns, in: Proc. 8th Conf. Fire and Forest Meteorology, 1985, pp. 287–304.

[5] H. Anderson, Predicting wind-driven wildland fire size and shape, Tech. Rep. INT-305, US Department of Agriculture, Forest Service, 1983.

[6] P. Andrews, BEHAVE: fire behavior prediction and fuel modeling system—burn subsystem, part 1, Tech. Rep. INT-194, US Department of Agriculture, Forest Service, 1986.

[7] M.V. Avolio, G.M. Crisci, S. Di Gregorio, R. Rongo, W. Spataro, G.A. Trunfio, SCIARA γ2: an improved cellular automata model for lava flows and applications to the 2002 Etnean crisis, Computers & Geosciences 32 (7) (2006) 876–889.

[8] M.V. Avolio, S. Di Gregorio, V. Lupiano, G.A. Trunfio, Simulation of wildfire spread using cellular automata with randomized local sources, in: ACRI 2012, Vol. 7495, in: LNCS, Springer, Berlin/Heidelberg, 2012, pp. 279–288.

[9] Porting and optimizing MAGFLOW on CUDA, Annals of Geophysics 5 (54).

[10] M. Billeter, O. Olsson, U. Assarsson, Efficient stream compaction on wide SIMD many-core architectures, in: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on High Performance Graphics 2009, New Orleans, Louisiana, USA, August 1–3, 2009, ACM, 2009, pp. 159–166.

[11] I. Blecic, A. Cecchini, G.A. Trunfio, A generalized rapid development environment for cellular automata based simulations, in: Cellular Automata, in: LNCS, vol. 3305, Springer, Berlin, Heidelberg, 2004, pp. 851–860.

[12] I. Blecic, A. Cecchini, G.A. Trunfio, A general-purpose geosimulation infrastructure for spatial decision support, Transactions on Computational Science 6 (2009) 200–218.

[13] C.R. Calidonna, A. Naddeo, G.A. Trunfio, S. Di Gregorio, From classical infinite space–time CA to a hybrid CA model for natural sciences modeling, Applied Mathematics and Computation 218 (16) (2012) 8137–8150.

[14] Y. Carmel, S. Paz, F. Jahashan, M. Shoshany, Assessing fire risk using Monte Carlo simulations of fire spread, Forest Ecology and Management 257 (1) (2009) 370–377.

[15] EEA, CORINE land cover—technical guide, Tech. Rep., Office for Official Publications of European Communities, 1993.

[16] G.M. Crisci, M.V. Avolio, B. Behncke, D. D’Ambrosio, S. Di Gregorio, V. Lupiano, M. Neri, R. Rongo, W. Spataro, Predicting the impact of lava flows at Mount Etna, Italy, Journal of Geophysical Research: Solid Earth 115 (B4).

[17] CUDA C Programming Guide, v. 3.2, 2010.

[18] CUDA C Best Practices Guide, 2012.

[19] D. D’Ambrosio, S. Di Gregorio, G. Filippone, R. Rongo, W. Spataro, G.A. Trunfio, Fast assessment of wildfire spatial hazard with GPGPU, in: SIMULTECH 2012—Proceedings of the 2nd International Conference on Simulation and Modeling Methodologies, Technologies and Applications, 2012, pp. 260–269.

[20] D. D’Ambrosio, G. Filippone, R. Rongo, W. Spataro, G. Trunfio, Cellular automata and GPGPU: an application to lava flow modeling, International Journal of Grid and High Performance Computing 4 (3) (2012) 30–47.

[21] S. Di Gregorio, R. Serra, An empirical method for modelling and simulating some complex macroscopic phenomena by cellular automata, Future Generation Computer Systems 16 (2–3) (1999) 259–271.

[22] S. Di Gregorio, R. Serra, M. Villani, Applying cellular automata to complex environmental problems: the simulation of the bioremediation of contaminated soils, Theoretical Computer Science 217 (1) (1999) 131–156.


[23] G. Filippone, W. Spataro, G. Spingola, D. D’Ambrosio, R. Rongo, G. Perna, S. Di Gregorio, GPGPU programming and cellular automata: implementation of the SCIARA lava flow simulation code, in: 23rd European Modeling and Simulation Symposium EMSS, Rome, Italy, 2011.

[24] M.A. Finney, The challenge of quantitative risk analysis for wildland fire, Forest Ecology and Management 211 (1–2) (2005) 97–108.

[25] M.A. Finney, FARSITE: fire area simulator—model development and evaluation, Tech. Rep. RMRS-RP-4, US Department of Agriculture, Forest Service, 2004.

[26] J. Forthofer, K. Shannon, B. Butler, Simulating diurnally driven slope winds with WindNinja, in: Proceedings of 8th Symposium on Fire and Forest Meteorology, Kalispell, MT, 2009.

[27] I. French, D. Anderson, E. Catchpole, Graphical simulation of bushfire spread, Mathematical and Computer Modelling 13 (1990) 67–71.

[28] D. Horn, Stream reduction operations for GPGPU applications, Addison-Wesley, 2005, pp. 573–589 (Chapter 36).

[29] X. Hu, L. Ntaimo, Integrated simulation and optimization for wildfire containment, ACM Transactions on Modeling and Computer Simulation 19 (4) (2009) 1–29.

[30] P. Johnston, J. Kelso, G. Milne, Efficient simulation of wildfire spread on an irregular grid, International Journal of Wildland Fire 17 (2008) 614–627.

[31] P.H. Kourtz, W.G. O’Regan, A model for a small forest fire to simulate burned and burning areas for use in a detection model, Forest Science 17 (7) (1971) 163–169.

[32] A.M.G. Lopes, M.G. Cruz, D.X. Viegas, FireStation—an integrated software system for the numerical simulation of fire spread on complex topography, Environmental Modelling and Software 17 (3) (2002) 269–285.

[33] A.B. Massada, V.C. Radeloff, S.I. Stewart, T.J. Hawbaker, Wildfire risk in the wildland-urban interface: a simulation study in Northwestern Wisconsin, Forest Ecology and Management 258 (9) (2009) 1990–1999.

[34] R. McAlpine, B. Lawson, E. Taylor, Fire spread across a slope, in: Proceedings of the 11th Conference on Fire and Forest Meteorology, Society of American Foresters, Bethesda, MD, 1991, pp. 218–225.

[35] S.H. Peterson, M.E. Morais, J.M. Carlson, P.E. Dennison, D.A. Roberts, M.A. Moritz, D.R. Weise, Using HFIRE for spatial modeling of fire in shrublands, Tech. Rep. PSW-RP-259, US Department of Agriculture, Forest Service, Pacific Southwest Research Station, Albany, CA, 2009.

[36] D. Roger, U. Assarsson, N. Holzschuch, Efficient stream reduction on the GPU, in: D. Kaeli, M. Leeser (Eds.), Workshop on General Purpose Processing on Graphics Processing Units, 2007.

[37] R. Rongo, V. Lupiano, M.V. Avolio, D. D’Ambrosio, W. Spataro, G.A. Trunfio, Cellular automata simulation of lava flows—applications to civil defense and land use planning with a cellular automata based methodology, in: SIMULTECH 2011—Proceedings of 1st International Conference on Simulation and Modeling Methodologies, Technologies and Applications.

[38] R. Rongo, W. Spataro, D. D’Ambrosio, M.V. Avolio, G.A. Trunfio, S. Di Gregorio, Lava flow hazard evaluation through cellular automata and genetic algorithms: an application to Mt Etna volcano, Fundamenta Informaticae 87 (2) (2008) 247–267.

[39] R.C. Rothermel, A mathematical model for predicting fire spread in wildland fuels, Tech. Rep. INT-115, US Department of Agriculture, Forest Service, Intermountain Forest and Range Experiment Station, Ogden, UT, 1972.

[40] R.C. Rothermel, How to predict the spread and intensity of forest and range fires, Tech. Rep. INT-143, US Department of Agriculture, Forest Service, Intermountain Forest and Range Experiment Station, Ogden, UT, 1983.

[41] W. Spataro, M.V. Avolio, V. Lupiano, G.A. Trunfio, R. Rongo, D. D’Ambrosio, The latest release of the lava flows simulation model SCIARA: first application to Mt. Etna (Italy) and solution of the anisotropic flow direction problem on an ideal surface, Procedia CS 1 (1) (2010) 17–26.

[42] A. Sullivan, Wildland surface fire spread modelling, 1990–2007. 3: simulation and mathematical analogue models, International Journal of Wildland Fire 18 (2009) 387–403.

[43] P.M. Torrens, I. Benenson, Geographic automata systems, International Journal of Geographical Information Science 19 (4) (2005) 385–412.

[44] G.A. Trunfio, R. Rongo, D. D’Ambrosio, W. Spataro, S. Di Gregorio, A new algorithm for simulating wildfire spread through cellular automata, ACM Transactions on Modeling and Computer Simulation 22 (1) (2011) 6.

[45] J. Vasconcelos, B. Zeigler, J. Pereira, Simulation of fire growth in GIS using discrete event hierarchical modular models, Advances in Remote Sensing 4 (3) (1995) 54–62.

[46] S. Yassemi, S. Dragicevic, M. Schmidt, Design and implementation of an integrated GIS-based cellular automata model to characterize forest fire behaviour, Ecological Modelling 210 (1–2) (2008) 71–84.

Salvatore Di Gregorio was born in Castellammare del Golfo, Sicily, Italy, in 1948. He received his ‘‘Laurea’’ degree in Physics from the University of Palermo in 1972 and continued studies and research in Computer Science at the University of Naples and at the University of British Columbia (1972–1978). In 1978, he joined the University of Calabria (Italy) as an ‘‘external’’ Professor of Computer Science at the Systems Science Department. In 1985, he became an Associate Professor at the Department of Mathematics. Since 2002, he has been a Full Professor. His research interests include Parallel Computing and the Modelling and Simulation of Complex Phenomena by Cellular Automata (surface flows, traffic, bioremediation, forest fires, ecosystems). He is the author of more than one hundred international publications. Prof. Di Gregorio was the founder of the ACRI conference, was a convenor for the EGU, AOGS and iEMSs conferences, and has been guest editor for several international journals.

Giuseppe Filippone obtained his B.S. degree (Laurea Degree) in Computer Science at the University of Calabria (Italy) in 2010 with the maximum final score. Currently, he is a Ph.D. student in Computer Science at the Department of Mathematics and Computer Science at the University of Calabria (Italy). Within his Ph.D. study course, he has also conducted research at Plymouth University (UK) - Centre of Robotics and Neural Systems - on several aspects of computer science regarding different fields, from parallel computing to the simulation of complex systems, from genetic algorithms to computer graphics.

William Spataro was born in London (UK) and earned his Italian Laurea Degree in Applied Mathematics Summa cum Laude in 1992. Currently, he is a researcher in computer science at the Department of Mathematics and Computer Science at the University of Calabria (Italy), where he teaches Parallel Computing and Distributed Systems in the MS Computer Science Course. His research interests are focused on parallel computing, high performance computing, modelling and simulation of complex phenomena, cellular automata and genetic algorithms.

Giuseppe A. Trunfio is a researcher in Computer Engineering at the University of Sassari, Italy. He holds a Ph.D. in Computational Mechanics from the University of Calabria (Italy) and has been a research fellow at the Institute for Atmospheric Pollution of the Italian National Research Council. His main research interests focus on modelling and simulation, cellular automata, high performance computing and evolutionary computation. In particular, he has been involved in many EU-funded and national interdisciplinary research projects, where his contributions mainly concerned the development of simulation models, the design and implementation of decision support systems and advanced applications of computational methods.