GPU Acceleration of Tsunami Propagation Model


1014 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 3, JUNE 2012

GPU Acceleration of Tsunami Propagation Model

Muhammad T. Satria, Bormin Huang, Tung-Ju Hsieh, Yang-Lang Chang, Senior Member, IEEE, and Wen-Yew Liang

Abstract—Tsunami propagation in the shallow water zone is often modeled by the shallow water equations (also called the Saint-Venant equations), which are derived from the conservation of mass and conservation of momentum equations. Adding a friction slope to the conservation of momentum equations enables the system to simulate the propagation over the coastal area, which means the system is also able to estimate the inundation zone caused by the tsunami. Applying the Neumann boundary condition and the Hansen numerical filter brings more interesting complexities into the system. We solve the system using the two-step finite-difference MacCormack scheme, which is potentially parallelizable. In this paper, we discuss the parallel implementation of the MacCormack scheme for the shallow water equations on a modern graphics processing unit (GPU) architecture using NVIDIA CUDA technology. On a single Fermi-generation NVIDIA C2050 GPU, we achieved a 223x speedup with the result output at each time step over the original C code compiled with the -O3 optimization flag. If the experiment only outputs the final time step result to the host, our CUDA implementation achieved around an 818x speedup over its single-threaded CPU counterpart.

Index Terms—CUDA, GPU, MacCormack scheme, Saint-Venant system, shallow water equations, tsunami.

I. INTRODUCTION

Like many other researchers in the world, the destructive earthquake followed by a tsunami that hit Japan in March 2011 has motivated us to get involved in research on tsunami topics. Based on our previous work in [1], we try to improve the computing performance by implementing the tsunami model on a recent Fermi-generation GPU using CUDA technology. Unlike the shader programming technique used in our previous work, CUDA gives us more freedom in exploiting parallelism and more access to the computing resources of the GPU.

GPU-based computing has been applied to many science and engineering fields, including remote sensing. Lee et al. [2] review recent advances in HPC, including GPU technology, for remote sensing. Christophe [3] presents an efficient GPU-based framework of image processing algorithms for remote sensing applications. A GPU implementation of a regular LDPC decoder gains a 271x speedup compared to the CPU-based version, as reported in [4].

Manuscript received September 30, 2011; revised December 27, 2011, and May 02, 2012; accepted May 02, 2012. Date of publication June 13, 2012; date of current version June 28, 2012.

M. T. Satria and B. Huang are with the Space Science and Engineering Center, University of Wisconsin-Madison, Madison, WI 53706 USA (corresponding author e-mail: [email protected]).

T. J. Hsieh and W. Y. Liang are with the Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 10608, Taiwan.

Y. L. Chang is with the Department of Electrical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTARS.2012.2199468

Wei et al. [5] report a 72x speedup for a GPU-based implementation of the predictive partitioned vector quantization (PPVQ) compression scheme. Song et al. [6] propose CPU and GPU heterogeneous computing for a real-time Reed-Solomon channel decoding and source decoding system. The GPU is used for the time-consuming inverse DWT, while multiple CPU threads run in parallel for the remaining parts. A GPU-based high-performance Radiative Transfer Model (RTM) for the Infrared Atmospheric Sounding Interferometer (IASI) proposed in [7] can reach a 763x speedup for a single GPU and a 3024x speedup for 4 GPUs, with respect to the single-threaded Fortran CPU code.

In this paper, we discuss the CUDA implementation of a tsunami model and simulate tsunami propagation over the shallow water zone. Without losing generality for benchmarking our GPU-based tsunami model performance, we use a set of simulated wave data and spline interpolation to generate the initial tsunami wave. The proposed model can also adopt initial wave data from the Global Sea Level Observing System (GLOSS) data record, if available. GLOSS is an international program for the establishment of high quality global and regional sea level data. It is run and monitored under the coordination of the Intergovernmental Oceanographic Commission (IOC). GLOSS measures sea level globally and transmits it via satellite to the Pacific Tsunami Warning Center (PTWC) at short time intervals. Unlike the Deep-ocean Assessment and Reporting of Tsunamis (DART) program, which aims to collect wave data in the deep ocean zone, most GLOSS transducers are located in the shallow water zone and close to land masses. This means it provides more reliable and up-to-date sea level data. In the future, with adequate data from GLOSS, we hope our GPU-based tsunami simulation can provide very fast estimations with modest computing resources to help predict potentially hazardous areas for tsunami mitigation strategies.

In the shallow water zone, tsunami propagation is modeled as the

Saint-Venant system, which is derived from the equations of conservation of mass and conservation of momentum. Adding a friction slope to the momentum equations enables the system to propagate the tsunami over a coastal area, which means the system is also able to estimate the inundation zone in a coastal area [8]. The friction slope is the rate at which energy is lost due to channel resistance, and it affects the propagation flow in open channels.

Many numerical approaches have been used by researchers to solve the equations. In [9], Hagen used the Lax–Friedrichs method. This is a very robust scheme, but it unfortunately gives excessive smearing of non-smooth parts of the solution. Kass [10] used an alternating direction implicit (ADI) method. Layton [11] and Kuo [12] used a semi-Lagrangian method. Karsten [13] used the Jacobi iteration



Fig. 1. Illustration of the shallow water zone. h is the water depth, u and v are the velocities in the respective x-y directions, and i and j are the discretizations of the x-y axes.

method. Brodtkorb [14] discussed the Kurganov–Levy and Kurganov–Petrova approaches for the Saint-Venant system. We prefer to use the MacCormack scheme [15], which is well suited for nonlinear equations such as the Saint-Venant system. The MacCormack scheme is a variation of the two-step Lax–Wendroff method, but it is much simpler and gives better results for nonlinear equations [16]. In the predictor step, forward differences are applied, followed by backward differences in the corrector step. The MacCormack scheme is potentially parallelizable. In this paper, we discuss the parallel implementation of the MacCormack scheme for the Saint-Venant system using CUDA technology.

Brodtkorb [14] has discussed a CUDA implementation of the Kurganov–Levy and Kurganov–Petrova methods for the Saint-Venant system. However, they implemented it on a GeForce GTX 285, which was a mid-2009 generation NVIDIA GPU card with lower computing capability. Considering that limitation, they focused more on the efficiency trade-off between single and double precision than on speedup performance. In this work, we use the latest high-end generation of NVIDIA GPU card, which has a new GPU architecture called Fermi. Our objective is to compare the computing performance with respect to the original C code.

This paper is organized as follows. Section II discusses the tsunami propagation model. Section III explains the simulation flow that we use in the original C code. An introduction to GPU-based computing is given in Section IV. Our CUDA implementation and the performance results are discussed in Section V. Finally, Section VI concludes our work.

II. TSUNAMI PROPAGATION MODEL

In our work, we choose the shallow water zone as the computational domain, as illustrated in Fig. 1. It is a zone where the bed slope and friction slope still influence the wave transport at the surface. For this situation, the Saint-Venant system is applicable to model the tsunami propagation.

A. The Saint-Venant System

The Saint-Venant system is often called the shallow water equations. It is widely applicable to shallow water topics or situations where the horizontal length scale is much greater than the vertical length scale [17]. In two-dimensional space, the Saint-Venant system is described as follows:

(1)

Fig. 2. Stencil of the MacCormack scheme in the x-axis.

(2)

(3)

where h represents the water depth or height of the sea surface relative to the bottom, and u and v are respectively the velocities in the x-axis and y-axis. g is the gravity acceleration, t is the time, and S0 is the bed slope. Sf is the friction slope and is defined by the Manning equation. The friction slopes in the x- and y-directions are described as follows:

(4)

(5)

In (4) and (5), n is the empirical Manning coefficient of roughness. We assume n is equal to 0.05 as a representation of a light brush surface over the coastal area [18]. Equations (1), (2), and (3) are then rewritten as

(6)

(7)

(8)
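The placeholders (1)-(8) above refer to the Saint-Venant equations with bed slope and Manning friction source terms. For orientation, a standard textbook form that is consistent with the variables defined above (the authors' exact (1)-(8) may differ in detail) is

\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0

\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + g\frac{\partial h}{\partial x} = g\,(S_{0x} - S_{fx})

\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + g\frac{\partial h}{\partial y} = g\,(S_{0y} - S_{fy})

with the Manning friction slopes

S_{fx} = \frac{n^2\,u\,\sqrt{u^2 + v^2}}{h^{4/3}}, \qquad S_{fy} = \frac{n^2\,v\,\sqrt{u^2 + v^2}}{h^{4/3}}.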

B. Applying the MacCormack Scheme

The MacCormack scheme is a two-step explicit method. Provisional values are calculated in the predictor step and then updated in the corrector step. Forward differences are applied in the predictor step, while the corrector step applies backward differences. Fig. 2 illustrates the stencil of the MacCormack scheme in the x-axis. The provisional values of h, u, and v at the predictor step (marked with a predictor superscript) are described as follows:

(9)


(10)

(11)

In the corrector step, the updated values of h, u, and v at the next time step are described as follows:

(12)

(13)

(14)
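For a one-dimensional conservation law with a source term, \partial U/\partial t + \partial F(U)/\partial x = S(U), the MacCormack scheme takes the following standard form, shown here only for orientation; (9)-(14) apply the same idea to h, u, and v in two dimensions.

Predictor (forward difference):
U_i^{*} = U_i^{n} - \frac{\Delta t}{\Delta x}\left(F_{i+1}^{n} - F_{i}^{n}\right) + \Delta t\,S_i^{n}

Corrector (backward difference, averaged with the old value):
U_i^{n+1} = \frac{1}{2}\left[U_i^{n} + U_i^{*} - \frac{\Delta t}{\Delta x}\left(F_{i}^{*} - F_{i-1}^{*}\right) + \Delta t\,S_i^{*}\right]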

C. Boundary Condition

The Neumann boundary condition is applied to the height and to the velocity component along the corresponding axis. For the velocity component perpendicular to that axis, we apply a simple copy boundary condition, which propagates the values at the interior nodes of the computational domain to the outer nodes. A wet-dry condition is also applied and is defined by giving a tolerance for the height: if a node has a height lower than the tolerance, its height and velocities are set to 0.
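As an illustration only (not the authors' exact code), the boundary treatment along the i = 0 edge and the wet-dry check can be sketched in C as follows; NX, NY, TOL, and the row-major arrays h, u, v are assumed names.

    /* Sketch: Neumann (zero-gradient) and copy boundary conditions on the
       i = 0 edge, plus the wet-dry tolerance check.  Assumed names: NX, NY,
       TOL, and row-major float arrays h, u, v of size NX*NY. */
    for (int j = 0; j < NY; j++) {
        h[0*NY + j] = h[1*NY + j];   /* Neumann: copy height from the interior   */
        u[0*NY + j] = u[1*NY + j];   /* Neumann for the velocity along the axis  */
        v[0*NY + j] = v[1*NY + j];   /* copy condition for the perpendicular one */
    }
    /* Wet-dry condition: dry out nodes whose depth is below the tolerance. */
    for (int idx = 0; idx < NX*NY; idx++) {
        if (h[idx] < TOL) { h[idx] = 0.0f; u[idx] = 0.0f; v[idx] = 0.0f; }
    }

The remaining edges are handled the same way with the roles of i and j exchanged.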

D. Numerical Filter

The Hansen numerical filter is applied in order to obtain better stability. At each node, the height and velocities are updated by the following equation:

(15)

Fig. 3. Stencil of Hansen numerical filter.

Fig. 4. Computational domain at the north of Sumatera Island.

Fig. 5. Grid of computational domain.

In (15), the filtered quantity represents h, u, or v at a node, and the smoothing constant is equal to 0.99 [19]. The filter acts as an artificial dissipation, as illustrated in Fig. 3.
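A typical form of such a five-point smoothing filter, written here for the height h (the same update is applied to u and v) and offered only as a generic illustration of the stencil in Fig. 3, is

h_{i,j} \leftarrow \alpha\,h_{i,j} + \frac{1-\alpha}{4}\left(h_{i+1,j} + h_{i-1,j} + h_{i,j+1} + h_{i,j-1}\right), \qquad \alpha = 0.99,

while the exact weighting used in (15) follows [19].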

E. Initial Wave Generation

In our simulation, we set up dummy wave elevation data with corresponding times. In every time step, we apply spline interpolation to these data in order to obtain an interpolated initial wave elevation. Visually, this gives a more realistic initial wave than using a sinusoid. The initial velocity component perpendicular to the propagation direction is set to 0, and the initial velocity along the propagation direction is calculated by the following equation:

(16)

where g is the gravity acceleration and the total depth is the sum of the initial water depth and the interpolated initial wave elevation.
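If (16) is built only from these two quantities, a natural reconstruction, offered here as an assumption rather than as the authors' exact formula, is the shallow-water (long-wave) celerity relation

u = \sqrt{g\,(d + \eta)},

where d is the initial water depth and \eta is the interpolated initial wave elevation.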

III. SIMULATION FLOW

In our simulation, the computational domain is a grid of 80 × 180 nodes with a 200-meter space interval. It represents a 16 km × 36 km area at the north of Sumatera Island, as shown in Fig. 4. We take this area because it represents an open channel, in which case the tsunami wave or propagation flow will go in one direction uniformly. Fig. 5 illustrates the invisible grid on the


Fig. 6. Illustration of tsunami simulation scenario.

Fig. 7. Flowchart of the simulation process.

Fig. 8. Pseudocode of predictor step for C implementation.

domain. In Fig. 6, i is the index along the x-axis and j is the index along the y-axis. The initial wave elevation is generated for all nodes on the inflow boundary (index equal to 0). The generated wave is then propagated by updating the h, u, and v values at each node using the MacCormack numerical scheme. The whole simulation process is described by the flowchart in Fig. 7. It starts from the initial data (step 1), which consist of height maps of the bottom topography and water level data. In step 2, the initial wave is generated. The provisional values of h, u, and v are calculated in the predictor step (step 3). Then, the corrector step (step 4) is invoked, followed by the filtering step (step 5). After that, the output data are stored (step 6). These processes are repeated for all time steps.

Fig. 8 shows pseudocode of the predictor step for the original C implementation. The original C code is compiled on an Intel

TABLE I
EXECUTION TIME OF C IMPLEMENTATION

Xeon E5620 2.4 GHz PC with a 64-bit Linux operating system, using the Linux GCC compiler version 4.1.2 with the -O3 optimization flag. Lines 8 and 9 indicate that the computation for nodes in the interior region of the grid is covered by the loops. For nodes with indices on the boundaries, the boundary conditions are applied. Equation (9) is expressed at lines 10 to 12. Table I shows the execution time of the original C code for the corresponding numbers of time steps.
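The structure of the predictor step described for Fig. 8 can be sketched in C as follows; this is a minimal sketch rather than the authors' code, and the names dt, dx, dy, NX, NY, and the arrays h, u, v (current values) and hp, up, vp (provisional values) are assumptions. Only the continuity update is written out in full.

    /* Sketch of the predictor-step loop: forward differences over interior
       nodes; edge nodes are filled afterwards by the boundary conditions. */
    for (int i = 1; i < NX - 1; i++) {
        for (int j = 1; j < NY - 1; j++) {
            int c = i*NY + j;          /* current node        */
            int e = (i+1)*NY + j;      /* neighbor in +x      */
            int n = i*NY + (j+1);      /* neighbor in +y      */
            /* forward difference of the mass fluxes hu and hv, cf. (9) */
            hp[c] = h[c] - dt/dx * (h[e]*u[e] - h[c]*u[c])
                         - dt/dy * (h[n]*v[n] - h[c]*v[c]);
            /* up[c] and vp[c] follow (10) and (11) analogously */
        }
    }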

IV. GPU-BASED COMPUTING

Nowadays, several modern frameworks for general-purpose computing on GPUs (GPGPU) have been proposed, namely CUDA, OpenCL, and DirectCompute. They are replacing the shader programming technique for exploiting the parallelism provided by GPUs. CUDA is developed by NVIDIA and therefore runs only on NVIDIA GPUs. OpenCL was first proposed by Apple and has since been maintained by the Khronos Group. OpenCL offers cross-platform capability, but its growth has not been as fast as that of CUDA. Another framework, DirectCompute, is proposed and developed by Microsoft. It is bundled in the Microsoft DirectX family of APIs and therefore runs only on the Microsoft Windows operating system. In [20], CUDA showed better performance than OpenCL. We prefer CUDA to OpenCL because of that result, and we run it on a Linux operating system. Therefore, the GPU architecture explained in this section refers to NVIDIA GPUs.

In CUDA, the computing system consists of a CPU and one or more GPUs. CUDA offers the ability to access and manage the computing process on the GPU. We can set the number of GPU threads and registers we need, and we can choose in which GPU memory type to store our data. The GPU has several types of memory with different sizes and access costs. CUDA also has some non-default features that can be used to improve computing performance. In this paper, we explore some of those features and apply them to our code. Since the performance depends on more than the number of cores, it is important for a developer to understand the GPU architecture, the computing resources and processes on the GPU, and the non-default features of CUDA. The performance also depends on how the problem is parallelized; in this work, we set the parallelization based on the computational domain.

A CUDA program supplies a single source code encompassing both CPU and GPU. Basically, it has three phases: data transfer from CPU RAM to GPU RAM, execution of the CUDA kernel by the GPU, and data transfer back to CPU RAM. In general, CUDA has two groups of memory: on-chip memory and DRAM memory. On-chip memory contains shared memory, registers, and the L1 cache. The latest NVIDIA GPUs also have an L2 cache. Accessing on-chip memory is faster than accessing DRAM memory. In DRAM memory, there are global memory, constant memory, and texture memory. Constant and texture memories are cached in the L1 cache.


TABLE II
SPECIFICATION OF NVIDIA FERMI C2050

Therefore, in some cases, accessing constant or texture memory is cheaper than accessing global memory. The latest NVIDIA GPUs that have an L2 cache can be configured to cache global memory in L2 by setting an optional flag when compiling the CUDA program. This feature may increase the performance significantly.

The CUDA compiler, NVCC, separates the program into CPU code and GPU code. Code written in straight ANSI C runs as an ordinary process and is compiled by the CPU's standard C compiler. Code written in ANSI C extended with keywords that label data-parallel functions, called kernels, is compiled by NVCC itself and then executed on the GPU. When a kernel is invoked, a number of threads are generated on the GPU. Threads are organized as a grid of thread-blocks, and each thread executes the same kernel code. The thread-blocks are then distributed to the available computing resources, or multiprocessors (MPs), of the GPU. The distribution of thread-blocks takes into account the capability and specification of the GPU hardware: the limited capacities of registers and shared memory in an MP define how many blocks can be handled by one MP. The threads of a thread-block are executed concurrently within groups of 32 threads called warps. Global memory accesses by the threads of a half-warp can be coalesced into a single memory transaction of 32, 64, or 128 bytes. To achieve coalesced memory access, the threads of a half-warp must access the 16 words of a segment in sequence; that is, the k-th thread in the half-warp must access the k-th word of the segment. Understanding the thread-block configuration and memory management is required in order to develop an application in CUDA. More details of CUDA programming are explained well in [21], [22], and [23].

It is important to examine the GPU specification to know its computing power. In our work, we use an Intel Xeon E5620 2.4 GHz PC equipped with an NVIDIA Fermi C2050 GPU, running a 64-bit Linux operating system with CUDA version 4.0. Table II shows the specification of the Tesla C2050.
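The three phases of a CUDA program mentioned above can be sketched on the host side as follows; this is a minimal sketch under assumed names (stepKernel, a host array h of N floats, and the launch configuration grid and block), not the authors' code.

    /* Minimal host-side sketch of the three phases of a CUDA program:
       (1) copy input from CPU RAM to GPU RAM, (2) execute the kernel,
       (3) copy the result back.  stepKernel, h, N, grid, block are assumed. */
    float *d_h;
    size_t bytes = N * sizeof(float);
    cudaMalloc((void **)&d_h, bytes);
    cudaMemcpy(d_h, h, bytes, cudaMemcpyHostToDevice);   /* phase 1 */
    stepKernel<<<grid, block>>>(d_h);                     /* phase 2 */
    cudaMemcpy(h, d_h, bytes, cudaMemcpyDeviceToHost);    /* phase 3 */
    cudaFree(d_h);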

Fig. 9. The visualized image of our previous work showing tsunami propagation.

Fig. 10. The visualized image of our previous work showing tsunami inundation over an area at the north of Sumatera Island, indicated by the red circle.

V. GPU IMPLEMENTATION

In our previous work [1], we used the shader programming technique and utilized GPU texture memory and the frame buffer object extension for I/O communication between the CPU and GPU. We stored our initial data as textures in GPU memory and wrote the computing code as shader programs. With this technique, for 1000 time steps, we achieved a 2.2x speedup compared to the C implementation on an Intel Core 2 Duo 1.8 GHz laptop equipped with a GeForce 8400M G graphics card with 128 MB of video memory. Figs. 9 and 10 visualize our previous work on GPU-based tsunami simulation using the shader programming technique. Fig. 9 shows the propagation of the tsunami, while Fig. 10 shows the inundation zone over an area at the north of Sumatera Island.

For the CUDA implementation, we remove the visualization part

since we want to measure the exact execution time of the computing parts. This section shows the implementation of the tsunami propagation model using CUDA on a Fermi-generation GPU. The details of each step taken for performance improvement are explained in this section as well; this helps in understanding the differences between the Fermi generation and earlier generations.

In designing the CUDA implementation, we should look at the three phases included in a CUDA program: data transfer from CPU to GPU, execution of the kernel code, and output transfer from GPU to CPU. For the data transfer from CPU to GPU, we should choose suitable GPU memory for each piece of data based on its usage frequency in the kernel code. In our case, the initial h, u, and v are stored in global memory, while constant variables are sent to constant memory. In the early development of our CUDA code, the topography data and bed slope data are stored in texture memory since they are read-only data. In earlier GPU generations, texture memory is faster than global memory because of the texture cache in the multiprocessor. A new feature regarding global memory in the Fermi generation will be explained later.


Fig. 11. Illustration of domain for CUDA implementation.

Fig. 12. Illustration of shared memory arrays.

We also need shared memory arrays in order to minimize the use of global memory arrays. The size of the shared memory arrays depends on the dimensions of the thread-block and is defined in the kernel code. The shared memory arrays also have a halo region on each edge to accommodate the neighbor data needed in the predictor and corrector steps. Since it depends on the thread-block dimensions, the shared memory layout follows the tessellation of the computational domain. In our case, we divide the computational domain into pieces called tiles, as illustrated in Fig. 11. Each tile is handled by a thread-block, and each node in a tile is processed by a thread. Considering the warp size and global memory access, we set the thread-block dimensions to 16 × 12 × 1. This means there are 6 warps in a thread-block, and the dimensions of the grid are 5 × 15 × 1. With the additional halo region, the size of the shared memory arrays becomes 18 × 14 × 1, as illustrated in Fig. 12.

sider coalesced memory access. Since we store single preci-sion (float 32-bit) data array, a data segment in global memorywill be in 16 words order. The segment will provide a single64-bytes memory transaction for coalesced global memory ac-cess. One of conditions to satisfy the coalesced global memoryaccess is that threads in a half-warp must access the 16 wordsin sequence: the th thread in the half-warp must access the thword in a segment. We refer readers to [22] for details of co-alesced global memory access. In order to accommodate thiscondition, in the predictor step, we set thread 0 to handle theright halo region of the shared memory array. Accessing globalmemory then becomes coalesced since it accesses the word 0 ofanother segment. This is illustrated in Fig. 13 and also expressedat lines 16–18 in Fig. 14.Regarding the kernel code execution phase, in this early de-

velopment, we write 4 kernel codes to represent all computingsteps in Fig. 7. They are wave generation kernel, predictorkernel, corrector kernel, and filtering kernel. Accessing bedslope data in texture memory is expressed at line 24 in Fig. 15.Lines 27–29 in that figure express the equations (9)–(11).Similar way is also applied in corrector kernel code. Fig. 16

Fig. 13. Illustration of coalesced memory access in the predictor step.

Fig. 14. Snippet of the predictor-step kernel code showing coalesced memory access.

Fig. 15. Snippet of the predictor-step kernel code expressing (9)–(11).

shows the looping part of the main function. Lines 4–7 in Fig. 16 invoke the computing steps in order. The output transmission phase is written at line 8 in Fig. 16. The performance of this early CUDA code is shown in Table III. Values in the speedup


Fig. 16. Looping part in main function.

TABLE III
EARLY CUDA CODE PERFORMANCE

column are calculated by comparing the execution time of the C implementation in Table I with the execution time of the CUDA code for the corresponding number of time steps.

We then make some modifications to our CUDA code in order to get better performance. First, we compose all the computing steps into a single kernel. In CUDA, kernel invocation is done by code executed on the CPU; therefore, invoking several kernels consumes more time, and we try to reduce this overhead. The improved version of the CUDA code is shown in Fig. 17. We keep the output of the predictor step in shared memory (lines 30–32) and then synchronize the threads of the same thread block (line 33) in order to avoid read-after-write hazards. Before computing the corrector step, the boundary and shared memory halo regions need to be updated. Lines 41–43 express equations (12)–(14).

Fig. 18 shows the single kernel invocation in the main function.

Table IV shows the performance of our improved CUDA code. Values in the speedup column are calculated by comparing the results in Table I with the current results for the corresponding number of time steps.

In earlier GPU generations, accessing texture memory is faster than accessing global memory, and texture memory is often used to store read-only data. The Fermi generation then came with an L2 cache, which is larger than the L1 cache. By setting L2 to cache global memory, accessing global memory becomes faster than accessing texture memory. This feature is activated by declaring a flag during compilation. We take advantage of this and move the topography and bed slope data to global memory instead.

However, there are also some features of Fermi-generation GPUs that can make our CUDA code slower. On Fermi-generation GPUs, by default, the CUDA compiler generates IEEE-compliant code. This gives more precision in the results but makes the execution time slightly slower. NVCC provides additional flags that can be used to generate code closer to the code generated for earlier-generation GPUs: -ftz=true (denormalized numbers are flushed to

Fig. 17. The improved version of CUDA code.

Fig. 18. Single kernel invocation in main function.

TABLE IV
UPDATED PERFORMANCE—SINGLE KERNEL CODE

TABLE V
UPDATED PERFORMANCE—L2 CACHE AND LESS PRECISION FLAGS

zero), -prec-div=false (less precise division), and -prec-sqrt=false (less precise square root). This setting tends to give better performance than the default [24]. It does not fully follow IEEE compliance, but the numerical difference is negligibly small and meets our single-precision needs. Therefore, we also take advantage of this to improve the performance. Table V shows the updated performance for the current settings.


Fig. 19. Code snippet describing the use of CUDA streams and events in our CUDA implementation.

Fig. 20. Pipelining of kernel execution (blue) and asynchronous copy of output data back to CPU memory (green).

On Fermi-generation GPUs, CUDA allows the developer to manage the sequencing of work on the GPU through streams and CUDA events. A stream is a sequence of commands that execute in order, and CUDA events can be used to schedule the execution of streams. Overlapping kernel execution with asynchronous data transfer may reduce the total execution time. In our work, we take advantage of this feature when collecting the result of each time step.

Fig. 19 shows how we use CUDA streams and events in our CUDA implementation. We set the kernel execution in stream 0 (line 3). After that, we record event 0 on stream 0 (line 4) and then invoke cudaStreamWaitEvent so that stream 1 waits for event 0 (line 5). Once event 0 is complete, CUDA automatically proceeds with stream 1 (line 6). Stream 1 contains a command for the asynchronous copy of the output data back to CPU memory.

Fig. 20 illustrates the execution streams. Stream 0 (blue) executes our kernel. Stream 1 (green) waits for stream 0 to complete and then processes the asynchronous copy of the output data. In the meantime, stream 0 does not need to wait for stream 1 to complete; it can directly continue the kernel execution for the next time step. Therefore, we can hide the execution time of the asynchronous copy behind the next time step's kernel execution. The execution time of the asynchronous copy is counted only once, at the final time step. With this approach we achieve outstanding performance, as shown in Table VI.
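The stream and event arrangement described for Figs. 19–20 can be sketched as follows. This is a sketch under assumptions, not the authors' exact code: the double buffering (d_buf[0], d_buf[1]) and the per-buffer events are added here to keep the overlap free of data races, and fusedKernel, h_out, nSteps, bytes, grid, and block are assumed names. The host output buffers h_out[t] must be pinned (allocated with cudaHostAlloc) for a truly asynchronous copy.

    /* Overlap kernel execution (stream s0) with asynchronous copies of each
       step's output (stream s1), cf. Figs. 19-20. */
    cudaStream_t s0, s1;
    cudaEvent_t  stepDone, copyDone[2];
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaEventCreateWithFlags(&stepDone,    cudaEventDisableTiming);
    cudaEventCreateWithFlags(&copyDone[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&copyDone[1], cudaEventDisableTiming);

    for (int t = 0; t < nSteps; t++) {
        float *src = d_buf[t % 2], *dst = d_buf[(t + 1) % 2];
        cudaStreamWaitEvent(s0, copyDone[(t + 1) % 2], 0);  /* dst already copied out? */
        fusedKernel<<<grid, block, 0, s0>>>(src, dst);      /* compute step t          */
        cudaEventRecord(stepDone, s0);                      /* mark end of step t      */
        cudaStreamWaitEvent(s1, stepDone, 0);               /* copy waits for step t   */
        cudaMemcpyAsync(h_out[t], dst, bytes,
                        cudaMemcpyDeviceToHost, s1);        /* overlaps with step t+1  */
        cudaEventRecord(copyDone[(t + 1) % 2], s1);
    }
    cudaDeviceSynchronize();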

Fig. 21. Speedups in final CUDA implementation.

TABLE VI
UPDATED PERFORMANCE—CUDA STREAM AND EVENT

We record results for two different experiments. In the first experiment, we collect and store the result of each time step, as explained in this section; this means there is an I/O transfer in each time step. In the second experiment, we only collect and store the result of the final time step. Fig. 21 summarizes the performance for both experiments. The chart shows that, for 1000 time steps, we achieve a 223x speedup for the experiment with data I/O in each time step and an 818x speedup for the experiment with I/O at the final time step only. Our GPU has a total of 448 cores, but we were able to achieve higher speedups via CUDA. The advantage of using CUDA is the ability to access and manage the computing process on the GPU. One can set how many threads and registers are needed, and can choose in which memory to store data, since the GPU has several types of memory with different access costs. Thus, the speedup does not depend only on the number of GPU cores, but also on better utilization of the architecture and all the computing resources of the GPU. It is important to highlight that the speedup is reported with respect to the single-CPU case, owing to the many advantages and features offered by the GPU approach, some of which are mentioned in Section IV. It is common practice in the GPU HPC literature to report GPU speedup with respect to a single-threaded, one-core CPU counterpart. The multi-core, hyper-threaded CPU performance via OpenMP scales at best linearly with respect to the one-core single-threaded counterpart, so researchers can easily estimate the GPU speedup with respect to multi-core, hyper-threaded CPUs.

VI. CONCLUSION

In this paper, we have shown a CUDA implementation of a tsunami propagation model with the MacCormack scheme running on an NVIDIA Fermi C2050 GPU with 448 cores. The MacCormack scheme is a two-step explicit method that applies forward differences in the predictor step and backward differences in the corrector step. This scheme is therefore potentially parallelizable, and here we utilize the parallel resources provided by a graphics processing unit. We have also discussed and explained all the improvement steps and optimization techniques that helped our CUDA code achieve better performance on a Fermi-generation GPU. We suggest minimizing the number of kernel functions, as each kernel invocation consumes additional time. We also explain how to take advantage of new optimization options on Fermi-generation GPUs: allowing L2 to cache global memory, and using reduced numerical precision. The most interesting part is how we use the CUDA stream and event features to overlap kernel execution with asynchronous data transfer; this hides the execution time of the asynchronous copy from GPU to CPU behind the next time step's kernel execution. Our final code achieves a 223x speedup for the experiment with data I/O in each time step and an 818x speedup for the experiment with I/O at the final time step only, compared to the original C code. The MacCormack scheme is a classical shock-capturing method that can be used to solve the shallow water equations. Future work will include GPU acceleration of other modern shock-capturing methods such as Monotonic Upstream-centered Schemes for Conservation Laws (MUSCL) [25], Total Variation Diminishing (TVD) schemes [26], Essentially Non-Oscillatory (ENO) schemes [27], and the Piecewise Parabolic Method (PPM) [28].

REFERENCES

[1] W.-Y. Liang, T.-J. Hsieh, M. T. Satria, Y.-L. Chang, J.-P. Fang, C.-C. Chen, and C.-C. Han, "GPU-based simulation of tsunami propagation and inundation," in Algorithms and Architectures for Parallel Processing. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 593–603, Lecture Notes in Computer Science 5574.

[2] C. A. Lee, S. D. Gasster, A. Plaza, C.-I. Chang, and B. Huang, "Recent developments in high performance computing for remote sensing: A review," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, p. 508, Sep. 2011.

[3] E. Christophe, J. Michel, and J. Inglada, "Remote sensing processing: From multicore to GPU," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, p. 643, Sep. 2011.

[4] C.-C. Chang, Y.-L. Chang, M.-Y. Huang, and B. Huang, "Accelerating regular LDPC code decoders on GPUs," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, p. 653, Sep. 2011.

[5] S.-C. Wei and B. Huang, "GPU acceleration of predictive partitioned vector quantization for ultraspectral sounder data compression," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, p. 677, Sep. 2011.

[6] C. Song, Y. Li, and B. Huang, "A GPU-accelerated wavelet decompression system with SPIHT and Reed-Solomon decoding for satellite images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, p. 683, Sep. 2011.

[7] J. Mielikainen, B. Huang, and H.-L. A. Huang, "GPU-accelerated multi-profile radiative transfer model for the infrared atmospheric sounding interferometer," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS), vol. 4, no. 3, p. 691, Sep. 2011.

[8] W.-Y. Tan, Shallow Water Hydrodynamics: Mathematical Theory and Numerical Solution for a Two-Dimensional System of Shallow Water Equations. Amsterdam: Elsevier Science Publishers, 1992.

[9] T. R. Hagen, J. M. Hjelmervik, K.-A. Lie, J. R. Natvig, and M. O. Henriksen, "Visual simulation of shallow-water waves," Simulation Modelling Practice and Theory, vol. 13, no. 8, pp. 716–726, 2005.

[10] M. Kass and G. Miller, "Rapid, stable fluid dynamics for computer graphics," ACM SIGGRAPH Computer Graphics, vol. 24, no. 4, pp. 49–57, 1990.

[11] A. T. Layton and M. van de Panne, "A numerically efficient and stable algorithm for animating water waves," The Visual Computer, vol. 18, no. 1, pp. 41–53, 2002.

[12] T. T. Kuo and Z. C. Shih, "The simulation of tsunami wave propagation," in IEEE Int. Symp. Multimedia Workshops, 2007, pp. 3–8.

[13] O. N. Karsten and P. Trier, "Implementing Rapid, Stable Fluid Dynamics on the GPU," 2004 [Online]. Available: http://projects.n-o-e.dk/GPUwatersimulation/gpu-water.pdf

[14] A. R. Brodtkorb, T. R. Hagen, K.-A. Lie, and J. R. Natvig, "Simulation and visualization of the Saint-Venant system using GPUs," Computing and Visualization in Science, vol. 13, Berlin, Heidelberg: Springer-Verlag, Oct. 2010.

[15] R. W. MacCormack, "The effect of viscosity in hypervelocity impact cratering," in Frontiers of Computational Fluid Dynamics 2002, D. A. Caughey and M. M. Hafez, Eds. Singapore: World Scientific, 2002, pp. 27–43.

[16] J. C. Tannehill, D. A. Anderson, and R. H. Pletcher, Computational Fluid Mechanics and Heat Transfer, 2nd ed. New York: Taylor & Francis, 1997.

[17] J. G. Zhou, Lattice Boltzmann Methods for Shallow Water Flows. New York: Springer, 2004.

[18] V. T. Chow, Open Channel Hydraulics. New York: McGraw-Hill, 1959.

[19] Z. Kowalik and T. S. Murty, Numerical Modeling of Ocean Dynamics, Advanced Series on Ocean Engineering. Singapore: World Scientific, 1993.

[20] K. Karimi, N. G. Dickson, and F. Hamze, "A Performance Comparison of CUDA and OpenCL," [Online]. Available: http://arxiv.org/pdf/1005.2581

[21] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st ed. New York: Addison-Wesley Professional, Jul. 2010.

[22] NVIDIA, NVIDIA CUDA C Programming Guide, Jun. 2011.

[23] NVIDIA, CUDA C Best Practices Guide, May 2011.

[24] NVIDIA, Tuning CUDA Applications for Fermi, May 2011.

[25] B. van Leer, "Towards the ultimate conservative difference scheme. V. A second-order sequel to Godunov's method," J. Comput. Phys., vol. 32, pp. 101–136, 1979.

[26] A. Harten, "High resolution schemes for hyperbolic conservation laws," J. Comput. Phys., vol. 49, pp. 357–393, 1983.

[27] A. Harten, B. Engquist, S. Osher, and S. R. Chakravarthy, "Uniformly high order accurate essentially non-oscillatory schemes, III," J. Comput. Phys., vol. 71, pp. 231–303, 1987.

[28] P. Colella and P. Woodward, "The piecewise parabolic method (PPM) for gas-dynamical simulations," J. Comput. Phys., vol. 54, pp. 174–201, 1984.

Muhammad T. Satria received the B.S. degree in mathematics from Institut Teknologi Bandung and the M.S. degree in computer engineering from the National Taipei University of Technology in 2010. He is currently a research intern at the Space Science and Engineering Center, University of Wisconsin-Madison. He is interested in high-performance computing, parallel computing, and GPU-based computing.

Bormin Huang received the M.S.E. degree in aerospace engineering from the University of Michigan, Ann Arbor, and the Ph.D. degree in the area of satellite remote sensing from the University of Wisconsin-Madison. He was with NASA Langley Research Center during 1998–2001 for the NASA New Millennium Program's Geosynchronous Imaging Fourier Transform Spectrometer (GIFTS). He is currently a research scientist and principal investigator at the Space Science and Engineering Center at the University of Wisconsin-Madison, where he advises and supports both national and international M.S. and Ph.D. students and visiting scientists. He has authored or coauthored over 100 scientific and technical publications, including the book Satellite Data Compression (Springer, 2011). He has broad interest and experience in remote sensing science and technology, including satellite data compression and communications, remote sensing image processing, remote sensing forward modeling and inverse problems, and high-performance computing in remote sensing.

Dr. Huang has served as a Chair for the SPIE Conference on Satellite Data Compression, Communications, and Processing since 2005, a Chair for the SPIE Europe Conference on High-Performance Computing in Remote Sensing since 2011, and a Chair for the 2011 IEEE International Workshop on Parallel and Distributed Computing in Remote Sensing. He also serves as an Associate Editor for the Journal of Applied Remote Sensing, an Editor for the Journal of Geophysics and Remote Sensing, and a Guest Editor for the special section on High-Performance Computing in the Journal of Applied Remote Sensing. He has been a Session Chair or Program Committee member for several IEEE and SPIE conferences.


Tung-Ju Hsieh received the Ph.D. degree in electrical and computer engineering from the University of California, Irvine, in 2006. He is an Assistant Professor in the Department of Computer Science and Information Engineering, National Taipei University of Technology. Prior to his current role, he was a postdoc at the California Institute for Telecommunications and Information Technology (Calit2). His research interests and research topics include high-performance computing, parallel visualization, scientific visualization, and computer graphics. His research is aimed at using real-time visualization technology to explore massive three-dimensional scientific simulation and sensing data.

Yang-Lang Chang received the B.S. degree in electrical engineering from Chung Yuan Christian University, Taiwan, in 1987, the M.S. degree in computer engineering from Syracuse University, Syracuse, NY, in 1993, and the Ph.D. degree in computer science and information engineering from the National Central University, Taiwan, in 2003. He started his career with NDC IBM Taiwan as a Hardware Design Engineer before joining ALCATEL as a Software Development Engineer, and presently is an Associate Professor in the Department of Electrical Engineering, National Taipei University of Technology. His research interests are in remote sensing, high performance computing, and hyperspectral image analysis.

Dr. Chang is a Senior Member of IEEE, a Member of SPIE, the Phi Tau Phi Scholastic Honor Society, the Chinese Society of Photogrammetry and Remote Sensing, and the Chinese Society of Image Processing and Pattern Recognition. He has been a Conference Program Committee member and Session Chair for several international conferences. He is a member of the Editorial Advisory Board of the Open Remote Sensing Journal. He served as the Guest Editor for the special issue on High Performance Computing in Earth Observation and Remote Sensing in the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS). Currently, he serves as the Chapter Chair of the Taipei Chapter of the IEEE Geoscience and Remote Sensing Society. He was the recipient of the Best Reviewer Award of the IEEE JSTARS in 2010.

Wen-Yew Liang received the B.S. degree in computer science and engineering from Tatung Institute of Technology, Taiwan, in 1992, the M.S. degree in computer science from National Tsing Hua University, Taiwan, in 1994, and the Ph.D. degree in computer science and information engineering from National Taiwan University in 1998. From 1998 to 2000, he served as a communication officer in the Department of Defense, Taiwan. He worked for Avant Corporation as an EDA software engineer from 2000 to 2001. From 2001 to 2004, he was with an embedded system design company, WISCORE Inc., where he held the positions of R&D staff engineer, department manager, and vice president. From 2004 to 2005, he moved to St. John's University as an Assistant Professor in the Department of Computer Science and Information Engineering. Currently, he is an Assistant Professor in the Department of Computer Science and Information Engineering, National Taipei University of Technology. His research interests include embedded systems, low power system design, parallel and distributed systems, mobile computing, and parallelization for geosciences and other scientific applications. In addition to this research, he also devotes himself to the development, promotion, and professional education of Android and Linux systems.

Dr. Liang is a member of ACM, the IEEE Computer Society, and IICM.