
Parallel 5-point SOR for solving the Convection Diffusion equation using graphics processing units

Y. Cotronis, E. Konstantinidis, M. A. Louka, N. M. Missirlis

Department of Informatics and Telecommunications, University of Athens, Panepistimiopolis, 15784, Athens, Greece

Abstract

In this paper we study a parallel form of the SOR method for the numerical solution of the Convection Diffusion equation suitable for GPUs using CUDA. To exploit the parallelism offered by GPUs we consider the fine grain parallelism model. This is achieved by considering the local relaxation version of SOR. More specifically, we use SOR with red-black ordering with two sets of parameters ω_ij and ω′_ij for the 5-point stencil. The parameter ω_ij is associated with each red (i + j even) grid point (i, j), whereas the parameter ω′_ij is associated with each black (i + j odd) grid point (i, j). The use of a parameter for each grid point avoids the global communication required in the adaptive determination of the best value of ω and also increases the convergence rate of the SOR method [2]. We present our strategy and the results of our effort to exploit the computational capabilities of GPUs under the CUDA environment. Additionally, a parallel program utilizing manual SSE2 (Streaming SIMD Extensions 2) vectorization for the CPU was developed as a performance reference. The optimizations applied on the GPU version were also considered for the CPU version. Significant performance improvement was achieved with the three developed GPU kernel variations.

Keywords: Iterative methods, SOR, R/B SOR, GPU computing, CUDA
Subject classification: AMS(MOS) 65F10, 65N20; CR: 5.13.

1. Introduction

Although CPUs have traditionally been used to solve computational problems, the current trend is to offload heavy computations to accelerators and particularly to graphics processing units (GPUs). As commodity GPUs offer increased computational power compared to modern CPUs, they are proposed as a more efficient compute unit for solving scientific problems with a large computational burden. Thus, application programming environments have been developed, such as the proprietary CUDA (Compute Unified Device Architecture) by NVidia [17, 13] and OpenCL (Open Computing Language) [22], which is supported by many hardware vendors, including NVidia. The CUDA environment is rapidly evolving and a constantly increasing number of researchers are adopting it in order to exploit GPU capabilities. It provides an elegant way of writing GPU parallel programs, using syntax and rules based on the C/C++ language, without involving other graphics APIs. In this paper we use GPUs for the numerical solution of partial differential equations.

In particular, we consider the solution of the second order convection diffusion equation

\Delta u - f(x, y)\frac{\partial u}{\partial x} - g(x, y)\frac{\partial u}{\partial y} = 0    (1)

∗Corresponding author
Email addresses: [email protected] (Y. Cotronis), [email protected] (E. Konstantinidis), [email protected] (M. A. Louka), [email protected] (N. M. Missirlis)

Preprint submitted to Elsevier, October 30, 2012

on a domain Ω = {(x, y) | 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}, where u = u(x, y) is prescribed on the boundary ∂Ω. The discretization of (1) on a rectangular grid with M_1 × M_2 = N unknowns within Ω leads to

u_{ij} = \ell_{ij} u_{i-1,j} + r_{ij} u_{i+1,j} + t_{ij} u_{i,j+1} + b_{ij} u_{i,j-1},    (2)
i = 1, 2, \ldots, M_1, \quad j = 1, 2, \ldots, M_2

with

\ell_{ij} = \frac{k^2}{2(k^2 + h^2)}\Big(1 + \frac{1}{2} h f_{ij}\Big), \quad r_{ij} = \frac{k^2}{2(k^2 + h^2)}\Big(1 - \frac{1}{2} h f_{ij}\Big),
t_{ij} = \frac{h^2}{2(k^2 + h^2)}\Big(1 - \frac{1}{2} k g_{ij}\Big), \quad b_{ij} = \frac{h^2}{2(k^2 + h^2)}\Big(1 + \frac{1}{2} k g_{ij}\Big),    (3)

where h = 1/(M_1 + 1), k = 1/(M_2 + 1), f_{ij} = f(ih, jk) and g_{ij} = g(ih, jk). For a particular ordering of the grid points, (2) yields a large, sparse linear system of equations of order N of the form

Au = b.    (4)

The Successive Overrelaxation (SOR) iterative method, which is given by

u_{ij}^{(n+1)} = (1 - \omega) u_{ij}^{(n)} + \omega\Big(\ell_{ij} u_{i-1,j}^{(n+1)} + r_{ij} u_{i+1,j}^{(n)} + t_{ij} u_{i,j+1}^{(n)} + b_{ij} u_{i,j-1}^{(n+1)}\Big)    (5)

is an important solver for large linear systems [15], [16]. It is also a robust smoother as well as an efficient solver of the coarsest grid equations in the multigrid method. However, the SOR method is essentially sequential in its original form. Several parallel versions of the SOR method have been studied by coloring the grid points [1], [14].

In order to use a parallel form of the SOR method with fine grain parallelism we have to color the grid points red-black [1], [14], so that sets of points of the same color can be computed in parallel. However, the parameter ω which accelerates the rate of convergence of SOR is computed adaptively in terms of u^{(n+1)} and u^{(n)} [8]. This computation requires global communication between the processors at each iteration, which means O(√N) communication complexity for a mesh connected √N × √N array of processors. As the number of iterations for SOR is proportional to O(√N) [16], [8], it follows that its execution time is O(N) again. To overcome this problem, local relaxation methods are used [5], [6], [11]. In these methods each point in the grid has its own relaxation parameter ω_ij, which is determined in terms of the local coefficients of the PDE. In [3], [5], [6], the local SOR (LSOR) method with different formulas for the optimum values of the relaxation parameters was studied numerically and compared with the classic SOR method for the 5-point stencil. It was found that LSOR possesses a better convergence rate than SOR while using only local communication. However, the first theoretical results about the convergence of LSOR were presented in [11] under the assumption that the coefficient matrix is symmetric and positive definite. Following a similar approach but using two different sets of parameters ω_ij and ω′_ij for the red and black points, respectively, it was proved in [2] that the local Modified SOR method (LMSOR) possesses a better rate of convergence than LSOR for the 5-point stencil. This comparison was carried out for the cases where the eigenvalues of the Jacobi matrix are either all real (real case) or all imaginary (imaginary case).

The SOR method has been implemented on GPUs as applied to medical analysis [7] as well as to computational fluid dynamics [9] problems. Our contribution is to explore the LMSOR method for the 5-point stencil, exploiting the computational capabilities of GPUs under the CUDA environment. In our study we used three different techniques in order to find the one that best exploits the capabilities of the GPU.

This paper extends already published work on applying the red-black SOR method on GPUs to solve the Convection Diffusion equation [4]. In the present work the derivation of the optimum local relaxation factors is analysed in more detail, the GPU kernels have been further fine-tuned and the CPU implementation includes a parallel OpenMP version utilizing manual SSE2 vectorization. Various optimizations that had already been applied to the GPU kernels are shown to be beneficial to the CPU implementation as well.

The remainder of the paper is organized as follows. In section 2 we present a general description of the LMSOR method for the 5-point stencil. In section 3 we present its implementation on GPUs, in section 4 we present our performance results and finally, in section 5, we state our remarks and conclusions.

2. The 5-point local Modified SOR method

The LSOR method was introduced by Ehrlich [5], [6] and Botta and Veldman [3] in an attempt to further increase the rate of convergence of SOR. The idea is based on letting the relaxation factor ω vary from equation to equation. This means that each equation of (2) has its own relaxation parameter, denoted by ω_ij. Kuo et al. [11] combined LSOR with red-black ordering and showed that it is suitable for parallel implementation on mesh connected processor arrays. In [2] we generalized LSOR by letting two different sets of parameters ω_ij, ω′_ij be used for the red (i + j even) and black (i + j odd) points, respectively.

Let us assume that A is an N × N matrix with nonvanishing diagonal elements possessing the splitting

A = D - C_L - C_U    (6)

where D is the diagonal part of A and −C_L, −C_U are the lower and upper triangular parts of A, respectively. For the numerical solution of (4) we define the local Modified SOR method as follows

u^{(n+1)} = (I - \Omega) u^{(n)} + \Omega (L u^{(n+1)} + U u^{(n)} + c)    (7)

where Ω = diag(ω_1, ω_2, \ldots, ω_N), with ω_i, i = 1, 2, \ldots, N, real numbers,

L = D^{-1} C_L, \quad U = D^{-1} C_U \quad \text{and} \quad c = D^{-1} b.    (8)

Alternatively, (7) takes the form

u^{(n+1)} = \mathcal{L}_{\Omega} u^{(n)} + (I - \Omega L)^{-1} \Omega c    (9)

where

\mathcal{L}_{\Omega} = I - (I - \Omega L)^{-1} \Omega D^{-1} A.    (10)

Note that if Ω = ωI, then (7) reduces to the classical SOR method. Next, we study the local SOR method with red-black ordering. In this case the matrix A possesses the form [16]

A = \begin{pmatrix} D_R & H \\ K & D_B \end{pmatrix}    (11)

where H is an m × r submatrix, K = H^T and D_R, D_B are m × m and r × r diagonal submatrices, respectively, with m + r = N. For the partitioning

D = \begin{pmatrix} D_R & 0 \\ 0 & D_B \end{pmatrix}, \quad C_L = \begin{pmatrix} 0 & 0 \\ -K & 0 \end{pmatrix}, \quad C_U = \begin{pmatrix} 0 & -H \\ 0 & 0 \end{pmatrix}    (12)

u = (u_R, u_B)^T and b = (b_R, b_B)^T, where u_R, b_R are m × 1 vectors and u_B, b_B are r × 1 vectors. It is readily verified that

R = I - \Omega L = \begin{pmatrix} I_R & 0 \\ -\Omega_B F_B & I_B \end{pmatrix}    (13)

where

\Omega = \begin{pmatrix} \Omega_R I_R & 0 \\ 0 & \Omega_B I_B \end{pmatrix},    (14)

Ω_R, Ω_B are m × m and r × r diagonal submatrices which correspond to the red and black grid points, respectively, and F_B = -D_B^{-1} K. Thus, because of (13) and (14), the local SOR method (see (9)) takes the form

\begin{pmatrix} u_R^{(n+1)} \\ u_B^{(n+1)} \end{pmatrix} = \begin{pmatrix} u_R^{(n)} \\ u_B^{(n)} \end{pmatrix} + \begin{pmatrix} I_R & 0 \\ -\Omega_B F_B & I_B \end{pmatrix}^{-1} \begin{pmatrix} \Omega_R & 0 \\ 0 & \Omega_B \end{pmatrix} \left[ \begin{pmatrix} c_R \\ c_B \end{pmatrix} - \begin{pmatrix} I_R & -F_R \\ -F_B & I_B \end{pmatrix} \begin{pmatrix} u_R^{(n)} \\ u_B^{(n)} \end{pmatrix} \right]    (15)

where F_R = -D_R^{-1} H, c_R = D_R^{-1} b_R and c_B = D_B^{-1} b_B. The matrix form of the local SOR method is given by

u^{(n+1)} = \mathcal{L}_{\Omega_R,\Omega_B} u^{(n)} + k_{\Omega_R,\Omega_B},    (16)


where

\mathcal{L}_{\Omega_R,\Omega_B} = \begin{pmatrix} I_R & 0 \\ -\Omega_B F_B & I_B \end{pmatrix}^{-1} \begin{pmatrix} I_R - \Omega_R & \Omega_R F_R \\ 0 & I_B - \Omega_B \end{pmatrix}    (17)

and

k_{\Omega_R,\Omega_B} = \begin{pmatrix} \Omega_R c_R \\ \Omega_B c_B \end{pmatrix}.    (18)

Writing (16) in a computable form we have

u_R^{(n+1)} = (I_R - \Omega_R) u_R^{(n)} + \Omega_R F_R u_B^{(n)} + \Omega_R c_R
u_B^{(n+1)} = \Omega_B F_B u_R^{(n+1)} + (I_B - \Omega_B) u_B^{(n)} + \Omega_B c_B.    (19)

We will refer to the iterative scheme (19) as the local Modified SOR (LMSOR) method, since if Ω_R = Ω_B then (19) reduces to the local SOR method [6], [11], [12]. Evidently, if Ω_R = Ω_B = ωI, then (19) yields the classical SOR method. Letting Ω_R = Ω_B = I, (19) gives

u_R^{(n+1)} = F_R u_B^{(n)} + c_R
u_B^{(n+1)} = F_B u_R^{(n+1)} + c_B    (20)

which is the Gauss–Seidel (GS) method with red–black ordering. The SOR method for the 5-point discretization of (2) is given by (5). We call a grid point (i, j) red when i + j is even and black when i + j is odd. For the local MSOR method the formulas for the 5-point stencil are (see (19))

u_{ij}^{(n+1)} = (1 - \omega_{ij}) u_{ij}^{(n)} + \omega_{ij} J_{ij} u_{ij}^{(n)}, \quad \text{for } i + j \text{ even}
u_{ij}^{(n+1)} = (1 - \omega'_{ij}) u_{ij}^{(n)} + \omega'_{ij} J_{ij} u_{ij}^{(n)}, \quad \text{for } i + j \text{ odd}    (21)

where

J_{ij} u_{ij}^{(n)} = \ell_{ij} u_{i-1,j}^{(n)} + r_{ij} u_{i+1,j}^{(n)} + t_{ij} u_{i,j+1}^{(n)} + b_{ij} u_{i,j-1}^{(n)}    (22)

and J_{ij} is called the local Jacobi operator. Note that, per (19), the black (i + j odd) updates use the freshly computed red values of iteration n + 1. The parameters ω_ij, ω′_ij are called local relaxation parameters and (21) will be referred to as the local Modified SOR (LMSOR) method. If ω_ij = ω′_ij, then (21) reduces to the LSOR method studied in [11]. Moreover, if ω_ij = ω′_ij = ω, (21) degenerates into the classical SOR method with red-black ordering. Using Fourier analysis, Boukas and Missirlis [2] proved that the optimum values of the local relaxation parameters ω_{1,ij} and ω_{2,ij} for the LMSOR method, in case the eigenvalues µ_ij of the local Jacobi operator J_ij are all real or all imaginary, are the following.

Case 1: µ_ij are real. This case applies when ℓ_ij r_ij ≥ 0 and t_ij b_ij ≥ 0. The optimum values of the LMSOR parameters are given by

\omega_{1,ij} = \frac{2}{1 - \mu_{ij}\underline{\mu}_{ij} + \sqrt{(1 - \mu_{ij}^2)(1 - \underline{\mu}_{ij}^2)}} \quad \text{and} \quad \omega_{2,ij} = \frac{2}{1 + \mu_{ij}\underline{\mu}_{ij} + \sqrt{(1 - \mu_{ij}^2)(1 - \underline{\mu}_{ij}^2)}}    (23)

where

\mu_{ij} = 2\left(\sqrt{\ell_{ij} r_{ij}}\, \cos \pi h + \sqrt{t_{ij} b_{ij}}\, \cos \pi k\right)    (24)

and

\underline{\mu}_{ij} = 2\left(\sqrt{\ell_{ij} r_{ij}}\, \cos \frac{\pi(1 - h)}{2} + \sqrt{t_{ij} b_{ij}}\, \cos \frac{\pi(1 - k)}{2}\right).    (25)


Note that µ_ij is the spectral radius of the local Jacobi operator J_ij, where

\mu_{ij} = 2\left(\sqrt{\ell_{ij} r_{ij}}\, \cos \frac{k_1 \pi}{M_1 + 1} + \sqrt{t_{ij} b_{ij}}\, \cos \frac{k_2 \pi}{M_2 + 1}\right),    (26)

with k_1 = 1, 2, \ldots, M_1, k_2 = 1, 2, \ldots, M_2, for periodic boundary conditions.

Case 2: µ_ij are imaginary. This case applies when ℓ_ij r_ij ≤ 0 and t_ij b_ij ≤ 0. The optimum values of the LMSOR parameters are given by

\omega_{1,ij} = \frac{2}{1 - \mu_{ij}\underline{\mu}_{ij} + \sqrt{(1 + \mu_{ij}^2)(1 + \underline{\mu}_{ij}^2)}} \quad \text{and} \quad \omega_{2,ij} = \frac{2}{1 + \mu_{ij}\underline{\mu}_{ij} + \sqrt{(1 + \mu_{ij}^2)(1 + \underline{\mu}_{ij}^2)}}    (27)

where µ_ij and µ̲_ij are computed by (24) and (25), respectively.
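To make the use of (23)-(25) concrete, the following sketch computes the two local relaxation parameters of a single grid point for the real case; the function name and the host/device qualifiers are our own illustrative assumptions, not the authors' code.

    #include <math.h>

    /* Sketch: optimum local LMSOR parameters for the real case (eqs. (23)-(25)).
       Assumes l*r >= 0, t*b >= 0 and |mu| < 1, so all square roots are real;
       h and k are the mesh spacings. */
    __host__ __device__ void lmsor_omega_real(double l, double r, double t, double b,
                                              double h, double k,
                                              double *w1, double *w2)
    {
        const double PI = 3.14159265358979323846;
        double slr = sqrt(l * r), stb = sqrt(t * b);
        double mu  = 2.0 * (slr * cos(PI * h) + stb * cos(PI * k));       /* eq. (24) */
        double mu_ = 2.0 * (slr * cos(PI * (1.0 - h) / 2.0)
                          + stb * cos(PI * (1.0 - k) / 2.0));             /* eq. (25) */
        double s = sqrt((1.0 - mu * mu) * (1.0 - mu_ * mu_));
        *w1 = 2.0 / (1.0 - mu * mu_ + s);                                 /* eq. (23) */
        *w2 = 2.0 / (1.0 + mu * mu_ + s);
    }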

3. Parallel Implementation

Implementations of the LMSOR method were developed for both the CPU and the GPU. A parallel program for the CPU was implemented as a performance reference utilizing the OpenMP [23] shared memory programming standard. The same program was also used as a sequential one by disabling the respective OpenMP compiler flag. In contrast, the GPU version is a massively parallel program, in which fine grained parallelism is employed by assigning only a few elements for computation per thread. The results focus on comparing the CPU as a multicore processor with the GPU as a manycore processor.

As the solution of the Laplace equation using the red/black SOR method is memory bound [10], this problem can also be characterized as memory bound. Inspecting the computations in (21) and (22) we note that, in case of good cache behavior, each element needs to access roughly 8 elements (accesses to u_{i,j} count as 3, being 2 reads plus 1 write) per iteration (either red or black). The same computation reveals that 11 floating point operations are needed per element. Thus, the ratio of floating point operations per accessed element is 11/8 and, considering the use of double precision arithmetic, the ratio of floating point operations per byte accessed is 11/(8 × 8) ≈ 0.17. This ratio is quite low, as GPUs are suited to codes of much higher compute intensity. For instance, the NVidia GTX480 has a double precision peak performance of 168 GFlops and a memory throughput of 177 GB/sec. Thus, a balanced algorithm should perform at least ≈ 1 double precision floating point operation per byte accessed. It should be noted that double precision arithmetic was used in all programs developed in this work.

As our program implements a red/black ordering, it is beneficial to reorder the points by color into separate matrices in order to optimize performance through coalescing [10], [18]. Points in the mesh are split into two different matrices, one for the red points and one for the black. This strategy can improve bandwidth utilization by improving locality and coalescing of memory accesses and, most importantly, by utilizing all points contained in a memory segment, which is not possible with a naturally interleaved red/black ordering.

Moreover, our program utilizes 6 matrices during the computation procedure (u, ω, l, r, t and b), as formulae (21) and (22) indicate, all of which feature a red/black ordering. As the reordering strategy can be applied to every red-black ordered matrix, it is possible to apply it to all 6 matrices, so accesses to all of them benefit from the reordered memory structure. This raises the importance of the reordering by color strategy; a sketch of the index mapping is given below.
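The reordering itself is a one-off host-side pass. The following minimal sketch shows one possible mapping from an interleaved n × n grid to two color-separated n × n/2 matrices; the row-major layout and the function name are illustrative assumptions, not the authors' exact scheme.

    /* Sketch: split an interleaved n x n grid (row-major, n even) into two
       color-separated matrices of n rows by n/2 columns each. */
    void reorder_by_color(const double *grid, double *red, double *black, int n)
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double v = grid[i * n + j];
                if ((i + j) % 2 == 0)
                    red[i * (n / 2) + j / 2] = v;    /* each row holds n/2 red points */
                else
                    black[i * (n / 2) + j / 2] = v;
            }
        }
    }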

It should be noted that all implementations perform convergence checking on every iteration, which raises the execution overhead. Convergence checking in the GPU version is implemented as a reduction of all computed maximum values (a sketch is given below). In a production environment convergence checking should be minimized in order to attain peak performance.
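Such a reduction can take the usual tree-based shape. The sketch below is a simplified illustration assuming one partial maximum per thread and a fixed block size of 256; it produces one maximum per block, and the host (or a second launch) reduces the per-block results and tests them against the convergence threshold.

    /* Sketch: per-block max-reduction of thread-local maxima;
       block_max[] receives one partial result per block. */
    __global__ void reduce_max(const double *local_max, double *block_max, int n)
    {
        __shared__ double s[256];               /* blockDim.x == 256 assumed */
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? local_max[i] : 0.0;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                s[tid] = fmax(s[tid], s[tid + stride]);
            __syncthreads();
        }
        if (tid == 0)
            block_max[blockIdx.x] = s[0];
    }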

In order to alleviate the high memory bandwidth requirements of the program, an alternative approach can be used. Some read-only matrices, whose elements are computed during program initialization, can be eliminated by replacing accesses to them with recomputation inside the GPU kernel. As the GPU has a very high instruction throughput, this trade-off can be beneficial.

In summary, LMSOR was implemented in three variations as three different kernels. The variations differ in the amount of redundant computation they perform on each iteration.

CPU implementation. The CPU version is an optimized implementation utilizing the OpenMP API in order to exploit all available processing units, and the core computation code is written using SSE2 intrinsics, vectorizing both accesses and computations. Elements are processed in blocks of data assigned to threads and each thread computes multiple values in pairs using SIMD instructions, as the SSE2 instruction set operates on double precision floating point values in pairs (a sketch is given below).

In order to accomplish SSE vectorization, locality of accesses has to be enforced. For this reason the reordering by color strategy applied to the GPU kernels was also applied to the CPU version; the memory accesses thus become sequential and vectorization is feasible.

Furthermore, a partial recomputation strategy similar to the one described below for GPU kernel #2 was applied. The CPU compute potential is also higher than its memory bandwidth and, through experimentation, it was found that the CPU program also benefits from recomputation.
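To give a flavor of the manual vectorization, the following fragment is our own illustrative reconstruction (not the authors' code) of the core update of two consecutive same-color points with SSE2 intrinsics; the pointer names for the reordered neighbor and coefficient arrays are assumptions.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Sketch: update points u[i], u[i+1] of one color with the weighted
       4-neighbour sum of eq. (22) followed by the relaxation of eq. (21).
       uW/uE/uN/uS point at the opposite-color neighbour values, already
       shifted so that index i lines up; w holds the local omegas. */
    static inline void lmsor_pair(double *u,
                                  const double *uW, const double *uE,
                                  const double *uN, const double *uS,
                                  const double *l, const double *r,
                                  const double *t, const double *b,
                                  const double *w, int i)
    {
        __m128d jac = _mm_add_pd(
            _mm_add_pd(_mm_mul_pd(_mm_loadu_pd(l + i), _mm_loadu_pd(uW + i)),
                       _mm_mul_pd(_mm_loadu_pd(r + i), _mm_loadu_pd(uE + i))),
            _mm_add_pd(_mm_mul_pd(_mm_loadu_pd(t + i), _mm_loadu_pd(uN + i)),
                       _mm_mul_pd(_mm_loadu_pd(b + i), _mm_loadu_pd(uS + i))));
        __m128d wi = _mm_loadu_pd(w + i);
        __m128d ui = _mm_loadu_pd(u + i);
        /* u = (1 - w)*u + w*jac, cf. eq. (21) */
        _mm_storeu_pd(u + i, _mm_add_pd(_mm_sub_pd(ui, _mm_mul_pd(wi, ui)),
                                        _mm_mul_pd(wi, jac)));
    }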

GPU Kernel #1 - No redundant computations. All values required in (21) and (22) reside in matrices in the GPU device memory (matrices corresponding to the elements u_{i,j}, ω_{i,j}, l_{i,j}, r_{i,j}, t_{i,j} and b_{i,j}). Beyond employing the reordering by color strategy, this kernel is the natural implementation, as no extra computations are performed. Thus, the 6 aforementioned matrices are required in this scheme, with 8 element accesses per computed element. Each GPU thread computes just two elements, as this was found to be the best option through experimentation. As shown above, the ratio of floating point operations per byte accessed is 0.17, which is particularly low. A hedged sketch of such a kernel follows.
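The following is a minimal sketch of a kernel of this kind for a red sweep over color-separated matrices. It handles one interior element per thread; the exact indexing, boundary treatment and two-elements-per-thread unrolling of the actual kernels are not reproduced, so the names and the row-major n × n/2 layout should be read as assumptions.

    /* Sketch: one red LMSOR update on color-separated matrices of n rows by
       half = n/2 columns; u_blk holds the black points, w the local omegas. */
    __global__ void lmsor_red(double *u_red, const double *u_blk,
                              const double *l, const double *r,
                              const double *t, const double *b,
                              const double *w, int n)
    {
        int half = n / 2;
        int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column in red matrix */
        int i = blockIdx.y * blockDim.y + threadIdx.y;   /* grid row */
        if (i < 1 || i >= n - 1 || j < 1 || j >= half - 1)
            return;                                      /* interior points only */
        int idx = i * half + j;
        int sh = i & 1;   /* horizontal neighbours shift with the row parity */
        /* Local Jacobi operator, eq. (22): the u_{i-1,j}/u_{i+1,j} neighbours
           sit one row up/down (+-half); the u_{i,j-1}/u_{i,j+1} ones sit at
           idx-1+sh and idx+sh in the black matrix. */
        double jac = l[idx] * u_blk[idx - half] + r[idx] * u_blk[idx + half]
                   + t[idx] * u_blk[idx + sh]   + b[idx] * u_blk[idx - 1 + sh];
        u_red[idx] = (1.0 - w[idx]) * u_red[idx] + w[idx] * jac;   /* eq. (21) */
    }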

GPU Kernel #2 - Redundant computation of l_{i,j}, r_{i,j}, t_{i,j}, b_{i,j}. The values of f_{i,j} and g_{i,j}, multiplied by half the respective mesh spacing, are precomputed and stored in two matrices (f′_{i,j} = ½ h f_{i,j}, g′_{i,j} = ½ k g_{i,j}) in device memory. The required terms (i.e. l_{i,j}, r_{i,j}, t_{i,j} and b_{i,j}) are regenerated from the f′_{i,j}, g′_{i,j} values during the SOR iterations. Thus, instead of 4 coefficient matrices we only need to keep 2, so that just 4 matrices in total are required to reside in device memory (i.e. u_{i,j}, ω_{i,j}, f′_{i,j} and g′_{i,j}). However, this comes at the cost of the extra operations needed to recompute the required terms on every iteration. In this case, each element requires 6 accesses and at least 11 + 4 = 15 floating point operations, as formula (3) indicates. Now the ratio is about 15/(6 × 8) ≈ 0.31 flops per byte, which is more balanced but still less than 1.0. Additionally, each GPU thread computes four elements and the kernel was manually unrolled; the coefficient regeneration is sketched below.
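The in-kernel regeneration of the four coefficients follows (3) directly. A hedged device-side sketch, with illustrative names, is shown below; the constant weights k²/(2(k² + h²)) and h²/(2(k² + h²)) can be passed as precomputed kernel arguments, leaving only a handful of extra multiply-add operations per element, in line with the count above.

    /* Sketch: recompute l, r, t, b from the precomputed f' = h*f/2 and
       g' = k*g/2 (cf. eq. (3)); ch and cv are the precomputed constant
       weights k^2/(2(k^2+h^2)) and h^2/(2(k^2+h^2)). */
    __device__ void coeffs_from_fg(double fp, double gp, double ch, double cv,
                                   double *l, double *r, double *t, double *b)
    {
        *l = ch * (1.0 + fp);
        *r = ch * (1.0 - fp);
        *t = cv * (1.0 - gp);
        *b = cv * (1.0 + gp);
    }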

GPU Kernel #3 - Redundant computation of l_{i,j}, r_{i,j}, t_{i,j}, b_{i,j}, ω_{i,j}. In this implementation, recomputation is additionally applied to the ω_{i,j} terms, following (23)-(25) (e.g. via a device routine along the lines of the lmsor_omega_real sketch of section 2). Thus, in this case 5 accesses per computed element are required. A rough estimate is that at least 15 + 24 = 39 flops are required, giving an approximate ratio of 39/(5 × 8) ≈ 0.98. Thus, this kernel seems adequately balanced, as opposed to the previous kernels. During computation the u_{i,j}, f′_{i,j}, g′_{i,j} terms are accessed from memory and all other terms are recomputed as required. The number of elements computed by each thread is now 10, which is much higher than the previously selected values; this was again found through experimentation.

The performance of the GPU version relies on the global memory cache present on Fermi GPU devices. As has been shown [10], the global memory cache offers the potential of high performance without the need to utilize special memory types (i.e. shared memory or texture memory). Additionally, it accommodates previously non-coalesced memory accesses that exhibit spatial locality. Therefore, the application is not expected to run efficiently on older hardware, e.g. GT200 based GPUs. On such architectures, an alternative approach utilizing the texture or shared memory of the device should be chosen.

4. Performance and numerical results

4.1. The 5-point LMSOR method

In order to test our theoretical results we considered the numerical solution of (1) with u = 0 on the boundary of the unit square. The initial vector was chosen as u^{(0)}(x, y) = xy(1 − x)(1 − y). The solution of the problem above is zero. For the purpose of comparison we considered the application of the LMSOR method with red-black ordering, on CPU and GPU. In all cases the iterative process was terminated when the criterion ‖u^{(n)}‖_∞ ≤ 10^{-6} was satisfied. Various functions for the coefficients f(x, y) and g(x, y) were chosen such that the eigenvalues µ_ij are either real or imaginary. The type of eigenvalues for each case is indicated by the tuple (# real, # imaginary) in the second row of each table. The coefficients used in each problem are:

1. f(x, y) = Re (2x − 10)³, g(x, y) = Re (2y − 10)³
2. f(x, y) = Re (2x − 10), g(x, y) = Re (2y − 10)
3. f(x, y) = g(x, y) = Re · 10⁴

where the Reynolds parameter Re = 10^m, with m = 0, 1, 2, 3 and 4.

All experiments were performed in a Linux environment. The CPU implementation was compiled with GCC version 4.7.1 in a 64-bit environment, with all essential optimization flags enabled (-O2 -fomit-frame-pointer -ftree-vectorize -msse2 -msse -funroll-loops -fassociative-math -fno-signed-zeros -fno-trapping-math -fno-signaling-nans). The GPU implementation was compiled using CUDA Toolkit version 4.2 and GCC version 4.1.2 in a 64-bit environment. The graphics driver version was 295.71.

The hardware used for the CPU experimental runs was an Intel Core i7 2600K (3.4GHz). For the GPU executions, an NVidia GTX480 and a Tesla C2050 [21] were used. Both GPUs are based on the Fermi architecture, featuring the global memory cache which is essential for the performance of our kernels.

Three different series of experimental runs were performed, each investigating the GPU and CPU implementations from a different aspect. The first series was performed in order to determine the most efficient of the 3 developed kernels applying the LMSOR method. The second series was performed in order to compare the GPU version with the CPU version, for the three problems, in terms of performance, for various Re values. The variation of Re causes a varying number of required iterations to meet convergence. The last series was performed to measure the performance of the GPU kernel and the CPU for one specific problem, on a wider range of mesh sizes, where the CPU version execution is heavily time-consuming.

In the results that follow, two different time measurements were carried out. The first, referred to as computation time, is the net computation time without the PCI-Express transfer overhead. The second, referred to as execution time, includes the aforementioned overhead. The function used to measure time is gettimeofday(), which is available on the Linux platform.

4.2. Three kernel comparison

All kernels were executed in solving the three aforementioned problems, on mesh sizes h = k = 1/(√N + 1) where √N = M_1 = M_2 = 402, 2002. The GPUs used in this experiment were the GTX480 and the Tesla C2050. The results of the executions for the former GPU are depicted in table 1 and for the latter in table 2.

                                            √N = 402                      √N = 2002
Problem  Experimental results          #1       #2       #3          #1       #2       #3
         (R,I)                         (0,161604)                    (0,4008004)
         Iterations                    412      412      412         1998     1998     1998
1        Computation time (secs)       0.0545   0.0462   0.0758      3.5832   2.7692   5.6599
         Total execution time (secs)   0.0591   0.0506   0.0789      3.6417   2.8104   5.6907
         Comp. time/iteration (msecs)  0.1323   0.1121   0.1839      1.7934   1.3860   2.8328
         (R,I)                         (161604,0)                    (4008004,0)
         Iterations                    554      554      554         2704     2704     2704
2        Computation time (secs)       0.0737   0.0610   0.0973      4.8972   3.7625   7.1409
         Total execution time (secs)   0.0798   0.0654   0.1009      4.9531   3.8094   7.1725
         Comp. time/iteration (msecs)  0.1330   0.1101   0.1757      1.8111   1.3914   2.6409
         (R,I)                         (0,161604)                    (0,4008004)
         Iterations                    1015     1015     1015        2018     2018     2018
3        Computation time (secs)       0.1346   0.1126   0.1870      3.6549   2.8075   5.7202
         Total execution time (secs)   0.1407   0.1171   0.1906      3.7083   2.8495   5.7495
         Comp. time/iteration (msecs)  0.1326   0.1110   0.1843      1.8111   1.3912   2.8346

Table 1: Kernel comparison in LMSOR execution on the GTX480, for √N = 402, 2002 and Re = 10.0.

Large matrices are more important, as GPUs are optimized for massive parallelism and are therefore suited for large array processing. Thus, it is sensible to focus on the case where √N = 2002.

Judging from the GTX480 results, kernel #3 presents the worst performance.

                                            √N = 402                      √N = 2002
Problem  Experimental results          #1       #2       #3          #1       #2       #3
         (R,I)                         (0,161604)                    (0,4008004)
         Iterations                    412      412      412         1998     1998     1998
1        Computation time (secs)       0.0760   0.0641   0.0742      5.3398   4.3325   4.3790
         Total execution time (secs)   0.0887   0.0734   0.0818      5.4997   4.4453   4.4677
         Comp. time/iteration (msecs)  0.1845   0.1555   0.1801      2.6726   2.1684   2.1917
         (R,I)                         (161604,0)                    (4008004,0)
         Iterations                    554      554      554         2704     2704     2704
2        Computation time (secs)       0.1024   0.0859   0.0949      7.2575   5.8540   5.7670
         Total execution time (secs)   0.1156   0.0954   0.1025      7.4173   5.9669   5.8554
         Comp. time/iteration (msecs)  0.1848   0.1550   0.1713      2.6840   2.1650   2.1328
         (R,I)                         (0,161604)                    (0,4008004)
         Iterations                    1015     1015     1015        2018     2018     2018
3        Computation time (secs)       0.1874   0.1573   0.1822      5.4144   4.3725   4.4584
         Total execution time (secs)   0.2006   0.1659   0.1898      5.5746   4.4848   4.5468
         Comp. time/iteration (msecs)  0.1847   0.1550   0.1795      2.6831   2.1668   2.2093

Table 2: Kernel comparison in LMSOR execution on the Tesla C2050, for √N = 402, 2002 and Re = 10.0.

Although its theoretical ratio of floating point operations per byte accessed is near optimal, this kernel is stressed by the heavy use of transcendental operations (i.e. sinusoids, square roots, divisions) required for the computation of the ω factors, as seen in (23), (24), (25) and (27). The values given in the previous section for theoretical peak FLOPS regard only addition and multiplication operations [17]. Transcendental operations in double precision require multiple instructions to implement and therefore provide significantly lower throughput. Additionally, the excessive use of registers lowers occupancy (table 3) from 0.67 to 0.5, and this potentially decreases the achievable bandwidth. Therefore, in practice this kernel is not the optimal one.

Kernel #2 seems to be the best performing of all.

                                               kernel #1   kernel #2   kernel #3
Profiling counters
         fb_subp0_read_sectors                 7200638     5478087     4429064
         fb_subp0_write_sectors                1001557     1001456     1008318
         fb_subp1_read_sectors                 7202176     5476972     4431012
         fb_subp1_write_sectors                1001000     1001000     1001000
         occupancy                             0.667       0.667       0.5
         registers/thread                      22          29          42
         gputime (µsecs)                       3494.3      2678.6      5144.7
Extrapolated results
         Total reads (MB)                      439.54      334.32      270.39
         Total writes (MB)                     61.11       61.11       61.32
         Total transfer (MB)                   500.65      395.43      331.71
         Accesses per element                  8           6           5
         Assumed transfer (MB)                 488.40      366.33      305.30
         Difference (%)                        2.5         7.9         8.7
         Bandwidth (GB/sec)                    139.92      144.17      62.96
         Million elements/sec                  2289.44     2986.63     1555.00

Table 3: Profiling of one iteration of the red elements calculation on the GTX480, for √N = 4002.

Although kernel #2 executes more operations per computed element, it actually performs better (≈ 30%) than the first one, revealing the memory throughput bottleneck in this program. The extra operations needed, as seen in (3), are just additions and multiplications, which can be efficiently translated into single instructions. On the C2050 the improvement of kernel #2 was slightly lower (≈ 24%). Another observation in the Tesla GPU results (table 2) is that kernel #3 is only slightly slower than kernel #2, and in the 2nd case even faster, due to the Tesla's higher double precision operation throughput.

Another advantage of kernel #3 is its limited memory access requirement. Kernel #1 makes use of 6 √N × √N matrices, one for each mesh-wide quantity. Kernel #2 makes use of 4 matrices of the same order and kernel #3 of only 3. This makes the latter suitable for solving larger problems when memory size is a critical limitation.

Due to the different memory access requirements of the 3 kernels, inspecting the effective bandwidth alone can lead to misleading conclusions about the performance of each kernel. As can be seen in table 3, profiling data were captured during one iteration of the computation of red elements, for √N = 4002. The bytes accessed by each kernel were estimated using the first four performance counters [19]. For kernel #1 the achieved effective bandwidth was estimated to be almost 140 GB/sec, computing about 2290 million elements/sec. Kernel #2 reached nearly 3000 million elements/sec. In contrast, kernel #3 suffers from high instruction execution pressure. Each kernel is characterized by different memory bandwidth requirements and thus bandwidth cannot be used as a direct comparison measure; pure bandwidth does not expose the actual performance of these kernels.

4.3. CPU - GPU comparison

In this series of executions the GPU kernel #2 and the CPU program were compared for various Re values. The matrix order √N was kept constant (√N = 1002) and the program was executed for Re = 1000.0 and 100000.0. The results are depicted in table 4. The GPU reaches a speedup factor of ×8.5 over the single threaded CPU version. The OpenMP version presents a ×1.25 speedup over the single threaded version due to the memory bandwidth bottleneck. At this problem size the program is memory bound for the CPU, as the working set does not fit in cache memory (4 × 1002² × 8 ≈ 32MB). The GPU shows stable performance behavior, requiring less than 0.4 milliseconds per iteration.

                                             Re = 1000.0                           Re = 100000.0
Problem  Experimental results          CPU (seq)  CPU (par)  GPU           CPU (seq)  CPU (par)  GPU
         (R,I)                         (0,1004004)                         (0,1004004)
         Iterations                    2620       2620       2620          6243       6243       6243
1        Computation time (secs)       8.9637     7.1538     1.0394        21.3430    17.0938    2.4791
         Total execution time (secs)   8.9637     7.1538     1.0520        21.3430    17.0938    2.4895
         Comp. time/iteration (msecs)  3.4213     2.7305     0.3967        3.4187     2.7381     0.3971
         Computation speedup           1.0000     1.2530     8.6240        1.0000     1.2486     8.6093
         (R,I)                         (0,1004004)                         (0,1004004)
         Iterations                    1003       1003       1003          3170       3170       3170
2        Computation time (secs)       3.4044     2.7416     0.3975        11.1013    8.6644     1.2591
         Total execution time (secs)   3.4044     2.7416     0.4099        11.1013    8.6644     1.2695
         Comp. time/iteration (msecs)  3.3942     2.7334     0.3963        3.5020     2.7333     0.3972
         Computation speedup           1.0000     1.2417     8.5647        1.0000     1.2813     8.8167
         (R,I)                         (0,1004004)                         (0,1004004)
         Iterations                    5514       5514       5514          7034       7034       7034
3        Computation time (secs)       18.8657    15.0714    2.1913        23.9323    19.2778    2.7934
         Total execution time (secs)   18.8657    15.0714    2.2039        23.9323    19.2778    2.8060
         Comp. time/iteration (msecs)  3.4214     2.7333     0.3974        3.4024     2.7407     0.3971
         Computation speedup           1.0000     1.2518     8.6092        1.0000     1.2414     8.5674

Table 4: CPU (sequential and parallel versions) and GPU (GTX480) comparison for the three investigated problems, for √N = 1002 and various values of Re.

4.4. CPU - GPU scalability

The CPU and GPU versions were executed for a wider range of mesh sizes, with √N = 402, 1002, 2002, 3002, 4002, for the 2nd problem and Re = 10.0. The results are depicted in table 5. The observed speed-up increases further as √N obtains higher values. For the mesh size with √N = 4002, the speed-up exceeds ×10. The GTX480 needs just 29.03 seconds to execute 5406 iterations on that mesh, which is nearly 5.4 milliseconds per iteration. This rate reaches nearly 3 billion elements computed per second. These numbers include the time required for checking the convergence criterion. The rate of element computation per second and the speed-up observed for the GPU computation times are summarized in figure 1.

The C2050, although targeted at HPC environments, lacks the high bandwidth of the GTX480. Additionally, as the Tesla's ECC protection was enabled, the memory bandwidth was further stressed by roughly 20% [20]. Thus, the performance results are lower on the C2050 than on the GTX480, which does not feature ECC memory.

The parallel CPU version achieves about 370 million elements/sec on the √N = 4002 problem, which corresponds to a bandwidth of 6 × 8 × 370 ≈ 16.5 GB/sec. On the √N = 402 problem the throughput is much higher (≈ 57 GB/sec), as the working set (4 × 402² × 8 ≈ 5MB) fits in cache memory (8MB L3 cache) and the CPU is able to compute many more elements/sec. Therefore, in this case the speedup of the OpenMP version over the sequential one is ×3.3 on this 4-core CPU.

Matrix         Total       (R,I)          Model  Execution    Computation  Million elements  Computation
√N × √N        Iterations                        time (secs)  time (secs)  computed/sec      speed-up
402 × 402      554         (161604,0)     (a)    0.23         0.23         386.21            1.00
                                          (b)    0.07         0.07         1269.93           3.29
                                          (c)    0.07         0.06         1456.39           3.77
                                          (d)    0.10         0.09         1024.37           2.66
1002 × 1002    1384        (1004004,0)    (a)    4.73         4.73         292.52            1.00
                                          (b)    3.79         3.79         365.08            1.25
                                          (c)    0.56         0.55         2515.71           8.60
                                          (d)    0.86         0.83         1661.71           5.68
2002 × 2002    2704        (4008004,0)    (a)    36.68        36.68        294.86            1.00
                                          (b)    29.43        29.43        367.55            1.25
                                          (c)    3.80         3.76         2875.14           9.75
                                          (d)    5.95         5.85         1847.45           6.27
3002 × 3002    4055        (9012004,0)    (a)    126.39       126.39       288.74            1.00
                                          (b)    98.97        98.97        368.75            1.28
                                          (c)    12.40        12.32        2963.09           10.26
                                          (d)    19.73        19.53        1868.75           6.47
4002 × 4002    5406        (16016004,0)   (a)    300.43       300.43       287.91            1.00
                                          (b)    235.23       235.23       367.71            1.28
                                          (c)    29.18        29.03        2979.09           10.35
                                          (d)    46.77        46.43        1863.02           6.47

Table 5: Various executions for the 2nd problem, on (a) the sequential and (b) the parallel CPU implementation on an Intel Core i7 2600K, (c) the NVidia GTX480 (kernel #2) and (d) the Tesla C2050 (kernel #2), for mesh sizes with √N = 402, 1002, 2002, 3002, 4002 and Re = 10.0.

[Figure 1: Million elements computed per second on CPU & GPUs (top) and computation speed-up of GPUs over CPU (bottom) for different matrices; curves shown for CPU - Sequential, CPU - OpenMP, GTX480 and C2050, with √N on the horizontal axis.]


5. Remarks and Conclusions

The GPU is a suitable platform for massively parallel computations like those provided by the red/black ordering of iterative methods for solving systems of linear equations. In order to achieve memory coalescing, locality of accesses must be ensured; thereafter, the high memory bandwidth of the GPU can be exploited to attain high performance.

Recomputation can be beneficial in cases where memory accessing becomes a bottleneck, for both GPUs and CPUs. Instead of keeping the processing units idle, one strategy is to recompute data in order to avoid multiple memory accesses. This is a tradeoff: in many cases where a kernel is bandwidth limited, compute resources can be traded for a lower demand in memory bandwidth. It is applicable when at most a few operations are required for recomputation, so that computation does not turn into a bottleneck. It can provide a performance speed-up and, moreover, it can release portions of device memory, allowing larger problems to be solved.

Even in cases where recomputation is applied excessively, although performance worsens, there can be other benefits. Recomputation leaves more memory available for other uses and thus allows a bigger problem to be solved. The size of the problem to be solved can therefore determine the appropriate kernel to use.

Acknowledgments.

We would like to acknowledge the kind permission of the Innovative Computing Laboratory at the University of

Tennessee to use their NVidia Tesla C2050 installation for the purpose of this work.

References

[1] L. M. Adams, R. J. Leveque and D. Young, "Analysis of the SOR iteration for the 9-point Laplacian", SIAM J. Num. Anal., vol. 9, pp. 1156-1180, 1988.
[2] L. A. Boukas and N. M. Missirlis, "The Parallel Local Modified SOR for Nonsymmetric Linear Systems", Intern. J. Computer Math., vol. 68, pp. 153-174, 1998.
[3] E. F. Botta and A. E. P. Veldman, "On local relaxation methods and their application to convection-diffusion equations", J. Comput. Phys., vol. 48, pp. 127-149, 1981.
[4] Y. Cotronis, E. Konstantinidis, M. A. Louka and N. M. Missirlis, "Parallel SOR for solving the Convection Diffusion equation using GPUs with CUDA", Euro-Par 2012 Parallel Processing, Lecture Notes in Computer Science, vol. 7484, pp. 575-586, Rhodes, 2012.
[5] L. W. Ehrlich, "An Ad-Hoc SOR Method", J. Comput. Phys., vol. 42, pp. 31-45, 1981.
[6] L. W. Ehrlich, "The Ad-Hoc SOR method: A local relaxation scheme", in Elliptic Problem Solvers II, Academic Press, New York, pp. 257-269, 1984.
[7] L. Ha, J. Kroger, S. Joshi and C. T. Silva, "Multiscale Unbiased Diffeomorphic Atlas Construction on Multi-GPUs", GPU Computing Gems, Morgan Kaufmann, 2011.
[8] L. A. Hageman and D. M. Young, "Applied Iterative Methods", Academic Press, New York, 1981.
[9] K. Komatsu, T. Soga, R. Egawa, H. Takizawa, H. Kobayashi, S. Takahashi, D. Sasaki and K. Nakahashi, "Parallel processing of the Building-Cube Method on a GPU platform", Computers & Fluids, in press, 2011.
[10] E. Konstantinidis and Y. Cotronis, "Accelerating the red/black SOR method using GPUs with CUDA", 9th International Conference on Parallel Processing and Applied Mathematics, Torun, Part I, LNCS 7203, pp. 589-598, 2011.
[11] C.-C. J. Kuo, B. Levy and B. R. Musicus, "A Local Relaxation Method for solving elliptic PDEs on mesh-connected arrays", SIAM J. Sci. Stat. Comput., vol. 8, no. 4, 1987.
[12] R. J. Leveque and L. N. Trefethen, "Fourier analysis of the SOR iteration", ICASE-NASA Langley Research Center, Report 86-93, Hampton, 1986.
[13] J. Nickolls, I. Buck, M. Garland and K. Skadron, "Scalable Parallel Programming with CUDA", ACM SIGGRAPH 2008 classes, vol. 16, pp. 1-14, 2008.
[14] J. M. Ortega and R. G. Voight, "Solution of Partial Differential Equations on Vector and Parallel Computers", SIAM, Philadelphia, 1985.
[15] R. S. Varga, "Matrix Iterative Analysis", Prentice-Hall, Englewood Cliffs, NJ, 1962.
[16] D. M. Young, "Iterative Solution of Large Linear Systems", Academic Press, New York, 1971.
[17] "NVidia CUDA Reference Manual v. 4.0", NVidia, 2011.
[18] "NVidia CUDA C Best Practices Guide Version 4.0", NVidia, 2011.
[19] "Compute Visual Profiler User Guide", NVidia, 2011.
[20] "Tuning CUDA Applications for Fermi", NVidia, 2011.
[21] "Tesla C2050 And Tesla C2070 Computing Processor Board", NVidia, 2011.
[22] "The OpenCL Specification", Khronos Group, 2009.
[23] "OpenMP Application Program Interface version 3.0", OpenMP Architecture Review Board, 2008.
