
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2009; 21:115–131
Published online 6 June 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1324

Design and implementation of parallel approximate inverse classes using OpenMP

Konstantinos M. Giannoutakis and George A. Gravvanis∗,†

Department of Electrical and Computer Engineering, School of Engineering, Democritus University of Thrace, 12, Vas. Sofias Street, GR 671 00 Xanthi, Greece

SUMMARY

A new parallel normalized optimized approximate inverse algorithm, based on the concept of the antidiagonal wave pattern, for computing classes of explicit approximate inverses, is introduced for symmetric multiprocessor systems. The parallel normalized explicit approximate inverses are used in conjunction with parallel normalized explicit preconditioned conjugate gradient schemes for the efficient solution of finite element sparse linear systems. The parallel design and implementation issues of the new algorithm are discussed and the parallel performance is presented using OpenMP. Copyright © 2008 John Wiley & Sons, Ltd.

Received 5 June 2007; Revised 10 February 2008; Accepted 10 February 2008

KEY WORDS: sparse linear systems; normalized approximate inverse algorithm; preconditioning; parallel approximate inverses; parallel preconditioned conjugate gradient method; parallel computations; symmetric multiprocessor systems; OpenMP

1. INTRODUCTION

Let us consider the linear system derived from the discretization of many scientific and engineering three-dimensional (3D) problems by the finite element (FE) method, i.e.

Au = s    (1)

where the coefficient matrix A is a non-singular, large, sparse, symmetric, positive-definite, diagonally dominant (n × n) matrix of irregular structure (where all the off-centre band terms are grouped into regular bands, i.e. submatrix V of width ℓ1 at semi-bandwidth m and submatrix W of width ℓ2 at semi-bandwidth p, as can be seen in Figure 1), u is the FE solution at the nodal points and s is a vector whose components result from a combination of source terms and imposed boundary conditions.

∗Correspondence to: George A. Gravvanis, Department of Electrical and Computer Engineering, School of Engineering, Democritus University of Thrace, 12, Vas. Sofias Street, GR 671 00 Xanthi, Greece.

†E-mail: [email protected]

Contract/grant sponsor: Science Foundation Ireland

In some applications, iterative methods often fail and preconditioning is necessary, though not always sufficient, to attain convergence in a reasonable amount of time. The term preconditioning refers to transforming the linear system (1) into another system with more favourable properties for iterative solution, such as a smaller spectral condition number and better convergence behaviour, cf. [1–4]. The preconditioned form of the linear system (1) is

MAu = Ms    (2)

where M is a suitable preconditioner, cf. [5,6]. The preconditioner M therefore has to satisfy the following conditions: (i) MA should have a 'clustered' spectrum, (ii) M can be efficiently computed in parallel and (iii) finally 'M × vector' should be fast to compute in parallel, cf. [3,7,8].

Discussions on the form of M derived from splitting techniques and incomplete (IC, ILU and MILU and variants) factorization techniques (based on modifications of Gaussian elimination) have been presented by many researchers, but it is difficult to implement them on parallel systems, cf. [3,4,9,10]. The level-scheduling or wavefront approach has been used to eliminate the implicitness but was found to be of limited potential, cf. [3,4,10]. In the case of polynomial preconditioners, although they have inherent parallelism, they do not improve considerably the rate of convergence and are of limited use today, cf. [3,11]. Similarly, red–black ordering for well-structured problems has not been found to be successful, cf. [3]. Recently, sparse approximate inverse preconditioning has been introduced, based on factorized sparse approximate inverses or on the minimization of some convenient norm, cf. [12–16]. Additionally, approximate inverses based on incomplete factors have also been introduced. It should be noted that sparse approximate inverses obtained by minimizing the Frobenius norm of the error have been presented and can be implemented on parallel systems, cf. [7,8,17].

Figure 1. Coefficient matrix A.


Our main motive for the derivation of the new normalized approximate inverse FE matrix techniques is that they can be efficiently used in conjunction with normalized explicit preconditioned conjugate gradient (NEPCG) schemes on symmetric multiprocessor systems. It is well known that preconditioned conjugate gradient schemes, based on IC or ILU type factorization methods, involve forward–backward substitution, which does not parallelize easily, cf. [4,10]. Hence, the important feature of the proposed parallel approximate inverse preconditioning is that the approximate inverse is computed explicitly and in parallel, eliminating the forward–backward substitution. Additionally, the dominant computational work of the preconditioned conjugate gradient schemes consists of vector × vector and banded matrix × vector operations, which parallelize efficiently on symmetric multiprocessor systems, cf. [2,18,19].

The challenge encountered when computing parallel approximate inverses lies in their internal data dependencies, which create both a critical path and an order of computations, such that any computational strategy adopted should abide by those dependencies. For the parallel construction of the normalized approximate inverse preconditioner, a transformation of the sequential 'fish-bone' pattern to an antidiagonal wave pattern has been carried out in order to overcome the data dependencies. The elements located on an antidiagonal are independent, as the computation of each element requires at least its right element, and they can be computed concurrently by the available processors. Thus, a consecutive antidiagonal movement through the banded matrix eliminates all dependencies.

The computation of each antidiagonal is assigned to the available processors in continuous blocks of elements. The degree of parallelism depends on the 'retention' parameter (the length of the antidiagonal) and the number of processors. By increasing the value of the 'retention' parameter, the workload per process overcomes the parallelization overheads, and the obtained speedups tend to the upper theoretical bound.

The inherently parallel linear operations between vectors and matrices involved in the NEPCG schemes exhibit significant amounts of loop-level parallelism that can lead to high performance gains on shared address space systems, cf. [2,18,19].

For the implementation of the parallel programs, the OpenMP application programming interface has been used. OpenMP has emerged as a shared-memory programming standard and consists of compiler directives and functions supporting both data and functional parallelism. The parallel for pragma with static scheduling has been used for the parallelization of loops in both the construction of the approximate inverse and the conjugate gradient scheme.

In Section 2, a new parallel FE normalized approximate inverse algorithm is introduced, based on a wave pattern approach. In Section 3, a parallel normalized preconditioned conjugate gradient method for solving sparse FE systems is presented. Finally, in Section 4, the performance and applicability of the new proposed parallel approximate inverse preconditioning methods are discussed and the parallel performance on symmetric multiprocessor systems is given using OpenMP.

2. PARALLEL NORMALIZED APPROXIMATE INVERSES

In this section we present a parallel implementation of the FE normalized approximate inverse algorithm for solving linear systems on symmetric multiprocessor systems.


Let us assume the normalized sparse approximate factorization, cf. [20,21], of the coefficient matrix A, such that

A ≈ D_{r1,r2} T^t_{r1,r2} T_{r1,r2} D_{r1,r2},    r1 ∈ [1, ..., m − 1), r2 ∈ [1, ..., p − 1)    (3)

where r1 and r2 are the 'fill-in' parameters, i.e. the number of outermost off-diagonal entries retained at semi-bandwidths m and p, D_{r1,r2} is a diagonal matrix

D_{r1,r2} ≡ diag(d_1, ..., d_{m−1}; d_m, ..., d_{p−1}; d_p, ..., d_n)    (4)

and T_{r1,r2} is a sparse upper (with unit diagonal elements) triangular matrix (Figure 2) of the same profile as the coefficient matrix A. The elements of the decomposition factors were computed by the FEANOF-3D algorithm, cf. [20]. The factorization procedure requires the storage of submatrices H and F as described in [20].

Let M^{δl}_{r1,r2} = D^{−1}_{r1,r2} M̂^{δl}_{r1,r2} D^{−1}_{r1,r2} = (μ_{i,j}), i ∈ [1, n], j ∈ [max(1, i − δl + 1), min(n, i + δl − 1)], be the banded form of the normalized approximate inverse of the coefficient matrix A. The elements of the banded form of the approximate inverse can be determined by retaining a certain number of elements of the inverse, i.e. only δl elements in the lower part and δl − 1 elements in the upper part of the inverse next to the main diagonal. Then the elements can be computed by solving recursively the following systems:

M̂^{δl}_{r1,r2} T^t_{r1,r2} = (T_{r1,r2})^{−1}   and   T_{r1,r2} M̂^{δl}_{r1,r2} = (T^t_{r1,r2})^{−1}    (5)

without inverting the decomposition factors using the normalized banded approximate inverse FE matrix algorithmic procedure (henceforth called the NORBAIFEM-3D algorithm).

Figure 2. Upper triangular matrix T_{r1,r2}.

The computational work of the NORBAIFEM-3D algorithm is O[n δl (r1 + r2 + ℓ1 + ℓ2 + 1)] multiplicative operations, whereas the storage of the approximate inverse in only n × (2δl − 1) vector elements is achieved using an optimized storage scheme, based on a moving window shifted from bottom to top, cf. [20,22]. The optimized NORBAIFEM-3D algorithm (henceforth called the NOROBAIFEM-3D algorithm) is particularly effective for solving 'narrow-banded' sparse systems of very large order, i.e. δl ≪ n/2.

It should be noted that this class of normalized approximate inverse includes various families of

approximate inverses according to the requirements of accuracy, storage and computational work, as can be seen by the following diagrammatic relation, cf. [20,21]:

              class I                                                      class II
A^{−1} ← D^{−1}_{r1,r2} M̂̃^{δl}_{r1=m−1, r2=p−1} D^{−1}_{r1,r2} ← D^{−1}_{r1,r2} M̂^{δl}_{r1=m−1, r2=p−1} D^{−1}_{r1,r2}

              class III                                                    class IV
       ← D^{−1}_{r1,r2} M̂^{δl}_{r1,r2} D^{−1}_{r1,r2}              ← D^{−2}_{r1,r2}                                    (6)

where the entries of the class I inverse have been retained after the computation of the exact inverse (r1 = m − 1, r2 = p − 1); the entries of the class II inverse have been computed and retained during the computational procedure of the (approximate) inverse (r1 = m − 1, r2 = p − 1); and the entries of the class III inverse have been retained after the computation of the approximate inverse (r1 ≤ m − 1, r2 ≤ p − 1). Hence, an approximate inverse is derived in which the sparseness of the coefficient matrix is relatively retained and the storage requirements are substantially reduced; the sparsity and the storage requirements of the approximate inverse for n = 8000, m = 21 and p = 401 are shown in Table I. The class IV approximate inverse retains only the diagonal elements, i.e. δl = 1, hence M̂^{δl}_{r1,r2} ≡ I, resulting in a fast inverse algorithm.

It should be noted that the convergence behaviour of the normalized approximate inverse preconditioning and the performance of the FEANOF-3D and NOROBAIFEM-3D algorithms and the NEPCG-type methods have been presented in [21].

The challenge of computing parallel approximate inverses is to overcome the data dependencies, which create a critical path and an order of computations; hence, any parallel approximate inverse matrix algorithm must abide by those dependencies in order to avoid any data loss.

For the parallelization of the NOROBAIFEM-3D algorithm, an antidiagonal motion (wave-like pattern), starting from the element μ̂_{n,n} down to μ̂_{1,1}, has been used, because of the dependency of the elements of the inverse during its construction. More specifically, any element within the banded approximate inverse requires its corresponding right or lower element to be computed first. This sequence of computations, without any loss of generality and for simplicity reasons, is shown for the normalized banded approximate inverse in Figure 3 (with n = 8, δl = 4). The values in the parentheses at the superscript of each element (e.g. μ̂^{(k)}_{i,j}) indicate that the element μ̂_{i,j} was computed at the (k)th sequential step of the algorithm (kth antidiagonal), and the elements with the same superscript (k) were computed concurrently. It should be noted that due to the data dependencies, for δl = 1, 2 the parallel algorithm will execute sequentially.

For the parallel construction of the optimized form of the approximate inverse, as diagrammatically shown in Figure 3, a simple transformation of the indexes of the elements of the approximate inverse is used, cf. [22].

Table I. The storage requirements and the sparsity of the approximate inverse for n = 8000, m = 21 and p = 401.

                δl=1    δl=2    δl=m/2   δl=m    δl=2m   δl=p/2   δl=p    δl=2p   δl=4p   δl=6p
Vectors           0       3       19       41      83      399      801     1603    3207    4811
Sparsity (%)    100     99.9     99.8     99.5    99.0     95.0     90.0     80.0    59.9    39.9
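Note that the sparsity figures follow directly from the banded profile of the inverse: since at most 2δl − 1 entries are retained per row, the percentage of zero entries is approximately 100(1 − (2δl − 1)/n); for example, for δl = p = 401 and n = 8000 this gives 100(1 − 801/8000) ≈ 90.0%, in agreement with Table I.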

Figure 3. The sequence of computations for the approximate inverse.

Figure 4. The four computational zones of the NOROBAIFEM-3D algorithm.


As the approximate inverse is symmetric, the elements to be computed are limited to at most ⌈δl/2⌉ on each antidiagonal, by assigning μ̂_{j,i} = μ̂_{i,j}, or equivalently μ̂_{n−i+1, δl+i−j} = μ̂_{n−i+1, i−j+1} for the optimized form.

Let us consider that the command forall denotes the parallel for instruction (forks/joins threads) for executing parallel loops. Then, the algorithm for the implementation of the Parallel ANti Diagonal NOROBAIFEM-3D algorithm (henceforth called the PAND-NOROBAIFEM-3D algorithm), on symmetric multiprocessor systems, can be described as follows:

\\ lower triangle-shaped zone
for k = 1 to δl
    forall λ = 1 to ⌈k/2⌉
        call inverse(n − λ + 1, n − k + λ)
m = 2
\\ middle antidiagonal length zone
for k = (δl + 1) to n
    forall λ = m to ⌈k/2⌉
        call inverse(n − λ + 1, n − k + λ)
    if (k − δl) mod 2 = 0 then
        m = m + 1
m = m − 1
for k = (n − 1) downto (δl + 1)
    forall λ = m to ⌈k/2⌉
        call inverse(λ, k − λ + 1)
    if (k − δl) mod 2 = 1 then
        m = m − 1
\\ upper triangle-shaped zone
for k = δl downto 1
    forall λ = 1 to ⌈k/2⌉
        call inverse(λ, k − λ + 1)

This algorithm implements the antidiagonal motion through the banded form of the approximate inverse. The elements of the approximate inverse matrix can be divided into three distinct zones, the lower and the upper triangle-shaped zones and the middle antidiagonal length zone, as shown in Figure 3. The function inverse(i, j) computes the element μ̂_{i,j} according to the NOROBAIFEM-3D algorithm. This algorithm is divided into four zones according to the index j of the element to be computed, as shown in Figure 4.
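To illustrate how the forall construct maps onto OpenMP on a symmetric multiprocessor, the following C fragment sketches the sweep over the antidiagonals of the lower triangle-shaped zone; the variable names (n, dl for the 'retention' parameter δl, lambda) follow the pseudocode above, and the fragment is an illustrative sketch rather than the authors' implementation.

/* Sketch: OpenMP realization of the forall construct for the lower
   triangle-shaped zone.  Each parallel for forks/joins threads for one
   antidiagonal, and its implicit barrier provides the synchronization
   required before the next antidiagonal is started. */
int k, lambda;
for (k = 1; k <= dl; k++) {
    #pragma omp parallel for schedule(static) private(lambda)
    for (lambda = 1; lambda <= (k + 1) / 2; lambda++)    /* ceil(k/2) */
        inverse(n - lambda + 1, n - k + lambda);
}

The middle and upper zones are handled analogously, with the loop bounds of the corresponding forall statements.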

Function inverse(i, j).
Let
    rℓ1 = r1 + ℓ1;  rℓ2 = r2 + ℓ2;  rℓ11 = rℓ1 − 1;  rℓ21 = rℓ2 − 1;  mr1 = m − r1;
    pr2 = p − r2;  mℓ1 = m + ℓ1;  pℓ2 = p + ℓ2;  nmr1 = n − m + r1;  npr2 = n − p + r2    (7)

if i ≥ j then
    \\ Computation of the elements on and below the main diagonal
    if j > nmr1 then
        \\ Computation of the elements within zone 1
        if i = j then
            \\ Computation of the elements on the diagonal of zone 1
            if i = n then
                μ̂_{1,1} = 1    (8)
            else
                μ̂_{n−i+1,1} = 1 − g_j · μ̂_{n−j, δl+1}    (9)
        else
            \\ Computation of the elements below the diagonal of zone 1
            μ̂_{n−i+1, i−j+1} = −g_j · μ̂_{n−i+1, i−j}    (10)
    else if j > npr2 and j ≤ nmr1 then
        \\ Computation of the elements within zone 2
        if i = j then
            \\ Computation of the elements on the diagonal of zone 2
            μ̂_{n−i+1,1} = 1 − g_j · μ̂_{n−j, δl+1} − Σ_{k=0}^{nmr1−j} h_{rℓ11−k, j+k+1−r1} · μ̂_{x,y},    (11)
                with (x, y) from call mw(n, δl, i, j + mr1 + k, x, y)
        else
            \\ Computation of the elements below the diagonal of zone 2
            μ̂_{n−i+1, i−j+1} = −g_j · μ̂_{n−i+1, i−j} − Σ_{k=0}^{nmr1−j} h_{rℓ11−k, j+k+1−r1} · μ̂_{x,y},    (12)
                with (x, y) from call mw(n, δl, i, j + mr1 + k, x, y)
    else if j ≥ rℓ1 and j ≤ npr2 then
        \\ Computation of the elements within zone 3
        if i = j then
            \\ Computation of the elements on the diagonal of zone 3
            μ̂_{n−i+1,1} = 1 − g_j · μ̂_{n−j, δl+1} − Σ_{k=0}^{npr2−j} f_{rℓ21−k, j+k+1−r2} · μ̂_{x1,y1}
                          − Σ_{k=0}^{nmr1−j} h_{rℓ11−k, j+k+1−r1} · μ̂_{x2,y2},    (13)
                with (x1, y1) from call mw(n, δl, i, j + k + pr2, x1, y1)
                and (x2, y2) from call mw(n, δl, i, j + k + mr1, x2, y2)
        else
            \\ Computation of the elements below the diagonal of zone 3
            μ̂_{n−i+1, i−j+1} = −g_j · μ̂_{n−i+1, i−j} − Σ_{k=0}^{npr2−j} f_{rℓ21−k, j+k+1−r2} · μ̂_{x1,y1}
                              − Σ_{k=0}^{nmr1−j} h_{rℓ11−k, j+k+1−r1} · μ̂_{x2,y2},    (14)
                with (x1, y1) from call mw(n, δl, i, j + k + p − r2, x1, y1)
                and (x2, y2) from call mw(n, δl, i, j + k + m − r1, x2, y2)
    else
        \\ Computation of the elements within zone 4
        if i = j then
            \\ Computation of the elements on the diagonal of zone 4
            if i = 1 then
                μ̂_{n,1} = 1 − g_1 · μ̂_{n−1, δl+1} − Σ_{k=1}^{ℓ2} f_{1,k} · μ̂_{x1,y1} − Σ_{k=1}^{ℓ1} h_{1,k} · μ̂_{x2,y2},    (15)
                    with (x1, y1) from call mw(n, δl, 1, p + k − 1, x1, y1)
                    and (x2, y2) from call mw(n, δl, 1, m + k − 1, x2, y2)
            else
                μ̂_{n−i+1,1} = 1 − g_j · μ̂_{n−j, δl+1} − Σ_{k=1}^{j−1} h_{j−k, ℓ1+k} · μ̂_{x1,y1} − Σ_{k=j+1−r1}^{ℓ1} h_{j,k} · μ̂_{x2,y2}
                              − Σ_{k=1}^{j−1} f_{j−k, ℓ2+k} · μ̂_{x3,y3} − Σ_{k=j+1−r2}^{ℓ2} f_{j,k} · μ̂_{x4,y4},    (16)
                    with (x1, y1) from call mw(n, δl, i, mℓ1 + k − 1, x1, y1),
                    (x2, y2) from call mw(n, δl, i, m + k − 1, x2, y2),
                    (x3, y3) from call mw(n, δl, i, pℓ2 + k − 1, x3, y3)
                    and (x4, y4) from call mw(n, δl, i, p + k − 1, x4, y4)
        else
            \\ Computation of the elements below the diagonal of zone 4
            μ̂_{n−i+1, i−j+1} = −g_j · μ̂_{n−i+1, i−j} − Σ_{k=1}^{j−1} h_{j−k, ℓ1+k} · μ̂_{x1,y1} − Σ_{k=j+1−r1}^{ℓ1} h_{j,k} · μ̂_{x2,y2}
                              − Σ_{k=1}^{j−1} f_{j−k, ℓ2+k} · μ̂_{x3,y3} − Σ_{k=j+1−r2}^{ℓ2} f_{j,k} · μ̂_{x4,y4},    (17)
                    with (x1, y1) from call mw(n, δl, i, mℓ1 + k − 1, x1, y1),
                    (x2, y2) from call mw(n, δl, i, m + k − 1, x2, y2),
                    (x3, y3) from call mw(n, δl, i, pℓ2 + k − 1, x3, y3)
                    and (x4, y4) from call mw(n, δl, i, p + k − 1, x4, y4)
if i <> j then
    \\ Computation of the elements above the main diagonal
    μ̂_{n−i+1, δl+i−j} = μ̂_{n−i+1, i−j+1}    (18)


The procedure mw(n, δl, s, q, x, y), cf. [22], reduces the memory requirements of the approximate inverse to only n × (2δl − 1) vector spaces and can be described as follows:

procedure mw(n, δl, s, q, x, y)
    if s ≥ q then
        x = n + 1 − s;  y = s − q + 1    (19)
    else
        x = n + 1 − q;  y = δl + q − s    (20)
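For reference, this index mapping can be written as a small C helper; the C signature with output pointers is an illustrative choice, and am denotes the optimized n × (2δl − 1) storage array, as in the OpenMP code of Section 4.

/* Sketch of the moving-window mapping: element (s, q) of the approximate
   inverse is addressed as am[x][y] in the optimized n x (2*dl - 1) array
   (array layout assumed for illustration). */
void mw(int n, int dl, int s, int q, int *x, int *y)
{
    if (s >= q) {            /* on or below the main diagonal */
        *x = n + 1 - s;
        *y = s - q + 1;
    } else {                 /* above the main diagonal */
        *x = n + 1 - q;
        *y = dl + q - s;
    }
}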

The computational process is logically divided into 2n − 1 sequential steps representing the 2n − 1 antidiagonals in a matrix of order n, while synchronization between processes is needed after the computation of each antidiagonal to ensure that the elements of the matrix are correctly computed. The workload on each antidiagonal varies between 1 and δl elements for the lower and upper triangle-shaped zones, whereas for the middle antidiagonal length zone the workload interchanges between δl − 1 and δl elements, see Figure 3. Thus, the parallel computational complexity for the lower or the upper triangle-shaped zones is

    Σ_{i=1}^{δl−1} ⌈⌈i/2⌉ / no_proc⌉ O(r1 + r2 + ℓ1 + ℓ2 + 1)

multiplications, whereas for the middle zone it is (2n − 2(δl − 1) − 1) ⌈⌈δl/2⌉ / no_proc⌉ O(r1 + r2 + ℓ1 + ℓ2 + 1) multiplications, where no_proc denotes the number of processors. As the elements of each antidiagonal are partitioned between the processors (no_proc), adding more processors results in finer granularity, which can lead to low efficiencies on some platforms where excessively fine granularity makes it harder to amortize the parallelization overheads.
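Under this no-latency model, a theoretical speedup estimate can be formed as the ratio of the total number of element computations to the number of parallel element steps, since the per-element cost O(r1 + r2 + ℓ1 + ℓ2 + 1) is common to both counts. The following C sketch is one possible way to carry out such a calculation; the function name, the integer ceilings and the approximation of ⌈δl/2⌉ elements per middle-zone antidiagonal are illustrative assumptions.

#include <stdio.h>

/* Sketch: theoretical speedup estimate for the antidiagonal sweep,
   assuming no synchronization or scheduling latencies. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

static double speedup_estimate(long n, long dl, long no_proc)
{
    long seq = 0, par = 0, i;
    /* lower and upper triangle-shaped zones */
    for (i = 1; i <= dl - 1; i++) {
        seq += 2 * ceil_div(i, 2);
        par += 2 * ceil_div(ceil_div(i, 2), no_proc);
    }
    /* middle antidiagonal length zone: 2n - 2(dl - 1) - 1 antidiagonals,
       roughly ceil(dl/2) elements on each */
    long mid = 2 * n - 2 * (dl - 1) - 1;
    seq += mid * ceil_div(dl, 2);
    par += mid * ceil_div(ceil_div(dl, 2), no_proc);
    return (double)seq / (double)par;
}

int main(void)
{
    /* e.g. n = 8000, dl = p/2 = 200, eight processors */
    printf("estimated speedup: %.2f\n", speedup_estimate(8000, 200, 8));
    return 0;
}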

3. PARALLEL NORMALIZED PRECONDITIONED CONJUGATE GRADIENT METHOD

In this section we present a class of parallel NEPCG (PNEPCG) methods based on the derived parallel optimized approximate inverse, designed for symmetric multiprocessor systems.

The PNEPCG algorithm for solving linear systems can then be described as follows. Let u_0 be an arbitrary initial approximation to the solution vector u. Then,

forall j = 1 to n
    (r_0)_j = s_j − (A u_0)_j    (21)
if δl = 1 then
    forall j = 1 to n
        (r*_0)_j = (r_0)_j / (d^2)_j    (22)
else
    forall j = 1 to n
        (r*_0)_j = ( Σ_{k=max(1, j−δl+1)}^{j} μ̂_{n+1−j, j+1−k} (r_0)_k / d_k
                   + Σ_{k=j+1}^{min(n, j+δl−1)} μ̂_{n+1−k, δl+k−j} (r_0)_k / d_k ) / d_j    (23)
σ_0 = r*_0    (24)
forall j = 1 to n (reduction + p_0)
    p_0 = (r_0)_j ∗ (r*_0)_j    (25)

Then, for i = 0, 1, ..., (until convergence) compute the vectors u_{i+1}, r_{i+1}, σ_{i+1} and the scalar quantities α_i, β_{i+1} as follows:

forall j = 1 to n
    (q_i)_j = (A σ_i)_j    (26)
forall j = 1 to n (reduction + t_i)
    t_i = (σ_i)_j ∗ (q_i)_j    (27)
α_i = p_i / t_i    (28)
forall j = 1 to n
    (u_{i+1})_j = (u_i)_j + α_i (σ_i)_j    (29)
    (r_{i+1})_j = (r_i)_j − α_i (q_i)_j    (30)
if δl = 1 then
    forall j = 1 to n
        (r*_{i+1})_j = (r_{i+1})_j / (d^2)_j    (31)
else
    forall j = 1 to n
        (r*_{i+1})_j = ( Σ_{k=max(1, j−δl+1)}^{j} μ̂_{n+1−j, j+1−k} (r_{i+1})_k / d_k
                       + Σ_{k=j+1}^{min(n, j+δl−1)} μ̂_{n+1−k, δl+k−j} (r_{i+1})_k / d_k ) / d_j    (32)
forall j = 1 to n (reduction + p_{i+1})
    p_{i+1} = (r_{i+1})_j ∗ (r*_{i+1})_j    (33)
β_{i+1} = p_{i+1} / p_i    (34)
forall j = 1 to n
    (σ_{i+1})_j = (r*_{i+1})_j + β_{i+1} (σ_i)_j    (35)

It should be noted that the parallelization of the coefficient matrix A × vector operation has been implemented by taking advantage of the sparsity of the coefficient matrix A.
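As an indication of how such a product can exploit the banded structure of Figure 1, the following C sketch parallelizes q = Aσ over the rows with the same parallel for/static scheduling used throughout; the storage layout (main diagonal a0, band V stored as v[i][k] = A(i, i+m−1+k) for k = 0, ..., ℓ1−1, band W as w[i][k] = A(i, i+p−1+k) for k = 0, ..., ℓ2−1) and all names are illustrative assumptions, not the authors' data structures.

/* Sketch: q = A * sigma for the symmetric banded coefficient matrix,
   using the assumed diagonal-plus-bands storage described above. */
void banded_symm_matvec(int n, int m, int p, int l1, int l2,
                        double *a0, double **v, double **w,
                        double *sigma, double *q)
{
    int i, k, j;
    double sum;
    #pragma omp parallel for schedule(static) private(i, k, j, sum)
    for (i = 1; i <= n; i++) {
        sum = a0[i] * sigma[i];
        for (k = 0; k < l1; k++) {          /* band V at semi-bandwidth m */
            j = i + m - 1 + k;              /* upper triangular entries   */
            if (j <= n) sum += v[i][k] * sigma[j];
            j = i - m + 1 - k;              /* symmetric lower entries    */
            if (j >= 1) sum += v[j][k] * sigma[j];
        }
        for (k = 0; k < l2; k++) {          /* band W at semi-bandwidth p */
            j = i + p - 1 + k;
            if (j <= n) sum += w[i][k] * sigma[j];
            j = i - p + 1 - k;
            if (j >= 1) sum += w[j][k] * sigma[j];
        }
        q[i] = sum;
    }
}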

4. NUMERICAL RESULTS

In this section we examine the applicability and effectiveness of the proposed parallel schemes for solving sparse linear systems.

Let us consider a 3D boundary value problem with the Dirichlet boundary conditions:

u_xx + u_yy + u_zz + u = F,    (x, y, z) ∈ R    (36)

u(x, y, z) = 0,    (x, y, z) ∈ ∂R    (36a)

where R is the unit cube and ∂R denotes the boundary of R. The domain is covered by a non-overlapping triangular network resulting in a hexagonal mesh. The right-hand side vector of system (1) was computed as the product of matrix A by the solution vector, with its components equal to unity. The 'fill-in' parameters were set to r1 = r2 = 2 and the width parameters were set to ℓ1 = ℓ2 = 3. The iterative process was terminated when ‖(u_{i+1} − u_i)/(1 + u_{i+1})‖_∞ < 10^{−5}.

The numerical results presented in this section were obtained on an SMP machine consisting of 16 2.2 GHz Dual Core AMD Opteron processors, with 32 GB RAM, running Debian GNU/Linux (National University of Ireland, Galway). For the parallel implementation of the algorithms presented, the Intel C Compiler v9.0 with OpenMP directives has been utilized with no optimization enabled at the compilation level. It should be noted that due to administrative policies, we were not able to explore the full processor resources (i.e. more than eight threads).

In our implementation, the parallel for pragma has been used in order to generate code that forks/joins threads, in all cases. Additionally, static scheduling has been used (schedule(static)), whereas the use of dynamic scheduling has not produced improved results.

For example, the inner product of Equation (33) is parallelized by the following OpenMP code:

#pragma omp parallel for schedule(static) shared(ri, rr) private(j) reduction(+:p)
for (j = 1; j <= n; j++)
    p = p + ri[j] * rr[j];


while the parallel multiplication of the optimized approximate inverse × vector of Equations (31)–(32) is performed as follows:

if (dl == 1) {
    #pragma omp parallel for schedule(static) shared(rr, ri, d) private(j)
    for (j = 1; j <= n; j++)
        rr[j] = ri[j] / (d[j] * d[j]);
}
else {
    #pragma omp parallel for schedule(static) shared(am, rr, ri, d) private(j, k, sum)
    for (j = 1; j <= n; j++)
    {
        sum = 0;
        for (k = max(1, j - dl + 1); k <= j; k++)
            sum = sum + am[n + 1 - j][j + 1 - k] * (ri[k] / d[k]);
        for (k = j + 1; k <= min(n, j + dl - 1); k++)
            sum = sum + am[n + 1 - k][dl + k - j] * (ri[k] / d[k]);
        rr[j] = sum / d[j];
    }
}

The parallel execution time and the speedups of the PAND-NOROBAIFEM-3D method for several values of the 'retention' parameter δl with n = 8000, m = 21, p = 401 are given in Tables II and III, respectively. The execution time presented in Table II is the mean value over 30 runs for each case. It should be noted that the standard deviation computed for the test runs was kept at low levels, e.g. for δl = p/2 and eight processors the standard deviation for the PAND-NOROBAIFEM-3D method was 0.0375 and the mean value of the execution time was 4.5282 s.

Table II. Execution time (s) and processors allocated for the PAND-NOROBAIFEM-3D method, for several values of δl, with n = 8000, m = 21 and p = 401.

Timings for the PAND-NOROBAIFEM-3D method

                            Number of processors
'Retention' parameter        1           2           4           8
δl = m/2                  1.7292      0.9434      0.5458      0.4223
δl = m                    3.2507      1.7722      1.0234      0.7162
δl = 2m                   6.4212      3.4423      1.9511      1.0909
δl = p/2                 30.6781     16.3718      8.3030      4.5282
δl = p                   61.5420     32.7402     16.6475      9.0323
δl = 2p                 122.9364     64.2896     33.1635     17.8452
δl = 4p                 244.4595    127.7553     65.9396     35.3248
δl = 6p                 366.6445    191.5986     98.8922     52.2218


Table III. Speedups and processors allocated for the PAND-NOROBAIFEM-3D method, for several values of δl, with n = 8000, m = 21 and p = 401.

Speedups for the PAND-NOROBAIFEM-3D method

                            Number of processors
'Retention' parameter        2           4           8
δl = m/2                  1.8328      3.1682      4.0949
δl = m                    1.8343      3.1765      4.5388
δl = 2m                   1.8654      3.2910      5.8861
δl = p/2                  1.8738      3.6948      6.7750
δl = p                    1.8797      3.6968      6.8135
δl = 2p                   1.9122      3.7070      6.8891
δl = 4p                   1.9135      3.7073      6.9203
δl = 6p                   1.9136      3.7075      7.0209

Figure 5. Parallel speedups of the PAND-NOROBAIFEM-3D method with the computed theoretical estimates, for several values of δl, with n = 8000, m = 21 and p = 401.

In Figure 5 the parallel speedups for several values of the 'retention' parameter δl are presented for the PAND-NOROBAIFEM-3D method along with the theoretically computed estimates, for n = 8000, m = 21 and p = 401. It should be mentioned that the computation of the derived theoretical estimates was based on the analysis presented in Section 2, where no latencies have been taken into consideration.

The parallel execution time and the speedups of the PNEPCG method for several values of the 'retention' parameter δl with n = 8000, m = 21, p = 401 are given in Tables IV and V, respectively. The execution time presented in Table IV is the mean value over 30 runs for each case. It should be noted that the standard deviation computed for the test runs was kept at low levels, e.g. for δl = p/2 and eight processors the standard deviation for the PNEPCG method was 0.0069 and the mean value of the execution time was 0.1389 s.


Table IV. Execution time (s) and processors allocated for the PNEPCG method, for several values of δl, with n = 8000, m = 21 and p = 401.

Timings for the PNEPCG method

                            Number of processors
'Retention' parameter        1           2           4           8
δl = 1                    0.0455      0.0247      0.0153      0.0116
δl = 2                    0.0876      0.0471      0.0288      0.0168
δl = m/2                  0.1339      0.0714      0.0433      0.0227
δl = m                    0.1586      0.0842      0.0509      0.0265
δl = 2m                   0.2434      0.1289      0.0757      0.0370
δl = p/2                  0.9142      0.4717      0.2662      0.1389
δl = p                    1.4722      0.7569      0.4236      0.2230
δl = 2p                   3.4768      1.7829      0.9955      0.5202
δl = 4p                   7.3666      3.7050      2.0489      1.0785
δl = 6p                   9.7077      4.8600      2.6367      1.4109

Table V. Speedups and processors allocated for the PNEPCG method, for several values of δl, with n = 8000, m = 21 and p = 401.

Speedups for the PNEPCG method

                            Number of processors
'Retention' parameter        2           4           8
δl = 1                    1.8447      2.9661      3.9251
δl = 2                    1.8600      3.0394      5.2025
δl = m/2                  1.8768      3.0940      5.9008
δl = m                    1.8832      3.1175      5.9841
δl = 2m                   1.8887      3.2139      6.5709
δl = p/2                  1.9383      3.4350      6.5811
δl = p                    1.9449      3.4750      6.6014
δl = 2p                   1.9501      3.4924      6.6834
δl = 4p                   1.9883      3.5954      6.8305
δl = 6p                   1.9975      3.6817      6.8805

In Figure 6 the parallel speedups for several values of the 'retention' parameter δl are presented for the PNEPCG method with n = 8000, m = 21, p = 401.

It should be mentioned that for large values of the 'retention' parameter, i.e. multiples of the semi-bandwidths m and p, the speedups and the efficiency tend to the upper theoretical bound, for both the parallel construction of the approximate inverse and the parallel normalized preconditioned conjugate gradient method, as the coarse granularity amortizes the parallelization overheads. For small values of the 'retention' parameter, i.e. δl = 1, 2, the fine granularity is responsible for the low parallel performance, as the parallel operations reduce to simple ones, such as inner products, and the hardware platform is underutilized.


Figure 6. Parallel speedups of the PNEPCG method, for several values of δl, with n = 8000, m = 21 and p = 401.

5. CONCLUSIONS

The design of parallel explicit approximate inverses and preconditioned conjugate gradient-type schemes results in efficient parallel methods for solving linear systems on symmetric multiprocessor systems. The proposed approach for eliminating data dependencies, based on the antidiagonal motion through the banded form of the approximate inverse, has been implemented with OpenMP and tested on a symmetric multiprocessor system, providing good parallel results. The main advantage of the proposed method is that the approximate inverse is computed explicitly and can be efficiently used in conjunction with parallel preconditioned conjugate gradient-type methods.

Finally, further parallel algorithmic techniques will be investigated in order to improve the parallel performance of the normalized explicit approximate inverse preconditioning on symmetric multiprocessor systems, particularly by increasing the computational work output per processor and eliminating process synchronization and any associated latencies.

ACKNOWLEDGEMENTS

The authors would like to thank Dr John Morrison of the Department of Computer Science, University College Cork, for the provision of computational facilities and support through the WebCom-G project funded by Science Foundation Ireland.

REFERENCES

1. Benzi M. Preconditioning techniques for large linear systems: A survey. Journal of Computational Physics 2002; 182:418–477.

2. Dongarra JJ, Duff I, Sorensen D, van der Vorst HA. Solving Linear Systems on Vector and Shared Memory Computers. SIAM: Philadelphia, PA, 1991.

3. Saad Y, van der Vorst HA. Iterative solution of linear systems in the 20th century. Journal of Computational and Applied Mathematics 2000; 123:1–33.

4. van der Vorst HA. High performance preconditioning. SIAM Journal on Scientific and Statistical Computing 1989; 10:1174–1185.

5. Lipitakis EA, Evans DJ. Explicit semi-direct methods based on approximate inverse matrix techniques for solving boundary value problems on parallel processors. Mathematics and Computers in Simulation 1987; 29:1–17.

6. Lipitakis EA, Gravvanis GA. Explicit preconditioned iterative methods for solving large unsymmetric finite element systems. Computing 1995; 54(2):167–183.

7. Huckle T. Approximate sparsity patterns for the inverse of a matrix and preconditioning. Applied Numerical Mathematics 1999; 30:291–303.

8. Huckle T. Efficient computations of sparse approximate inverses. Numerical Linear Algebra with Applications 1998; 5:57–71.

9. Chan TF, van der Vorst HA. Approximate and incomplete factorizations. Parallel Numerical Algorithms, Keyes DE, Sameh A, Venkatakrishnan V (eds.). ICASE/LaRC Interdisciplinary Series in Science and Engineering. Kluwer Academic Publishers: Dordrecht, 1997; 167–202.

10. van der Vorst HA. A vectorizable variant of some ICCG methods. SIAM Journal on Scientific and Statistical Computing 1982; 3:350–356.

11. Akcadogan C, Dag H. A parallel implementation of Chebyshev preconditioned conjugate gradient method. Second International Symposium on Parallel and Distributed Computing. IEEE Press: New York, 2003; 1–8.

12. Benzi M, Meyer CD, Tuma M. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAM Journal on Scientific Computing 1996; 17:1135–1149.

13. Bergamaschi L, Gambolati G, Pini G. A numerical experimental study of inverse preconditioning for the parallel iterative solution to 3D finite element flow equations. Journal of Computational and Applied Mathematics 2007; 210:64–70.

14. Cosgrove JDF, Dias JC, Griewank A. Approximate inverse preconditioning for sparse linear systems. International Journal of Computer Mathematics 1992; 44:91–110.

15. Dubois P, Greenbaum A, Rodrigue G. Approximating the inverse of a matrix for use in iterative algorithms on vector processors. Computing 1979; 22:257–268.

16. Kolotilina LY, Yeremin AY. Factorized sparse approximate inverse preconditioning. SIAM Journal on Matrix Analysis and Applications 1993; 14:45–58.

17. Grote MJ, Huckle T. Parallel preconditioning with sparse approximate inverses. SIAM Journal on Scientific Computing 1997; 18:838–853.

18. Akl SG. Parallel Computation: Models and Methods. Prentice-Hall: Englewood Cliffs, NJ, 1997.

19. Grama A, Gupta A, Karypis G, Kumar V. Introduction to Parallel Computing. Addison-Wesley: Reading, MA, 2003.

20. Gravvanis GA, Giannoutakis KM. Distributed finite element normalized approximate inverse preconditioning. Computer Modeling in Engineering and Sciences 2006; 16(2):69–82.

21. Giannoutakis KM, Gravvanis GA. A performance study of normalized explicit finite element approximate inverse preconditioning on uniprocessor and multicomputer systems. Engineering Computations 2005; 23(3):192–217.

22. Gravvanis GA. Parallel matrix techniques. Computational Fluid Dynamics, vol. I, Papailiou K, Tsahalis D, Periaux J, Hirsch C, Pandolfi M (eds.). Wiley: New York, 1998; 472–477.
