A parallel Aitken-additive Schwarz waveform relaxation suitable for the grid

Hatem Ltaief a,*, Marc Garbey b

a University of Tennessee, Department of Electrical Engineering and Computer Science, 414 Ferris Hall, 1508 Middle Drive, Knoxville, TN 37996-2100, USA
b University of Houston, Department of Computer Science, 501 Philip G. Hoffman Hall, Houston, TX 77204-3010, USA

Article info

Article history:
Received 14 May 2008
Received in revised form 20 December 2008
Accepted 7 May 2009
Available online 18 May 2009

Keywords:
Heat equation
Domain decomposition
Aitken-like acceleration
Additive Schwarz
Parabolic operators
Grid computing


* Corresponding author. Tel.: +1 865 974 9985; fax: +1 865 974 8296. E-mail addresses: [email protected] (H. Ltaief), [email protected] (M. Garbey).

Abstract

The objective of this paper is to describe a grid-efficient parallel implementation of the Aitken–Schwarz waveform relaxation method for the heat equation problem. This new parallel domain decomposition algorithm, introduced by Garbey [M. Garbey, A direct solver for the heat equation with domain decomposition in space and time, in: U. Langer et al. (Eds.), Domain Decomposition in Science and Engineering XVII, Springer, vol. 60, 2007, pp. 501–508], generalizes the Aitken-like acceleration method of the additive Schwarz algorithm for elliptic problems. Although the standard Schwarz waveform relaxation algorithm has a linear rate of convergence and low numerical efficiency, it can be easily optimized with respect to cache memory access and it scales well on a parallel system as the number of subdomains increases. The Aitken-like acceleration method transforms the Schwarz algorithm into a direct solver for the parabolic problem when one knows a priori the eigenvectors of the trace transfer operator. A standard example is the linear three dimensional heat equation problem discretized with a seven point scheme on a regular Cartesian grid. The core idea of the method is to postprocess the sequence of interfaces generated by the additive Schwarz waveform relaxation solver. The parallel implementation of the domain decomposition algorithm presented here is capable of achieving robustness and scalability in heterogeneous distributed computing environments and it is also naturally fault tolerant. All these features make such a numerical solver ideal for computational grid environments. This paper presents experimental results with a few loosely coupled parallel systems, remotely connected through the internet, located in Europe, Russia and the USA.

Published by Elsevier B.V.

1. Introduction

Computational grids offer an incredible amount of resources, geographically distributed, to scientists and researchers [1]. But the poor performance of the interconnection network is an unacceptable restricting factor for most parallel applications based on partial differential equations, where intensive communications are widely used. Moreover, standard processors are becoming multi-core and there is a strong incentive to make use of all these parallel resources while avoiding conflicts in memory access [2].

This paper contributes to the general goal of developing new parallel algorithms suitable for computational grid constraints and fitting the requirements of the next evolution in computing hardware.

The objective of this paper is to present a grid-efficient parallel implementation of the Aitken-additive Schwarz waveform relaxation method (AASWR) for the heat equation problem introduced in [3].




This new parallel domain decomposition algorithm generalizes the Aitken-like acceleration method [4] of the additive Schwarz algorithm for elliptic problems [5–8] to parabolic problems. We focus here on the heat equation problem discretized on a Cartesian grid by finite volumes or finite differences. However, this technique can be generalized easily to separable linear parabolic operators. The parallel implementation for a computational grid of the AASWR for parabolic problems is more complex than its analogue with the Aitken-like acceleration method of the additive Schwarz algorithm (AS) for elliptic problems [9,10]. For elliptic problems, we have to focus on the way interface conditions are processed on a distributed computing environment. We take advantage of the spectral decomposition of the interface to optimize the communication scheme with respect to the distance between subdomains [9]. For parabolic problems, the subdomains are now represented in space and time. The artificial interfaces are in space and time as well. Therefore we have the flexibility in this decomposition to choose the time interval over which to accumulate the interface solutions, and we can efficiently introduce asynchronism to hide the expensive communication flows behind the intensive computation. This is a remarkable feature of the AASWR algorithm, which can be a direct solver for separable linear parabolic problems. We will call this grid-efficient technique the Parallel AASWR version (PAASWR).

The classical Schwarz method [11] has been extensively analyzed in the past, see for example the book by Smith et al. [12] and its references. Nowadays, the Schwarz method is used as an iterative Domain Decomposition (DD) method or a preconditioner for a Krylov method. The state of the art to speed up AS methods is typically based on either a coarse grid preconditioner [12,13] or the Optimization of the Transmission Conditions (OTC). We refer to [14–16] for a recent survey of these OTC methods. A coarse grid preconditioner does not scale well on parallel architectures with a slow network unless the discrete problem is unnecessarily huge. To our knowledge, there are no OTC parallel implementations available that can scale in a grid computing environment.

The algorithm presented in this paper is a different and somewhat complementary approach to OTC. The fundamental concept of the method is to postprocess the sequence of interfaces generated by the domain decomposition solver. It can accommodate any kind of interface boundary conditions provided that one knows, a priori, the eigenvectors of the trace transfer operator. This is doable for the heat equation problem discretized on a regular Cartesian grid, but it might also be achieved numerically with embarrassing parallelism for separable linear operators [5,6].

Furthermore, the PAASWR method for parabolic problems is able to achieve three important features on the grid:

• Scalability: by solving each subdomain in an embarrassingly parallel fashion,
• Robustness: by exploiting very simple and systematic communication patterns, and
• Fault tolerance: by exchanging only the interface solution data at an optimal, empirically defined time iteration interval.

PAASWR minimizes the number of messages sent thanks to the Aitken-like acceleration and is very insensitive to delays due to a high latency network. Let us note also that the subdomain solver can be optimized independently of the overall implementation, which can make the PAASWR algorithm friendly to cache use and may provide performance portability. We will concentrate in this paper on the parallel performance of our algorithm with a few loosely coupled parallel systems located in Europe, Russia and the USA.

Fault tolerance is an important component of grid computing, since geographically broad networks are not very reliable and are more subject to failures. It is particularly critical for evolution problems, as considered here, that may run for an extensive period of time. The asynchronous communication scheme described in our PAASWR implementation may recover from short time failures of the network. But if an entire cluster of processors fails, we must reconstruct the corresponding subdomain solution. Because PAASWR deals with space–time subdomains and scatter/gather space–time boundary conditions, this is easily feasible in principle, as opposed to standard time stepping schemes that require special recovery techniques as in [17]. This justifies the "naturally" fault tolerant appellation for the PAASWR algorithm.

The paper is organized as follows. Section 2 recalls the fundamental steps of the three dimensional space and time PAASWR algorithm for the heat equation. Section 3 describes the different approaches for the parallel implementation. Section 4 presents a comparison of the experimental results performed on a single parallel machine and on distributed grid computing. Finally, Section 5 summarizes the study and defines the future work plan.

2. The PAASWR algorithm

2.1. Definition of the problem and its discretization

First, let us assume that the domain Ω is a unit cube discretized by a Cartesian grid with arbitrary space steps in each direction. Let us consider the Initial Boundary Value Problem (IBVP):

\frac{\partial u}{\partial t} = L[u] + f(x, y, z, t), \qquad (1)

u(x, y, z, 0) = u_o(x, y, z), \qquad (2)
u(0, y, z, t) = a(y, z, t), \quad u(1, y, z, t) = b(y, z, t), \qquad (3)
u(x, 0, z, t) = c(x, z, t), \quad u(x, 1, z, t) = d(x, z, t), \qquad (4)
u(x, y, 0, t) = e(x, y, t), \quad u(x, y, 1, t) = f(x, y, t), \quad (x, y, z, t) \in \Omega = (0,1)^3 \times (0, T), \qquad (5)

where L is a separable second order linear elliptic operator. We assume that the problem is well posed and has a unique solution. We introduce the following discretization in space

0 = x_0 < x_1 < \cdots < x_{N_x-1} < x_{N_x} = 1, \quad h_{x_j} = x_j - x_{j-1},

0 = y_0 < y_1 < \cdots < y_{N_y-1} < y_{N_y} = 1, \quad h_{y_j} = y_j - y_{j-1},

0 = z_0 < z_1 < \cdots < z_{N_z-1} < z_{N_z} = 1, \quad h_{z_j} = z_j - z_{j-1},

and time

t^n = n\,\delta t, \quad n \in \{0, \ldots, M\}, \quad \delta t = T/M.

For the heat equation, the separable second order linear elliptic operator L is:

L = L_1 + L_2 + L_3 \quad \text{with} \quad L_1 = \partial_{xx}, \; L_2 = \partial_{yy}, \; L_3 = \partial_{zz}.

We write the discretized problem as follows

\frac{U^{n+1} - U^n}{\delta t} = D_{xx}[U^{n+1}] + D_{yy}[U^{n+1}] + D_{zz}[U^{n+1}] + f(X, Y, Z, t^{n+1}), \quad n = 0, \ldots, M-1, \qquad (6)

with appropriate boundary conditions corresponding to (3)–(5). D_xx is the approximation of L_1 and we use either second order central finite differences or finite volumes. D_yy and D_zz are defined in a similar manner. The time stepping is a first order implicit scheme, but higher order time stepping schemes can be considered.
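To fix ideas, the following sketch assembles the second order central difference operator D_xx and performs one step of the implicit scheme (6), restricted to one space dimension and a uniform grid for brevity. It is an illustration only, not the implementation used in the paper (which treats the full 3D problem with arbitrary space steps).

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def dxx_matrix(nx, h):
    """Second order central finite difference approximation of d^2/dx^2 on nx
    interior points of a uniform grid with homogeneous Dirichlet conditions."""
    main = -2.0 * np.ones(nx)
    off = np.ones(nx - 1)
    return sp.diags([off, main, off], [-1, 0, 1]) / h**2

def backward_euler_step(u_n, f_np1, dxx, dt):
    """One step of the first order implicit scheme (6), restricted to 1D:
    (I/dt - Dxx) U^{n+1} = U^n/dt + f^{n+1}."""
    nx = u_n.size
    lhs = (sp.identity(nx) / dt - dxx).tocsc()
    return spla.spsolve(lhs, u_n / dt + f_np1)

# toy usage: 95 interior points (h = 1/96), one implicit step from u0 = sin(pi x)
nx, h, dt = 95, 1.0 / 96, 1e-3
x = np.arange(1, nx + 1) * h
u = backward_euler_step(np.sin(np.pi * x), np.zeros(nx), dxx_matrix(nx, h), dt)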

Introducing the matrices U = (U^1, \ldots, U^M) that correspond to the discrete solution at time levels (t^1, \ldots, t^M), we can write the time integration scheme (6) as a large linear system:

A U = F. \qquad (7)

We now recall the additive Schwarz waveform relaxation algorithm (ASWR).

2.2. ASWR algorithm

The spatial domain Ω = (0,1)^3 is decomposed into q overlapping strips Ω_i = (X^i_l, X^i_r) × (0,1)^2, i = 1, ..., q, with X^2_l < X^1_r < X^3_l < X^2_r, ..., X^q_l < X^{q-1}_r, as in Fig. 1. In the ASWR algorithm, the corresponding left and right (artificial) interfaces of each subdomain write

Y^{i,l/r} = \{X^i_{l/r}\} \times (0,1)^2 \times (0, T).

The m+1 iteration of the ASWR algorithm writes

# Loop over the number of subdomains
for i = 1, ..., q
    # Compute the subdomain solutions
    A_i V^i_{m+1} = F_i  in  Ω_i × (0, T)
    # Update the LEFT interface solutions
    V^i_{m+1}(Y^{i,l}) = V^{i-1}_m(Y^{i,l})
    # Update the RIGHT interface solutions
    V^i_{m+1}(Y^{i,r}) = V^{i+1}_m(Y^{i,r})
end

Fig. 1. Domain Decomposition in 3D space with 3 overlapping subdomains.

where A_i is the sub-block of A that corresponds to the discretization of the IBVP problem and V^i_{m+1} is the discrete solution at iteration m+1 for the subdomain Ω_i × (0, T). To initiate this procedure, one provides an initial condition V^{i,l/r}_0 at the artificial interfaces Y^{i,l/r}, respectively. This algorithm generates a sequence of two dimensional matrices W_m = (V^{2,l}_m, V^{1,r}_m, V^{3,l}_m, V^{2,r}_m, \ldots, V^{q,l}_m) corresponding to the boundary values (Y^{2,l}, Y^{1,r}, Y^{3,l}, Y^{2,r}, \ldots, Y^{q,l}) at the iterate m.
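As a minimal illustration, the sketch below performs one ASWR iteration for the one dimensional heat equation with homogeneous Dirichlet boundary conditions: every overlapping strip is solved independently over all M time levels with the interface traces of iterate m, and the traces of iterate m+1 are then read inside the neighbouring strips. This is a simplified 1D analogue of the space–time algorithm above, with a dense per-step solve for clarity, not the authors' 3D implementation.

import numpy as np

def subdomain_solve(u0, g_left, g_right, h, dt):
    """Backward Euler solve of u_t = u_xx on one space-time strip.
    u0: initial values on the strip's n interior nodes.
    g_left, g_right: Dirichlet traces on the two interfaces, one value per
    time level t^1..t^M.  Returns the n x M space-time solution."""
    n, M = u0.size, g_left.size
    A = (np.diag((1.0 / dt + 2.0 / h**2) * np.ones(n))
         + np.diag(-np.ones(n - 1) / h**2, 1)
         + np.diag(-np.ones(n - 1) / h**2, -1))
    V, u = np.empty((n, M)), u0.copy()
    for m in range(M):
        rhs = u / dt
        rhs[0] += g_left[m] / h**2      # left Dirichlet value enters the rhs
        rhs[-1] += g_right[m] / h**2    # right Dirichlet value enters the rhs
        u = np.linalg.solve(A, rhs)
        V[:, m] = u
    return V

def aswr_iteration(U0, blocks, traces, h, dt, M):
    """One ASWR sweep.  blocks[i] = (lo, hi) is the global interior-node range
    of the (overlapping) strip i; traces[i] = (g_left, g_right) are the
    iterate-m interface values used as Dirichlet data for strip i.  Returns the
    subdomain solutions and the iterate-(m+1) traces."""
    q = len(blocks)
    V = [subdomain_solve(U0[lo:hi], traces[i][0], traces[i][1], h, dt)
         for i, (lo, hi) in enumerate(blocks)]
    new_traces = []
    for i, (lo, hi) in enumerate(blocks):
        # the left interface of strip i is global node lo-1, owned by strip i-1;
        # the right interface is global node hi, owned by strip i+1.  The
        # physical boundaries keep the homogeneous Dirichlet value 0.
        g_left = np.zeros(M) if i == 0 else V[i - 1][lo - 1 - blocks[i - 1][0], :]
        g_right = np.zeros(M) if i == q - 1 else V[i + 1][hi - blocks[i + 1][0], :]
        new_traces.append((g_left, g_right))
    return V, new_traces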

The proof of convergence of the additive Schwarz waveform relaxation on the continuous problem (5) with the heat equation given in [18] is based on the maximum principle. The convergence of the ASWR algorithm at the discrete level follows from a discrete maximum principle as well and applies, for example, to the classical seven point finite difference scheme for the three dimensional heat equation problem. Because the parabolic problem (5) is linear, the trace transfer operator

T : W_m - W_\infty \to W_{m+1} - W_\infty \qquad (8)

is linear. In the next section, to simplify the presentation, we will deal with homogeneous Dirichlet boundary conditions. The case of non-homogeneous boundary conditions is handled by a shift using the superposition principle as in [19].

2.3. Diagonalization of the trace transfer operator

We introduce the following expansions for the discrete solution using the sine base functions that satisfy homogeneous Dirichlet boundary conditions:

U^n(X, Y, Z, t) = \sum_{j=1}^{M_y} \sum_{k=1}^{M_z} \Lambda^n_{j,k}(X, t) \sin(jY) \sin(kZ), \qquad (9)

u_o(X, Y, Z) = \sum_{j=1}^{M_y} \sum_{k=1}^{M_z} \lambda_{j,k}(X) \sin(jY) \sin(kZ), \qquad (10)

and

f^n(X, Y, Z, t) = \sum_{j=1}^{M_y} \sum_{k=1}^{M_z} f^n_{j,k}(X, t) \sin(jY) \sin(kZ). \qquad (11)

M_y and M_z are the number of modes in the Y and Z directions. By plugging the discrete solution (9)–(11) into (6), we end up with the following set of M_y × M_z independent one dimensional problems based on the Helmholtz operator:

\frac{\Lambda^{n+1}_{j,k} - \Lambda^n_{j,k}}{\delta t} = D_{xx}[\Lambda^{n+1}_{j,k}] - (\mu_j + \mu_k)\Lambda^{n+1}_{j,k} + f^{n+1}_{j,k}(X, t), \quad n = 0, \ldots, M-1, \qquad (12)

\Lambda^0_{j,k} = \lambda_{j,k}(X). \qquad (13)

μ_j and μ_k are respectively the eigenvalues of D_yy and D_zz. Eq. (12) is used to diagonalize the trace transfer operator, but not necessarily for the subdomain processing, since a multigrid method might be faster than a spectral decomposition method.
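For a uniform grid in y with the standard three point stencil and homogeneous Dirichlet conditions, the discrete sine vectors are eigenvectors of D_yy with the classical eigenvalues −(4/h²) sin²(jπh/2); the paper's μ_j correspond to these up to the sign convention absorbed in (12). Below is a quick numerical check under that uniform-grid assumption (for arbitrary space steps or finite volumes the modes and eigenvalues differ; see [6] for the irregular space step case).

import numpy as np

# Uniform grid in y, homogeneous Dirichlet, standard three point stencil:
# the discrete sine vectors are eigenvectors of Dyy with the classical
# eigenvalues -(4/h^2) sin^2(j*pi*h/2).
Ny = 96                      # number of cells; Ny - 1 interior nodes
h = 1.0 / Ny
y = np.arange(1, Ny) * h     # interior nodes
Dyy = (np.diag(-2.0 * np.ones(Ny - 1))
       + np.diag(np.ones(Ny - 2), 1)
       + np.diag(np.ones(Ny - 2), -1)) / h**2

for j in (1, 5, 7):
    v = np.sin(j * np.pi * y)                               # discrete sine mode
    mu = -4.0 / h**2 * np.sin(j * np.pi * h / 2.0) ** 2     # closed-form eigenvalue
    print(j, np.allclose(Dyy @ v, mu * v), mu)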

In the subspace corresponding to the eigenvector sin(jY) sin(kZ), the AASWR algorithm generates a sequence of vectors

\hat{W}^m_{j,k} = \left( \Lambda^m_{j,k}(X^2_l, t^1, \ldots, t^M), \Lambda^m_{j,k}(X^1_r, t^1, \ldots, t^M), \ldots, \Lambda^m_{j,k}(X^q_l, t^1, \ldots, t^M), \Lambda^m_{j,k}(X^{q-1}_r, t^1, \ldots, t^M) \right) \qquad (14)

corresponding to the boundary values of the m-th iterate Λ^m_{j,k} on the set

S = (X^2_l, X^1_r, X^3_l, X^2_r, \ldots, X^q_l, X^{q-1}_r) \times (t^1, \ldots, t^M).

The trace transfer operator is decomposed into M_y × M_z independent trace transfer operators as well and is defined as follows:

\hat{W}^n_{j,k} - \hat{W}^\infty_{j,k} \to \hat{W}^{n+1}_{j,k} - \hat{W}^\infty_{j,k}.


Let P_{j,k} be the matrix of this linear operator. P_{j,k} has the following pentadiagonal structure:

\begin{pmatrix}
0 & P^r_{j,k}(1) & 0 & 0 & \cdots \\
P^{l,l}_{j,k}(2) & 0 & 0 & P^{l,r}_{j,k}(2) & \cdots \\
P^{r,l}_{j,k}(2) & 0 & 0 & P^{r,r}_{j,k}(2) & \cdots \\
\cdots & P^{l,l}_{j,k}(q-1) & 0 & 0 & P^{l,r}_{j,k}(q-1) \\
\cdots & P^{r,l}_{j,k}(q-1) & 0 & 0 & P^{r,r}_{j,k}(q-1) \\
\cdots & 0 & 0 & P^{l}_{j,k}(q) & 0
\end{pmatrix}

P_{j,k} is a matrix of size (2(q-1)(M-1))^2 with blocks P^{l,l}_{j,k}(i), P^{l,r}_{j,k}(i), P^{r,l}_{j,k}(i), P^{r,r}_{j,k}(i) that are square matrices of size (M-1)^2 for subdomain (i). Let us use the same generic notation Id for the matrix of the identity operator no matter the dimension of the matrix. If the matrix P_{j,k} is known and the matrix Id - P_{j,k} is regular, one step of the ASWR provides enough information to reconstruct the exact interface values by solving the linear system

(Id - P_{j,k})\,\hat{W}^\infty_{j,k} = \hat{W}^1_{j,k} - P_{j,k}\,\hat{W}^0_{j,k}, \quad \forall j \in \{1, \ldots, M_y\}, \; \forall k \in \{1, \ldots, M_z\}. \qquad (15)

P_{j,k} can also be computed prior to the PAASWR with embarrassing parallelism, once and for all.

Remark 1. For a separable parabolic operator, the elements of P_{j,k} can be computed numerically following the method of [5,6].

Remark 2. The domain decomposition is one dimensional here. Multi dimensional decomposition can be achieved with the multilevel domain decomposition version of the algorithm as in [19].
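Because the trace transfer operator of each mode is linear, its matrix can also be obtained by naive probing: apply one homogeneous-data ASWR sweep (zero right-hand side, zero initial data) to each unit interface vector and record the result as a column. This is not the method of [5,6], only a generic sketch; the probes are mutually independent, which is the embarrassing parallelism mentioned above, and the acceleration (15) is then a single linear solve. The callable trace_transfer_jk below is a hypothetical black box.

import numpy as np

def probe_trace_transfer_matrix(trace_transfer_jk, n_iface):
    """Build P_{j,k} column by column by probing the linear trace transfer
    operator of mode (j,k) with unit interface vectors.
    trace_transfer_jk: hypothetical black box performing one ASWR sweep for
    the mode with zero forcing and zero initial data, mapping an interface
    vector of length n_iface (2(q-1) interfaces times the time levels) to the
    next iterate's interface vector.  The columns are mutually independent,
    hence embarrassingly parallel, and can be computed once for all."""
    P = np.empty((n_iface, n_iface))
    for col in range(n_iface):
        e = np.zeros(n_iface)
        e[col] = 1.0
        P[:, col] = trace_transfer_jk(e)
    return P

def aitken_acceleration(P, w0, w1):
    """Aitken-like acceleration (15): recover the exact interface values
    of one mode from its first two iterates w0 and w1."""
    return np.linalg.solve(np.eye(P.shape[0]) - P, w1 - P @ w0)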

In the next section, we describe the major components of the algorithm.

2.4. General algorithm

The general PAASWR algorithm can then be summarized in three steps:

• Step A: Compute the 1st iterate of ASWR for the 3D parabolic problem (1).
• Step B: Apply the generalized Aitken acceleration to compute the exact internal boundary conditions W_∞, that is:
  (Step B.1) Expand the trace of the solution in the eigenvector basis,
  (Step B.2) Solve the linear problem component wise (Aitken-like acceleration) (15),
  (Step B.3) Assemble the boundary conditions:

W_\infty = \sum_{j=1}^{M_y} \sum_{k=1}^{M_z} \hat{W}^\infty_{j,k} \sin(jY) \sin(kZ).

• Step C: Compute the 2nd iterate using the exact boundary value W_∞.

The computation of each subdomain in Steps A and C can be processed by any linear solver of choice (multigrid, Krylov, etc.). In the present implementation, we use a combination of sine transforms in the y/z directions and an LU decomposition in the x direction. Step B involves dependencies between subdomains and needs particular attention for a distributed computing environment. In the following section, we will go over four approaches for the parallel implementation.
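A compact sketch of Step B follows, assuming a uniform grid in y and z so that the expansion (9)–(11) reduces to a type-I discrete sine transform; the per-mode matrices P_{j,k} are assumed precomputed and stored in a dictionary with an agreed ordering of the interface unknowns. This is an illustration of the three sub-steps, not the paper's implementation.

import numpy as np
from scipy.fft import dstn, idstn

def step_B(W0, W1, P):
    """Aitken acceleration of the interface traces (Step B).
    W0, W1: arrays of shape (n_ifc, My, Mz, M) holding the traces of the 0th
    and 1st ASWR iterates on the n_ifc artificial interfaces.
    P: dictionary mapping (j, k) to the precomputed matrix P_{j,k} of size
    (n_ifc*M) x (n_ifc*M) -- a hypothetical container and ordering."""
    n_ifc, My, Mz, M = W0.shape
    # Step B.1: expand the traces in the sine basis (uniform grid assumed,
    # so the expansion is a type-I discrete sine transform over y and z).
    w0 = dstn(W0, type=1, axes=(1, 2), norm='ortho')
    w1 = dstn(W1, type=1, axes=(1, 2), norm='ortho')
    winf = np.empty_like(w0)
    # Step B.2: solve the My*Mz decoupled interface problems (15).
    for j in range(My):
        for k in range(Mz):
            v0 = w0[:, j, k, :].ravel()
            v1 = w1[:, j, k, :].ravel()
            A = np.eye(v0.size) - P[(j, k)]
            winf[:, j, k, :] = np.linalg.solve(A, v1 - P[(j, k)] @ v0).reshape(n_ifc, M)
    # Step B.3: assemble the exact boundary conditions back in physical space.
    return idstn(winf, type=1, axes=(1, 2), norm='ortho')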

3. Description of the parallel implementation

In this section, we demonstrate how the PAASWR algorithm is able to bypass the limitations of the grid discussed in Section 1. We describe the parallel algorithm using the fundamental Message Passing Interface (MPI) paradigm [20,21] on a distributed network of processors and show that the new method provides a rigorous framework to optimize the use of parallel architectures.

3.1. The concept of subdomain solver and interface solver

From the different steps highlighted in Section 2.4, we determine two process groups: the subdomain solver (SS) processes and the interface solver (IS) processes.

The SS processes work on the computation of the solution of the IBVP (5) in each subdomain Ω_i × (0, T), i = 1, ..., q, with q the number of subdomains (Steps A and C), in an embarrassingly parallel fashion. The spectral decomposition/recomposition of the interface, respectively Steps B.1 and B.3, is performed in parallel as well. The IS processes execute the Aitken-like acceleration and solve the interface problem (Step B.2). They could also be assigned additional tasks such as solution monitoring or visualization while the application is running. They will therefore need to receive the computed solution from the SS processes at a small frequency. This can be achieved asynchronously and will only require local communications between SS and IS processes located in the same machine. The performance of such a communication scheme has been presented in [17].

Fig. 2. Example of processes distribution with three machines.

Fig. 2 draws this framework with three parallel machines. We have a one dimensional processor topology implemented in each process group. There are no local neighborhood communications within the groups and the communications are only established between the SS and the IS groups. We do not need as many IS processes as SS processes since the main time-consuming task (the subdomain computation) is performed by the SS processes. For example, in Fig. 2, each machine owns four processes, three SS processes for a single IS process.
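One possible way to set up the two process groups of Fig. 2 is to split MPI_COMM_WORLD by a color rule; the 3:1 SS/IS ratio per machine below merely mirrors the example of Fig. 2 and is not a requirement of the algorithm.

from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

# Example rule: with 4 processes per machine, let every 4th rank be an
# interface solver (IS) and the others subdomain solvers (SS), as in Fig. 2.
PROCS_PER_MACHINE = 4
is_interface_solver = (rank % PROCS_PER_MACHINE == PROCS_PER_MACHINE - 1)

# Split the world communicator into the two process groups; each group then
# has its own one dimensional topology (ranks 0..n-1 inside the group).
color = 1 if is_interface_solver else 0
group_comm = world.Split(color=color, key=rank)

print("world rank %d -> %s rank %d of %d"
      % (rank, "IS" if is_interface_solver else "SS",
         group_comm.Get_rank(), group_comm.Get_size()))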

In the next section, we justify the naturally fault tolerant appellation for the PAASWR algorithm and discuss different fault tolerance issues.

3.2. A naturally fault tolerant algorithm

The main duties of the IS processes are to solve the interface systems, to run an a posteriori error estimate and to receive checkpoints of the subdomain solutions (which can also be visualized if necessary). In case of process failures, we assume that the Fault Tolerant Runtime System Environment (FTRSE) will detect them and will spawn new processes to replace the failed ones. Two data structures are decisive for the recovery procedure: the subdomain data solutions and the interface conditions. In [17], we described two different numerical schemes to achieve a complete recovery after a failure:

• The forward implicit scheme, in which the subdomain solution has to be checkpointed every K time steps, whereas the boundary conditions have to be stored at each time step from the last checkpoint to the failure.
• The backward explicit scheme, where only the subdomain solution has to be checkpointed every K time steps, whereas the subsequent boundary conditions are computed by the neighbors of the failed process and provided to the freshly restarted process.

The cost of checkpointing the subdomain solutions is negligible and we refer to [17] for performance details. On the other hand, by receiving and working on the interface conditions, the IS group indirectly checkpoints the application. We can then distinguish three main failure scenarios:

• In case of some SS process failures, the lost subdomain solutions can be easily retrieved on the fly. First, the IS processes that hold the interface conditions and the checkpointed subdomain solutions (used also for the visualization) provide this information to the newly restarted SS processes. Second, those new SS processes can use the same subdomain solver to rebuild the lost data, since the corresponding internal boundary conditions are known. The application then recovers the same numerical solution by applying the forward implicit scheme.
• If some IS processes fail, the lost subdomain solutions and internal boundary conditions are still available in the local memory of the corresponding SS processes. The freshly restarted IS processes receive from the corresponding SS processes the original subdomain solutions and interface conditions, the same as before the failure occurred.
• If both the SS and IS processes that hold the same critical data fail, we may use the backward explicit scheme here with less overhead. Indeed, the interface solutions do not need to be retrieved by computation and are naturally available thanks to the PAASWR algorithm. As a matter of fact, no numerical error is further introduced in the standard scheme. Furthermore, the restriction to K = 9 between two consecutive checkpoints, as mentioned at the end of Section 4.2 in [17], can be relaxed and checkpoints can be taken less frequently.


In the next section, we describe different parallel implementation approaches that are built upon this structure of groups.

3.3. The parallel implementation approaches

Four approaches have been developed to take advantage of the SS and IS process concept. Moving from the basic version to the most sophisticated one, we present the pros and cons of each method.

3.3.1. The Blocking version
The Blocking version is a standard approach to start with. It is composed of five successive stages:

(1) The SS processes perform the first ASWR iteration in parallel on the entire domain (space and time).
(2) The SS processes send the interface solution to the unique IS process.
(3) The IS process solves the interface problem.
(4) The IS process sends back the new inner boundary values to the corresponding SS processes.
(5) The SS processes compute the second iterate using the exact boundary values.

While this algorithm is very easy to implement, it does not scale when the number of SS processes increases, since the interface problem size becomes larger. In addition to that, the blocking sends in stages (2) and (4) do not allow any overlap between computation and communication sections. It is also important to notice that the acceleration of the interface components at time step t^{n_1} does not depend on the interface components at later time steps t^{n_2}, n_2 > n_1. This property is reflected in the fact that the blocks of the matrix P are lower triangular matrices. The resolution of the linear system in the acceleration step can progress with the time stepping on the interface solutions. Therefore, the interface solutions should be sent as soon as they are computed.

The next approach resolves these issues by splitting the total number of time steps into equal size subsets or Time Windows (TWs).

3.3.2. The Blocking Windowed version
This method is the analog of the Blocking version with five stages as in Section 3.3.1. However, it is applied on TWs, which are a subset of the total number of time steps. Fig. 3 shows in two space dimensions and time how the procedure operates over the five stages with four SS processes and a single IS process. The vertical dashed lines within the rectangles correspond to the interface solutions.

Let us denote by M_TW the number of time steps of a specific TW. With this technique, there is an interface problem to solve at each TW, but their sizes are M/M_TW times smaller than the whole interface problem size generated in the Blocking version approach.

Fig. 3. The Blocking Windowed version.


Furthermore, the LU decomposition needs only to be computed for the first TW, since the matrix of the interface problem is time independent. For the next TWs, the update of the right-hand side with the previous interface solutions is sufficient to get the correct interface solutions.
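A sketch of this reuse with SciPy's LU routines: the factorization of Id − P_{j,k} is paid once for the first TW, and every later TW only triggers triangular solves. The coupling of consecutive windows through the right-hand side is abstracted away here.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

class WindowedInterfaceSolver:
    """Per-mode interface solver reused across time windows: the matrix
    Id - P_{j,k} is time independent, so it is factorized once (first TW) and
    only triangular solves are performed for the following TWs."""

    def __init__(self, P_jk):
        self.P = P_jk
        self.lu = lu_factor(np.eye(P_jk.shape[0]) - P_jk)   # paid once

    def solve_window(self, w0, w1):
        """Aitken acceleration (15) restricted to one time window."""
        return lu_solve(self.lu, w1 - self.P @ w0)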

On the other hand, the blocking communications seem to be the main bottleneck left. Indeed, the five stages are performed successively, which wastes CPU cycles on both groups in waiting for the interface data to come.

In the next approach, we introduce a feature to shorten the process idle time.

3.3.3. The Non-Blocking Windowed version
A new layer based on non-blocking communications is added to the previous versions. This introduces a significant overlap of the communications of subdomain boundaries with the computations of subdomain solutions. We end up with a pipelining strategy, which makes this method more efficient. For example, in Fig. 4, while the IS process is working on Stage 3 of the first TW, the SS process has started Stage 1 of the second TW. This continues throughout the time integration until reaching the last TW, for which a special treatment is necessary due to the pipelining technique.
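The SS-side loop below sketches this pipelining with mpi4py non-blocking calls: the interface trace of one TW is posted with Isend while the first iterate of the next TW is computed, and the accelerated boundary values are waited for only when the second iterate of that TW is due. The subdomain solves and the transfer of the initial condition from one window to the next are hidden behind placeholder callables, so this is a schematic illustration rather than the authors' implementation.

import numpy as np
from mpi4py import MPI

def ss_pipeline(world, is_rank, n_windows, first_iterate, second_iterate, ifc_shape):
    """Schematic SS-side loop of the Non-Blocking Windowed version.
    first_iterate(w) and second_iterate(w, bc) are placeholders for the
    Stage 1 and Stage 5 subdomain solves of time window w."""
    pending = []
    for w in range(n_windows):
        trace = first_iterate(w)                              # Stage 1
        sreq = world.Isend(trace, dest=is_rank, tag=w)        # Stage 2, non-blocking
        buf = np.empty(ifc_shape)
        rreq = world.Irecv(buf, source=is_rank, tag=w)        # post Stage 4 receive
        # keep the send buffer alive until the request completes
        pending.append((w, sreq, rreq, trace, buf))
        # the loop immediately moves on to window w+1 (pipelining): the IS
        # group runs Stage 3 for window w while Stage 1 of window w+1 runs here.
    for w, sreq, rreq, trace, buf in pending:
        sreq.Wait()
        rreq.Wait()                    # exact boundary values of window w arrived
        second_iterate(w, buf)         # Stage 5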

On the IS process side, the unique process available receives many messages at the same time and is, therefore, highly congested as the number of SS processes increases. The aim of the next version is to allocate more IS processes to handle, first, these simultaneous communications and, second, to parallelize the interface problem resolution.

3.3.4. The Parallel Non-Blocking Windowed version
Besides the parallelization in space coming from the domain decomposition in the SS group, this version appends a second level of parallelization in the IS group for the interface problem resolution. This is possible thanks to the nature of the interface problem. Indeed, all eigenvector components are given by the solution of a set of M_y × M_z completely decoupled parabolic problems. One can then distribute these problems straightaway on a set of IS processes, for example by allocating to each one M_y/n_IS × M_z problems, n_IS being the number of IS processes.

Fig. 5 presents the new framework. This method is very efficient by benefiting the most from the SS/IS concept. It also achieves robustness by exploiting very simple and systematic communication patterns. At the same time, it is very challenging to set up, since many ongoing communications have to be cautiously managed. Fig. 6 highlights the critical stages (2 and 4) where communications are involved, as described in Section 3.3.1. For simplicity's sake, we did not represent all the communication flows between the two process groups and we restricted ourselves to a single machine representation.

For instance, in Fig. 6a, the SS process 0 distributes its subdomain interface equally to all IS processes. This operation actually corresponds to the collective communication called MPI_SCATTER, where the current SS process and all the processes from the IS group are involved. The same communication scheme is repeated for each SS process. We have not integrated such functionality since collective operations are blocking and may dramatically slow down overall performance on a parallel machine when running with many processes, let alone on a grid. Therefore, we have instead developed our own non-blocking communication schemes, which offer more opportunities to overlap communication by computation.
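A sketch of such a hand-rolled non-blocking scatter: the modal interface data are split into n_IS contiguous blocks of roughly M_y/n_IS mode rows and one Isend is posted per IS process, so the SS process can keep computing while the messages are in flight. The array shapes and the block partition along the j axis are illustrative assumptions.

import numpy as np
from mpi4py import MPI

def scatter_modes_nonblocking(world, w_hat, is_ranks, tag):
    """Hand-rolled non-blocking 'scatter' used instead of MPI_SCATTER: the
    calling SS process splits its modal interface data w_hat, of assumed shape
    (My, Mz, M_TW), into len(is_ranks) contiguous blocks of mode rows and posts
    one Isend per IS process.  Returns the requests and the send buffers (the
    buffers must stay alive until the sends complete)."""
    blocks = np.array_split(w_hat, len(is_ranks), axis=0)   # ~My/n_IS rows each
    reqs, bufs = [], []
    for block, dest in zip(blocks, is_ranks):
        buf = np.ascontiguousarray(block)
        bufs.append(buf)
        reqs.append(world.Isend(buf, dest=dest, tag=tag))
    return reqs, bufs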

Fig. 4. The Non-Blocking Windowed version.

Fig. 5. The Parallel Non-Blocking Windowed version.

Fig. 6. Communication patterns.


Conversely, in Fig. 6b, the IS process 0 sends back the newly computed interface solutions to all SS processes. In fact, this operation corresponds to the collective communication called MPI_GATHER. The same communication scheme is repeated for each IS process. Again, for performance purposes, we did not take this direction and rather preferred to implement it in a non-blocking manner.

Further, the TW size M_opt must be estimated empirically to get optimal performance. Indeed, we need to find a compromise between the subdomain sizes in space and the number of time steps per TW to minimize the idle process time between the SS and IS groups.


In the next section, we present experimental results of the four approaches depending on the TW size.

4. Results

In this section, we report results of our experiments with the four different methods. The first tests were performed on a single parallel system to select the best approach to be used later for the grid. At the same time, we determined empirically the ideal M_opt to reduce the idle time. Each SS process is in charge of solving one subdomain and the ratio of SS/IS processes is fixed to 2:1 for the Parallel Non-Blocking Windowed version.

4.1. Tests on a single parallel machine

The experiments were performed on 24 SUN X2100 nodes, each with a 2.2 GHz dual core AMD Opteron processor and 2 GB main memory, connected through a Gigabit Ethernet network. The elapsed time in seconds is shown for each approach in Fig. 7a–d compared to the sequential code. The application runs for 25 or 26 time steps depending on the TW size, and the global mesh for the spatial discretization is a cube of dimension 96 in each direction. The Blocking approach does not scale at all when the number of subdomains increases. Dealing with all the time steps of the entire space–time domain generates a very large interface problem that is difficult to solve at the end with a unique IS process. By introducing TWs, the performance of the Blocking Windowed version scales better. In principle the blocking communication should be a limiting factor for performance. This may not be true on a parallel system with a high network speed. In our experiment, with the hardware configuration described above, the Non-Blocking Windowed version gives performance results very close to the Blocking Windowed version regardless of the TW size. The true limiting factor here is the size of the interface problem, which is too large to be solved by only one IS process.

Fig. 7. Experimental results: execution time in seconds versus the number of SS processes (2, 4, 8, 16) for the Sequential, Blocking, Blocking Windowed, Non-Blocking Windowed and Parallel Non-Blocking Windowed versions, with M = 25, a global size of 96 × 96 × 96, and window sizes 3, 4, 5 and 6 (panels a–d).

In this context, the parallelization of the interface problem appears to be inevitable. The Parallel Non-Blocking Windowed approach is the most efficient method and is capable of solving a problem with approximately one million unknowns and 25 time steps in 4.56 s with a window size of three. The application may not be affected by dealing with sendings and receivings of small messages so often, i.e. every three time steps. The network of the parallel system seems to handle frequent and small messages (M_TW = 3) better than less frequent and large messages (M_TW = 6).

In the following section, we experiment with the selected Parallel Non-Blocking Windowed algorithm on the grid and see whether a TW size of three time steps is also applicable for distributed computing environments.

4.2. Tests on the distributed grid

The computer platform for our experiment was as follows. We distributed the domain of computation among three different parallel computers: the Itanium 2 cluster (1.3 GHz) Atlantis in Houston (USA), the Xeon EM64T cluster (3.2 GHz) Cacau in Stuttgart (Germany) and the Itanium 2 cluster (1.6 GHz) Cluster150 in Moscow (Russia). This grid of computers is completely heterogeneous. One can notice that while this "grid" of computers is only comprised of three clusters of nodes, it is still a hardware configuration that contains the standard difficulties of working with the grid. The network that interconnects these three systems, located in three different countries, is the ordinary network with high latency and low bandwidth. The network performance was determined with the Ping-Pong benchmark, which evaluates point-to-point message passing operations (a send and a recv) on a pair of processes. Table 1 gives the minimal and maximal Ping-Pong bandwidth and latency.

This low performance of the network between sites is in general the main limitation to grid computing efficiency. Many scientific applications require heavy communications between peers and therefore cannot run properly on such a grid environment. It is certainly true that higher bandwidth can be achieved with special connections between remote sites, but it is unrealistic to assume better latency from one system to another when these systems are hosted by individual laboratories, as is often the case. We then have to deal with networks in grid computing whose latency is of the order of 1000 times larger than that of standard parallel systems.

The application runs for the same number of time steps as before, i.e. 25 or 26 time steps depending on the TW size. One must load balance the size of the domains on each system. Because of the simplicity of the mesh and the nature of the solver, which is a direct solver, we can simply use static load balancing as in [22]. Table 2 presents the data partitioning on each host per subdomain along the X space direction. Three different global sizes are presented: small (90 × 72 × 72), medium (120 × 72 × 72), and large (170 × 72 × 72). Cacau obtains the largest data allocation and seems to be the fastest machine.

To improve the performance of this implementation on the grid, there is one additional numerical tool that must be used. The solution of the heat equation problem is well known to be very smooth. In addition, it is inherent to the PAASWR algorithm for our discrete problem that the interfaces to be sent over the network are represented in Fourier space. Because the Fourier expansion is spectrally accurate, as opposed to the second order finite difference method, we can filter out half of the high frequency modes of the interface in each space direction without any numerical penalty on the overall accuracy. To be more specific, in the runs corresponding to Table 2, the number of Fourier coefficients sent for each interface is M_TW × 36 × 36. This technique further lowers the communication overhead between remote sites due to the slow network interconnect.
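A sketch of this mode filtering on the sending and receiving sides (the cutoff of half the modes per direction follows the text above; the helper names are ours):

import numpy as np

def truncate_modes(w_hat):
    """Keep only the low-frequency half of the interface modes in each space
    direction before sending them over the slow network (e.g. 36 x 36 out of
    72 x 72 modes in the runs of Table 2)."""
    My, Mz = w_hat.shape[0], w_hat.shape[1]
    return w_hat[:My // 2, :Mz // 2, ...].copy()

def pad_modes(w_low, My, Mz):
    """Zero-pad the truncated coefficients back to the full My x Mz mode set
    on the receiving side, before the inverse sine transform."""
    full = np.zeros((My, Mz) + w_low.shape[2:], dtype=w_low.dtype)
    full[:w_low.shape[0], :w_low.shape[1], ...] = w_low
    return full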

Table 1
The network performance between machines.

Bonds                 Bandwidth (Kbits/s)   Latency (ms)
Atlantis–Cacau        304–373               62.5–66.4
Cacau–Cluster150      546–808               27.3–29.3
Cluster150–Atlantis   71–127                83–84

Table 2
Local subdomain grid sizes in the X space direction.

M_opt               3     4     5     6
Cacau small         32    33    32    33
Cluster150 small    30    28    29    27
Atlantis small      28    29    29    30
Cacau medium        46    47    47    49
Cluster150 medium   44    41    42    38
Atlantis medium     40    42    41    43
Cacau large         60    62    62    65
Cluster150 large    59    54    55    50
Atlantis large      51    54    53    55


We could even decrease the ratio of SS/IS processes, compared to what we have in our experiment here, since the interface problems solved by the IS processes are now smaller. Fig. 8a–c represent the execution time in seconds with the different data grid sizes depending on the TW size. The number of subdomains is equally divided among the machines.

Fig. 8. PAASWR execution time in seconds versus the time window size (3–6) for the small, medium and large data sizes, with and without communications between the SS and IS groups (panels a–c).

Fig. 9. Scalability: execution time in seconds for the small, medium and large data sizes with 6, 12 and 24 subdomains.

We also show the overall performance when enabling/disabling communications between the SS and IS groups. One can notice that the higher the TW size, the lower the communication time and thus the better the execution time. PAASWR seems to be computationally very efficient. For example, it takes only 16.3 s to solve 24 subdomains on the grid with approximately 7 million unknowns in total and 25 time steps. The overhead due to communications on the network with the optimum performance, i.e. with a large TW size, is completely hidden and can be practically neglected.

Fig. 9 presents the nice scalability with the selected optimal TW size (M_opt) of six time steps. The elapsed time stays roughly identical when doubling the number of SS processes and keeping the same local subdomain sizes. While the heat equation test case is particularly simple, our implementation of PAASWR seems to be, to our knowledge, the first technique available today that combines scalability with high numerical efficiency on the grid.

5. Conclusions

In this paper, we have described how PAASWR can achieve scalability, robustness and fault tolerance in grid environments. The Parallel Non-Blocking Windowed methodology is the most efficient among all the techniques we tried. Indeed, the identification of five successive stages in the general algorithm permits us to apply a pipelining strategy and therefore to take advantage of the SS/IS process concept. The parallelization of the interface problem resolution makes PAASWR scalable as the number of subdomains increases. Furthermore, PAASWR is naturally fault tolerant and, in case of failures, can restart the computation from the subdomain solutions and the interface solutions located in the IS process main memory. The application will then terminate as if no failures had occurred.

A critical step in the generalization of this study would be to extend the algorithm to more complex approximation frameworks and parabolic operators. The companion paper [23] describes the Steffensen–Schwarz variant of the algorithm that applies to non-linear or non-separable problems. The performance of the numerical algorithm depends essentially on how well the trace transfer operator can be approximated by a linear one with known eigenvectors. This question is not trivial, in particular for unstructured finite element approximations. The PAASWR may thus have some potential in this situation as a preconditioner.

Acknowledgement

The authors thank the reviewers for their insightful comments, which greatly helped to improve the quality of this article.

References

[1] I. Foster, C. Kesselman (Eds.), The Grid 2: Blueprint for a New Computing Infrastructure, The Elsevier Series in Grid Computing, San Francisco, 2004.
[2] J. Dongarra, D. Gannon, G. Fox, K. Kennedy, The impact of multicore on computational science software, CTWatch Quarterly 3 (1) (2007).
[3] M. Garbey, A direct solver for the heat equation with domain decomposition in space and time, in: U. Langer et al. (Eds.), Domain Decomposition in Science and Engineering XVII, Springer, vol. 60, 2007, pp. 501–508.
[4] P. Henrici, Elements of Numerical Analysis, John Wiley and Sons Inc., 1964.
[5] J. Baranger, M. Garbey, F. Oudin, Recent development on Aitken–Schwarz method, Domain Decomposition Methods in Science and Engineering XIII (2002) 289–296.
[6] J. Baranger, M. Garbey, F. Oudin-Dardun, Acceleration of the Schwarz method: the Cartesian grid with irregular space step case, SIAM Journal on Scientific Computing 30 (5) (2008) 2566–2586.
[7] M. Garbey, Acceleration of the Schwarz method for elliptic problems, SIAM Journal on Scientific Computing 26 (6) (2005) 1871–1893.
[8] M. Garbey, D. Tromeur-Dervout, On some Aitken like acceleration of the Schwarz method, International Journal for Numerical Methods in Fluids 12 (40) (2002) 1493–1513.
[9] N. Barberou, M. Garbey, M. Hess, M.M. Resch, T. Rossi, J. Toivanen, D. Tromeur-Dervout, Efficient metacomputing of elliptic linear and non-linear problems, Journal of Parallel and Distributed Computing 63 (5) (2003) 564–577.
[10] M. Garbey, B. Hadri, W. Shyy, Fast elliptic solver for incompressible Navier Stokes flow and heat transfer problems on the grid, in: Forty Third Aerospace Sciences Meeting and Exhibit Conference, 2005.
[11] H.A. Schwarz, Gesammelte mathematische Abhandlungen 2, Vierteljahrsschrift der Naturforschenden Gesellschaft 15 (1870) 133–143.
[12] B. Smith, P. Bjorstad, W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996.
[13] S. Balay, W.D. Gropp, L.C. McInnes, B.F. Smith, PETSc users manual, Tech. Rep. ANL-95/11, Revision 2.1.5, Argonne National Laboratory, 2003.
[14] Y. Achdou, C. Japhet, Y. Maday, F. Nataf, A new cement to glue non-conforming grids with Robin interface conditions: the finite volume case, Numerische Mathematik 92 (4) (2002) 593–620.
[15] M. Gander, F. Magoules, F. Nataf, Optimized Schwarz methods without overlap for the Helmholtz equation, SIAM Journal on Scientific Computing 24 (1) (2002) 36–60.
[16] F. Magoules, F. Nataf, Optimized Schwarz methods: a general presentation, Domain Decomposition Methods: Theory and Applications 25 (2006) 1–36.
[17] H. Ltaief, E. Gabriel, M. Garbey, Fault tolerant algorithms for heat transfer problems, Journal of Parallel and Distributed Computing 68 (5) (2008) 663–677.
[18] M. Gander, H. Zhao, Overlapping Schwarz waveform relaxation for the heat equation in n dimensions, BIT 40 (4) (2000) 001–004.
[19] M. Garbey, D. Tromeur-Dervout, Aitken–Schwarz method on Cartesian grids, in: N. Debit, M. Garbey, R. Hoppe, J. Périaux, D. Keyes (Eds.), Proceedings of the International Conference on Domain Decomposition Methods DD13, CIMNE, 2002, pp. 53–65.
[20] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, University of Tennessee, Knoxville, TN, June 1995.
[21] MPI Forum, Special Issue: MPI2: A Message-Passing Interface Standard, International Journal of Supercomputer Applications and High Performance Computing 12 (1–2) (1998) 1–299.
[22] H. Ltaief, R. Keller, M. Garbey, M. Resch, A grid solver for reaction-convection-diffusion operators, University of Houston Preprint UH-CS-07-08, Journal of High Performance Computing Applications.
[23] M. Garbey, Acceleration of a Schwarz waveform relaxation method for parabolic problems, Preprint UH-CS-06-11, September 2006.