A Scalable Parallel Algorithm for Sparse Cholesky Factorization


Transcript of A Scalable Parallel Algorithm for Sparse Cholesky Factorization

A Scalable Parallel Algorithm for Sparse Cholesky Factorization

Anshul Gupta    Vipin Kumar
Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

Abstract

In this paper, we describe a scalable parallel algorithm for sparse Cholesky factorization, analyze its performance and scalability, and present experimental results of its implementation on a 1024-processor nCUBE2 parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm improves the state of the art in parallel direct solution of sparse linear systems by an order of magnitude, both in terms of speedups and the number of processors that can be utilized effectively for a given problem size. This algorithm incurs strictly less communication overhead and is more scalable than any known parallel formulation of sparse matrix factorization. We show that our algorithm is optimally scalable on hypercube and mesh architectures and that its asymptotic scalability is the same as that of dense matrix factorization for a wide class of sparse linear systems, including those arising in all two- and three-dimensional finite element problems.

1 Introduction

This paper describes a parallel algorithm for sparse Cholesky factorization that is optimally scalable for a wide class of sparse linear systems of practical interest. This algorithm incurs strictly less communication overhead than any known parallel formulation of sparse matrix factorization, and hence, can utilize a higher number of processors effectively. It is well known that dense matrix factorization can be implemented efficiently on distributed-memory parallel computers [7, 23, 18]. In this paper, we show that the parallel Cholesky factorization algorithm described here is as scalable as the best parallel formulation of dense matrix factorization on both mesh and hypercube architectures. We also show that our algorithm is equally scalable for sparse matrices arising from two- and three-dimensional finite element problems.

The performance and scalability analysis of our algorithm is supported by experimental results on up to 1024 processors of the nCUBE2 parallel computer.

(This work was supported by Army Research Office under contract # DA/DAAH04-93-G-0080 and by the University of Minnesota Army High Performance Computing Research Center under contract # DAAL03-89-C-0038. Access to the nCUBE2 at Sandia National Labs was provided via DeSRA consortium.)

[Figure 1 arranges parallel sparse factorization algorithms for two-dimensional N-node grid problems on p processors in a 2 x 2 grid of boxes, by partitioning (columnwise vs. both dimensions) and mapping (global vs. subtree-to-subcube):
    Box A - partitioning: 1-D (columnwise); mapping: global; communication overhead: O(Np log p); scalability (isoefficiency): Ω((p log p)^3).
    Box B - partitioning: 1-D; mapping: subtree-to-subcube; communication overhead: O(Np); scalability: Ω(p^3).
    Box C - partitioning: 2-D (both dimensions); mapping: global; communication overhead: O(N√p log p); scalability: Ω(p^1.5 (log p)^3).
    Box D - partitioning: 2-D; mapping: subtree-to-subcube; communication overhead: O(N√p); scalability: Θ(p^1.5).]
Figure 1: An overview of the performance and scalability of parallel algorithms for factorization of sparse matrices resulting from two-dimensional N-node grid graphs. Box D represents our algorithm.

We have been able to achieve speedups of up to 364 on 1024 processors and 230 on 512 processors over a highly efficient sequential implementation for fairly small problems. A recent implementation of a variation of this scheme with improved load balancing reports 46 GFLOPS on moderate sized problems on a 256-processor Cray T3D [17]. To the best of our knowledge, this is the first parallel implementation of sparse Cholesky factorization that has delivered speedups of this magnitude and has been able to benefit from over a thousand processors. Although we focus on Cholesky factorization of symmetric positive definite matrices in this paper, the ideas presented here can be adapted for sparse symmetric indefinite matrices [4] and nearly diagonally dominant matrices that are almost symmetric in structure [6, 13].

It is difficult to derive analytical expressions for the number of arithmetic operations in factorization for general sparse matrices. This is because the computation and fill-in during the factorization of a sparse matrix is a function of the number and position of nonzeros in the original matrix.

In the context of the important class of sparse matrices that are adjacency matrices of graphs whose n-node subgraphs have O(√n)-node separators (this class includes sparse matrices arising out of all two-dimensional finite difference and finite element problems), the contribution of this work can be summarized in Figure 1. A simple fan-out algorithm [9] with column-wise partitioning of an N × N matrix of this type on p processors results in an O(Np log N) total communication volume [11] (box A). The communication volume of the column-based schemes represented in box A has been improved using smarter ways of mapping the matrix columns onto processors, such as the subtree-to-subcube mapping [11] (box B). A number of column-based parallel factorization algorithms [21, 3, 25, 9, 8, 15, 14, 28] have a lower bound of O(Np) on the total communication volume. In [1], Ashcraft proposes a fan-both family of parallel Cholesky factorization algorithms that have a total communication volume of O(N√p log N). A few schemes with two-dimensional partitioning of the matrix have been proposed [22, 29, 27, 26], and the total communication volume in the best of these schemes [27, 26] is O(N√p log p) (box C).

In summary, the simple parallel algorithm with O(Np log p) communication volume (box A) has been improved along two directions: one by improving the mapping of matrix columns onto processors (box B), and the other by splitting the matrix along both rows and columns (box C). In this paper, we describe a parallel implementation of sparse matrix factorization that combines the benefits of improvements along both these lines. The total communication overhead of our algorithm is only O(N√p) for factoring an N × N matrix on p processors if it corresponds to a graph that satisfies the separator criterion. Our algorithm reduces the communication overhead by a factor of at least O(log p) over the best algorithm implemented to date. It is also significantly simpler in concept as well as in implementation, which helps in keeping the constant factors associated with the overhead term low. Furthermore, as we show in Section 4.3, this reduction in communication overhead by a factor of O(log p) results in an improvement in the isoefficiency metric of scalability [18, 12] by a factor of O((log p)^3); i.e., the rate at which the problem size must increase with the number of processors to maintain a constant efficiency is lower by a factor of O((log p)^3). This can make the difference between the feasibility and non-feasibility of parallel sparse factorization on highly parallel (p ≥ 256) computers.

2 The Multifrontal Algorithm for Sparse Matrix Factorization

The parallel formulation presented in this paper is based on the multifrontal algorithm for sparse matrix factorization, which is described in detail in [20, 13, 5]. In this section, we briefly describe a simplified version of multifrontal sparse Cholesky factorization.

Given a sparse matrix and the associated elimination tree, the multifrontal algorithm can be recursively formulated as shown in Figure 2. Consider the Cholesky factorization of an N × N sparse symmetric positive definite matrix A into LL^T, where L is a lower triangular matrix. The algorithm performs a postorder traversal of the elimination tree associated with A. There is a frontal matrix F^k and an update matrix U^k associated with any node k. The row and column indices of F^k correspond to the indices of row and column k of L in increasing order.

In the beginning, F^k is initialized to an (s + 1) × (s + 1) matrix, where s + 1 is the number of nonzeros in the lower triangular part of column k of A. The first row and column of this initial F^k is simply the upper triangular part of row k and the lower triangular part of column k of A. The remainder of F^k is initialized to all zeros. After the algorithm has traversed all the subtrees rooted at a node k, it ends up with a dense (t + 1) × (t + 1) frontal matrix F^k, where t < N is the number of nonzeros in the strictly lower triangular part of column k in L. If k is a leaf in the elimination tree of A, then the final F^k is the same as the initial F^k. Otherwise, the final F^k for eliminating node k is obtained by merging the initial F^k with the update matrices obtained from all the subtrees rooted at k via an extend-add operation. The extend-add is an associative and commutative operation described in Figure 4.

[Figure 4 shows the extend-add of two 3 × 3 lower triangular matrices. The first matrix carries row/column indices (i0, i1, i2) and entries
    i0:  a
    i1:  b  d
    i2:  c  e  f
and the second carries indices (i0, i2, i3) and entries
    i0:  g
    i2:  h  j
    i3:  i  k  l.
Their extend-add is the 4 × 4 lower triangular matrix with indices (i0, i1, i2, i3):
    i0:  a+g
    i1:  b    d
    i2:  c+h  e  f+j
    i3:  i    0  k    l.]
Figure 4: The extend-add operation on two 3 × 3 triangular matrices. It is assumed that i0 < i1 < i2 < i3.

After F^k has been assembled, a single step of the standard dense Cholesky factorization is performed with node k as the pivot (lines 8-12, Figure 2). At the end of the elimination step, the column with index k is removed from F^k and forms the column k of L. The remaining t × t matrix is called the update matrix U^k and is passed on to the parent of k in the elimination tree.
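The following is a minimal Python sketch of the extend-add operation on dense, index-labeled lower triangular blocks, in the spirit of Figure 4; the array-plus-index-list representation and the function name extend_add are illustrative choices for this example, not the paper's data structures.

    import numpy as np

    def extend_add(F1, idx1, F2, idx2):
        """Extend-add two dense lower triangular blocks whose rows/columns are
        labeled by sorted global indices idx1 and idx2 (compare Figure 4)."""
        merged = sorted(set(idx1) | set(idx2))        # union of the two index sets
        pos = {g: i for i, g in enumerate(merged)}    # global index -> local position
        R = np.zeros((len(merged), len(merged)))
        for F, idx in ((F1, idx1), (F2, idx2)):
            loc = [pos[g] for g in idx]               # where each block's entries land
            R[np.ix_(loc, loc)] += F                  # overlapping entries are summed
        return R, merged

    # The 3 x 3 example of Figure 4, with the symbolic entries a..f and g..l
    # replaced by small numbers so the result can be checked by eye.
    a, b, c, d, e, f = 1, 2, 3, 4, 5, 6
    g, h, i_, j, k, l = 10, 20, 30, 40, 50, 60
    F1 = np.array([[a, 0, 0], [b, d, 0], [c, e, f]], dtype=float)   # indices (i0, i1, i2)
    F2 = np.array([[g, 0, 0], [h, j, 0], [i_, k, l]], dtype=float)  # indices (i0, i2, i3)
    R, idx = extend_add(F1, [0, 1, 2], F2, [0, 2, 3])
    print(idx)   # [0, 1, 2, 3]
    print(R)     # R[0,0] == a+g, R[2,0] == c+h, R[2,2] == f+j, R[3,1] == 0, etc.

Only the lower triangles carry data here; an actual multifrontal code would keep the frontal and update matrices in a packed triangular format.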

    % A is a sparse N × N symmetric positive definite matrix to be factored. L is the lower triangular factor matrix
    % such that A = LL^T after factorization. A = (a(i,j)) and L = (l(i,j)), where 0 ≤ i, j < N. Initially, all l(i,j) = 0.
    1.  begin function Factor(k)
    2.      F^k :=
                a(k,k)   a(k,q1)  a(k,q2)  ...  a(k,qs)
                a(q1,k)  0        0        ...  0
                a(q2,k)  0        0        ...  0
                ...      ...      ...      ...  ...
                a(qs,k)  0        0        ...  0
    3.      for all i such that Parent(i) = k in the elimination tree of A, do
    4.      begin
    5.          Factor(i);
    6.          F^k := Extend_add(F^k, U^i);
    7.      end
    % At this stage, F^k is a (t+1) × (t+1) matrix, where t is the number of nonzeros in the sub-diagonal part of
    % column k of L. U^k is a t × t matrix. Assume that an index i of F^k or U^k corresponds to the index q_i of A and L.
    8.      for i := 0 to t do
    9.          L(q_i, k) := F^k(i, 0) / sqrt(F^k(0, 0));
    10.     for j := 1 to t do
    11.         for i := j to t do
    12.             U^k(i, j) := F^k(i, j) - L(q_i, k) * L(q_j, k);
    13. end function Factor.
Figure 2: An elimination-tree-guided recursive formulation of the multifrontal sparse Cholesky algorithm. If r is the root of the postordered elimination tree of A, then a call to Factor(r) factors the matrix A.

3 A Parallel Multifrontal Algorithm

In this section we describe the parallel multifrontal algorithm for a p-processor hypercube. The algorithm can also be adapted for a mesh topology without any increase in the asymptotic communication overhead [13]. On other architectures as well, such as the CM-5, IBM SP-2, and Cray T3D, the asymptotic expression for the communication overhead remains the same. In this paper, we use the term supernode for a group of consecutive nodes in the elimination tree with one child. Henceforth, any reference to the height, depth, or levels of the tree will be with respect to the supernodal tree. For the sake of simplicity, we assume that the supernodal elimination tree is a binary tree up to the top log p supernodal levels. Any elimination tree can be converted to a binary supernodal tree suitable for parallel multifrontal elimination by a simple preprocessing step.

In order to factorize the sparse matrix in parallel, portions of the elimination tree are assigned to processors using the standard subtree-to-subcube assignment strategy. This assignment is illustrated in Figure 3 for eight processors. With subtree-to-subcube assignment, all p processors in the system cooperate to factorize the frontal matrix associated with the topmost supernode of the elimination tree. The two subtrees of the root are assigned to subcubes of p/2 processors each. Each subtree is further partitioned recursively using the same strategy. Thus, the p subtrees at a depth of log p supernodal levels are each assigned to individual processors. A call to the function Factor given in Figure 2 with the root of a subtree as the argument generates the update matrix associated with that subtree. This update matrix contains all the information that needs to be communicated from the subtree in question to other columns of the matrix.
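The following is a small Python sketch of this subtree-to-subcube assignment; the tree representation (a dict of children lists), the function name assign_subcubes, and the tiny example tree are illustrative assumptions made for this example, not the paper's data structures.

    def assign_subcubes(tree, root, procs, mapping=None):
        """Recursively assign processor subcubes to the supernodes of a
        supernodal elimination tree that is binary at the top levels."""
        if mapping is None:
            mapping = {}
        mapping[root] = procs                  # every processor of the subcube shares this supernode
        children = tree.get(root, [])
        if len(procs) == 1 or len(children) < 2:
            # a single processor factors its whole subtree with the serial Factor()
            for child in children:
                assign_subcubes(tree, child, procs, mapping)
            return mapping
        half = len(procs) // 2                 # split the subcube into two halves
        assign_subcubes(tree, children[0], procs[:half], mapping)
        assign_subcubes(tree, children[1], procs[half:], mapping)
        return mapping

    # A 7-supernode binary tree mapped onto 4 processors: the root is shared by
    # all four, its children by pairs, and each leaf subtree by one processor.
    tree = {6: [4, 5], 4: [0, 1], 5: [2, 3]}
    print(assign_subcubes(tree, 6, [0, 1, 2, 3]))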

After the independent factorization phase, pairs of processors (P2j and P2j+1 for 0 ≤ j < p/2) perform a parallel extend-add on their update matrices, say Q and R, respectively. At the end of this parallel extend-add operation, P2j and P2j+1 roughly equally share Q + R. Here, and in the remainder of this paper, the sign "+" in the context of matrices denotes an extend-add operation. More precisely, all even columns of Q + R go to P2j and all odd columns of Q + R go to P2j+1. At the next level, subcubes of two processors each perform a parallel extend-add. Each subcube initially has one update matrix. The matrix resulting from the extend-add on these two update matrices is now merged and split among four processors. To effect this split, all even rows are moved to the subcube with the lower processor labels, and all odd rows are moved to the subcube with the higher processor labels. During this process, each processor needs to communicate only once with its counterpart in the other subcube. After this (second) parallel extend-add, each of the processors has a block of the update matrix roughly one-fourth the size of the whole update matrix.

Note that both the rows and the columns of the update matrix are distributed among the processors in a cyclic fashion. Similarly, in subsequent parallel extend-add operations, the update matrices are alternately split along the columns and rows. This distribution is fairly straightforward to maintain. For example, during the first two parallel extend-add operations, columns and rows of the update matrices are distributed depending on whether their least significant bit (LSB) is 0 or 1. Indices with LSB = 0 go to the lower subcube and those with LSB = 1 go to the higher subcube. Similarly, in the next two parallel extend-add operations, columns and rows of the update matrices are exchanged among the processors depending on the second LSB of their indices.
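A minimal Python sketch of the resulting index-to-processor map: successive splits consume one low-order bit of the column index, then one of the row index, and so on. The way the consumed bits are packed into a processor rank is an assumption made here for illustration; the paper only specifies that columns and rows are split alternately by their low-order bits.

    def owner_of(row, col, num_splits):
        """Rank (within a subcube of 2**num_splits processors) that owns entry
        (row, col) of an update matrix after num_splits alternating splits:
        the 1st, 3rd, ... splits divide columns, the 2nd, 4th, ... divide rows."""
        rank = 0
        col_bits = 0
        row_bits = 0
        for s in range(num_splits):
            if s % 2 == 0:                     # 1st, 3rd, ... splits: column index bits
                bit = (col >> col_bits) & 1
                col_bits += 1
            else:                              # 2nd, 4th, ... splits: row index bits
                bit = (row >> row_bits) & 1
                row_bits += 1
            rank |= bit << s
        return rank

    # After four splits (a 16-processor subcube), rows and columns end up
    # distributed cyclically over a 4 x 4 logical processor mesh:
    for i in range(4):
        print([owner_of(i, j, 4) for j in range(4)])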

[Figure 3 shows a 19 × 19 symmetric sparse matrix (the original nonzeros and the fill-ins are marked with two different symbols) and its associated elimination tree, whose nodes 0-18 span levels 0 through 3. With the subtree-to-subcube mapping, the leaf subtrees at level 3 are assigned to the individual processors P0 through P7, the level-2 subtrees to the two-processor subcubes {0,1}, {2,3}, {4,5}, {6,7}, the level-1 subtrees to the four-processor subcubes {0,1,2,3} and {4,5,6,7}, and the topmost supernode to all eight processors.]
Figure 3: A symmetric sparse matrix and the associated elimination tree with subtree-to-subcube mapping onto 8 processors.

Assume that the levels of the binary supernodal elimination tree are labeled starting with 0 at the top (Figure 3). In general, at level l of the supernodal elimination tree, 2^(log p - l) processors work on a single frontal or update matrix. These processors form a logical 2^⌊(log p - l)/2⌋ × 2^⌈(log p - l)/2⌉ mesh. The cyclic distribution of rows and columns of these matrices among the processors helps maintain load balance. The distribution also ensures that a parallel extend-add operation can be performed with each processor exchanging roughly half of its data only with its counterpart processor in the other subcube.

Between two successive parallel extend-add operations, several steps of dense Cholesky elimination may be performed. The number of such successive elimination steps is equal to the number of nodes in the supernode being processed. The communication that takes place in this phase is the standard communication in pipelined grid-based dense Cholesky factorization [23, 18]. If the average size of the frontal matrices is t × t during the processing of a supernode with m nodes on a q-processor subcube, then O(m) messages of size O(t/√q) are passed through the grid in a pipelined fashion. Figure 5 shows the communication for one step of dense Cholesky factorization of a hypothetical frontal matrix for q = 16.
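To make the message sizes concrete, the short sketch below counts how many sub-diagonal entries of the pivot column fall into each row of a √q × √q logical mesh under the cyclic distribution; each such piece is held by one processor of the mesh column owning the pivot, and its size is roughly the message size of one pipelined step. The function name and the use of mesh rows as the counting unit are assumptions made for this example.

    import math
    from collections import Counter

    def pivot_column_pieces(t, q):
        """For a t x t frontal matrix cyclically distributed over a sqrt(q) x sqrt(q)
        mesh, count the sub-diagonal entries of the pivot column (column 0) held
        in each mesh row; each count is roughly one pipelined message size."""
        side = int(math.isqrt(q))
        counts = Counter(i % side for i in range(1, t))   # row i lives in mesh row i mod side
        return dict(counts)

    # A 12 x 12 frontal matrix on 16 processors (as in Figure 5): each mesh row
    # holds roughly t/sqrt(q) = 3 entries of the pivot column.
    print(pivot_column_pieces(12, 16))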

It is shown in [18] that although this communication does not take place between the nearest neighbors on a subcube, the paths of all communications on any subcube are conflict-free with e-cube routing [18] and cut-through or worm-hole flow control. This is a direct consequence of the fact that a circular shift is conflict-free on a hypercube with e-cube routing. Thus, a communication pipeline can be maintained among the processors of a subcube during the dense Cholesky factorization of frontal matrices.

Further details of the parallel algorithm can be found in [13].

4 Communication Overhead and Scalability Analysis

In this section, we analyze the communication overhead and the scalability of our algorithm on a hypercube for sparse matrices resulting from a finite difference operator on regular two-dimensional grids. Within constant factors, these expressions can be generalized to all sparse matrices that are adjacency matrices of graphs whose n-node subgraphs have an O(√n)-node separator, because the properties of separators can be generalized from grids to all such graphs within the same order of magnitude bounds [19, 10]. In [13], we derive expressions for the communication overhead of this algorithm for three-dimensional grids and for the mesh architecture and show that the algorithm is optimally scalable in these cases as well.

[Figure 5 depicts a 12 × 12 frontal matrix distributed over 16 processors, showing the matrix indices, the processor numbers, the horizontal and vertical communication performed in one elimination step, and the associated horizontal and vertical subcubes of processors: (0,1,4,5), (2,3,6,7), (8,9,12,13), (10,11,14,15) and (0,2,8,10), (1,3,9,11), (4,6,12,14), (5,7,13,15).]
Figure 5: The two communication operations involved in a single elimination step (index of pivot = 0 here) of Cholesky factorization on a 12 × 12 frontal matrix distributed over 16 processors.

The parallel multifrontal algorithm described in Section 3 incurs two types of communication overhead: one during parallel extend-add operations and the other during the steps of dense Cholesky factorization while processing the supernodes. Crucial to estimating the communication overhead is estimating the sizes of frontal and update matrices at any level of the supernodal elimination tree.

Consider a √N × √N regular finite difference grid. We analyze the communication overhead for factorizing the N × N sparse matrix associated with this grid on p processors. In order to simplify the analysis, we assume a somewhat different form of nested dissection than the one used in the actual implementation. This method of analyzing the communication complexity of sparse Cholesky factorization has been used in [11]. Within very small constant factors, the analysis holds for the standard nested dissection [10] of grid graphs. We consider a cross-shaped separator (described in [11]) consisting of 2√N - 1 nodes that partitions the N-node square grid into four square subgrids of size (√N - 1)/2 × (√N - 1)/2. We call this the level-0 separator that partitions the original grid (or the level-0 grid) into four level-1 grids. The subgrids are further partitioned in the same way, and the process is applied recursively. The supernodal elimination tree corresponding to this ordering is such that each non-leaf supernode has four child supernodes. The topmost supernode has 2√N - 1 (≈ 2√N) nodes, and the size of the supernodes at each subsequent supernodal level of the tree is half of the supernode size at the previous level. Clearly, the number of supernodes increases by a factor of four at each level, starting with one at the top (level 0).
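A small Python sketch of the resulting per-level quantities: the function below simply evaluates the approximate formulas stated in this section (subgrid side, separator/supernode length, and supernode count at level l) and is not part of the paper's implementation.

    import math

    def level_stats(N, l):
        """Approximate per-level quantities for the cross-shaped nested
        dissection of a sqrt(N) x sqrt(N) grid (Section 4)."""
        side = math.sqrt(N) / 2**l           # side of a level-l subgrid
        separator = 2 * math.sqrt(N) / 2**l  # nodes in a level-l separator / supernode length
        supernodes = 4**l                    # number of supernodes at level l
        return side, separator, supernodes

    # Example: a grid with N = 2**20 nodes over the top few levels.
    for l in range(4):
        side, sep, count = level_stats(2**20, l)
        print(f"level {l}: subgrid ~{side:.0f} x {side:.0f}, supernode length ~{sep:.0f}, supernodes {count}")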

The nested dissection scheme described above has the following properties: (1) the size of level-l subgrids is approximately √N/2^l × √N/2^l, and (2) the number of nodes in a level-l separator is approximately 2√N/2^l, and hence, the length of a supernode at level l of the supernodal elimination tree is approximately 2√N/2^l.

It has been proved in [11] that the number of nonzeros that an i × i subgrid can contribute to the nodes of its bordering separators is bounded by k·i^2, where k = 341/12. Hence, a level-l subgrid can contribute at most kN/4^l nonzeros to its bordering nodes. These nonzeros are in the form of the triangular update matrix that is passed along from the root of the subtree corresponding to the subgrid to its parent in the elimination tree. The dimensions of a matrix with a dense triangular part containing kN/4^l entries are roughly √(2kN)/2^l × √(2kN)/2^l. Thus, the size of an update matrix passed on to level l - 1 of the supernodal elimination tree from level l is roughly upper-bounded by √(2kN)/2^l × √(2kN)/2^l for l ≥ 1.

The length of a level-l supernode is 2√N/2^l; hence, a total of 2√N/2^l elimination steps take place while the computation proceeds from the bottom of a level-l supernode to its top. A single elimination step on a frontal matrix of size (t + 1) × (t + 1) produces an update matrix of size t × t. Since the size of a frontal matrix at the top of a level-l supernode is at most √(2kN)/2^l × √(2kN)/2^l, the size of the frontal matrix at the bottom of the same supernode is upper-bounded by (√(2k) + 2)√N/2^l × (√(2k) + 2)√N/2^l. Hence, the average size of a frontal matrix at level l of the supernodal elimination tree is upper-bounded by (√(2k) + 1)√N/2^l × (√(2k) + 1)√N/2^l. Let α = √(2k) + 1. Then α√N/2^l × α√N/2^l is an upper bound on the average size of a frontal matrix at level l.
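For reference, the level-l size bounds just derived can be collected in display form (this is only a restatement of the bounds above, with k = 341/12 and α = √(2k) + 1):

    \[
    \begin{aligned}
    \text{update matrix passed up from level } l:\quad &
      \frac{\sqrt{2kN}}{2^{l}} \times \frac{\sqrt{2kN}}{2^{l}}, \qquad k = \tfrac{341}{12},\\
    \text{frontal matrix at the bottom of a level-}l\text{ supernode}:\quad &
      \bigl(\sqrt{2k}+2\bigr)\frac{\sqrt{N}}{2^{l}} \times \bigl(\sqrt{2k}+2\bigr)\frac{\sqrt{N}}{2^{l}},\\
    \text{average frontal matrix at level } l:\quad &
      \alpha\,\frac{\sqrt{N}}{2^{l}} \times \alpha\,\frac{\sqrt{N}}{2^{l}},
      \qquad \alpha = \sqrt{2k}+1 = \sqrt{\tfrac{341}{6}}+1 .
    \end{aligned}
    \]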

4.1 Overhead in parallel extend-add

Before the computation corresponding to level l - 1 of the supernodal elimination tree starts, a parallel extend-add operation is performed on the lower triangular portions of update matrices of size √(2kN)/2^l × √(2kN)/2^l, each of which is distributed on a (√p/2^l) × (√p/2^l) logical mesh of processors. Thus, each processor holds roughly (kN/4^l)/(p/4^l) = kN/p elements of an update matrix. Assuming that each processor exchanges roughly half of its data with the corresponding processor of another subcube, t_s + t_w kN/(2p) time is spent in communication, where t_s is the message startup time and t_w is the per-word transfer time. Note that this time is independent of l. Since there are (log p)/2 levels at which parallel extend-add operations take place, the total communication time for these operations is O((N/p) log p) on a hypercube. The total communication overhead due to the parallel extend-add operations is O(N log p) on a hypercube.

4.2 Overhead in factorization steps

We have shown earlier that the average size of a frontal matrix at level l of the supernodal elimination tree is bounded by α√N/2^l × α√N/2^l, where α = √(341/6) + 1. This matrix is distributed on a (√p/2^l) × (√p/2^l) logical mesh of processors. As shown in Figure 5, there are two communication operations involved with each elimination step of dense Cholesky. The average size of a message is (α√N/2^l)/(√p/2^l) = α√(N/p). It can be shown [23, 18] that in a pipelined implementation on a √q × √q mesh of processors, the communication time for s elimination steps with an average message size of m is O(ms). In our case, at level l of the supernodal elimination tree, the number of steps of dense Cholesky is 2√N/2^l. Thus the total communication time at level l is α√(N/p) × 2√N/2^l = O((N/√p)(1/2^l)). The total communication time for the elimination steps at the top (log p)/2 levels of the supernodal elimination tree is O((N/√p) Σ_{l=0..log_4(p)-1} (1/2^l)). This has an upper bound of O(N/√p). Hence, the total communication overhead due to the elimination steps is O(N√p).

The parallel multifrontal algorithm incurs an additional overhead of emptying the pipeline log p times (once before each parallel extend-add) and then refilling it. It can be easily shown that this overhead is smaller in magnitude than the O(N√p) communication overhead of the dense Cholesky elimination steps [13].
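The totals quoted above follow from summing the per-level times; the display below only spells out the geometric sum, using the quantities already defined in Sections 4.1 and 4.2:

    \[
    T_{\text{ext-add}} \;=\; \frac{\log p}{2}\left(t_s + t_w\,\frac{kN}{2p}\right)
    \;=\; O\!\left(\frac{N}{p}\log p\right),
    \qquad
    T_{\text{elim}} \;=\; \sum_{l=0}^{\log_4 p - 1} \frac{2\alpha N}{\sqrt{p}\,2^{l}}
    \;<\; \frac{4\alpha N}{\sqrt{p}} \;=\; O\!\left(\frac{N}{\sqrt{p}}\right).
    \]

Multiplying these per-processor times by the p processors gives the total overheads of O(N log p) for the parallel extend-add operations and O(N√p) for the elimination steps.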

4.3 Scalability analysis

The scalability of a parallel algorithm on a parallel architecture refers to the capacity of the algorithm-architecture combination to effectively utilize an increasing number of processors. In this section we use the isoefficiency metric [18, 12] to characterize the scalability of our algorithm.

Let W be the size of a problem in terms of the total number of basic operations required to solve a problem on a serial computer. The serial run time of a problem of size W is given by T_S = t_c W, where t_c is the time to perform a single basic computation step. If T_P is the parallel run time of the same problem on p processors, then we define an overhead function T_o as pT_P - T_S. Both T_P and T_o are functions of W and p, and we often write them as T_P(W, p) and T_o(W, p), respectively. The efficiency of a parallel system with p processors is given by E = T_S/(T_S + T_o(W, p)). If a parallel system is used to solve a problem instance of a fixed size W, then the efficiency decreases as p increases. For many parallel systems, for a fixed p, if the problem size W is increased, then the efficiency increases. For these parallel systems, the efficiency can be maintained at a desired value (between 0 and 1) for increasing p, provided W is also increased.

Given that E = 1/(1 + T_o(W, p)/(t_c W)), in order to maintain a fixed efficiency, W should be proportional to T_o(W, p). In other words, the following relation must be satisfied in order to maintain a fixed efficiency:

    W = (e/t_c) T_o(W, p),    (1)

where e = E/(1 - E) is a constant depending on the efficiency to be maintained. Equation (1) is the central relation that is used to determine the isoefficiency function of a parallel algorithm-architecture combination.

It is well known [10] that the total work involved in factoring the adjacency matrix of an N-node graph with an O(√N)-node separator using nested dissection ordering of nodes is O(N^1.5). We have shown in Section 4 that the overall communication overhead of our scheme is O(N√p). From Equation 1, a fixed efficiency can be maintained if and only if N^1.5 ∝ N√p, or √N ∝ √p, or N^1.5 = W ∝ p^1.5. In other words, the problem size must be increased as O(p^1.5) to maintain a constant efficiency as p is increased. In comparison, a lower bound on the isoefficiency function of Rothberg and Gupta's scheme [27], with a communication overhead of at least O(N√p log p), is O(p^1.5 (log p)^3). The isoefficiency function of any column-based scheme is at least O(p^3) because the total communication overhead has a lower bound of O(Np). Thus, the scalability of our algorithm is superior to that of the other schemes.

A lower bound on the isoefficiency function for dense matrix factorization is O(p^1.5) [18]. The factorization of a sparse matrix involves factoring a dense matrix, and the amount of computation required to factor this dense matrix is of the same order as the amount of computation required to factor the entire sparse matrix [10, 11]. Hence, O(p^1.5) is a lower bound on the isoefficiency function of sparse matrix factorization, and our algorithm achieves this bound.
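As a quick numerical illustration of this isoefficiency argument, the toy Python model below plugs W = N^1.5 and a communication overhead proportional to N√p into E = 1/(1 + T_o/(t_c W)); the constants t_c and c are arbitrary values chosen for the example and do not come from the paper.

    def efficiency(N, p, tc=1.0, c=5.0):
        """Toy model: E = 1 / (1 + To/(tc*W)) with W = N**1.5 and To = c*N*sqrt(p)."""
        W = N ** 1.5
        To = c * N * p ** 0.5
        return 1.0 / (1.0 + To / (tc * W))

    # Growing N in proportion to p (i.e., W proportional to p**1.5) keeps E constant:
    for p in (64, 256, 1024):
        N = 1000 * p
        print(p, round(efficiency(N, p), 3))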

Matrix: GRID127x127; N = 16129; NNZ = 518.58 thousand; FLOP = 36.682 million

    p               1       4       8      16      32      64     128     256
    Time        58.86   15.07   8.878   5.155   2.971   1.783   1.171  0.8546
    Speedup      1.00    3.90    6.63    11.4    19.8    33.0    50.3    68.9
    Efficiency 100.0%   97.6%   82.8%   71.0%   61.9%   51.6%   39.3%   26.9%

Matrix: GRID255x127; N = 32385; NNZ = 1140.6 thousand; FLOP = 100.55 million

    p               1      16      32      64     128     256     512    1024
    Time       149.15   12.22   6.651   3.861   2.349   1.557   1.091  0.8357
    Speedup      1.00    12.2    22.4    38.6    63.5    95.8   136.7   178.5
    Efficiency 100.0%   76.3%   70.1%   60.4%   49.6%   37.4%   26.7%   17.4%

Figure 6: Experimental results for factoring sparse symmetric positive definite matrices associated with a 9-point difference operator on rectangular grids. All times are in seconds.

5 Experimental Results

We implemented the parallel multifrontal algorithm described in this paper on the nCUBE2 parallel computer and have gathered some preliminary speedup results for two classes of sparse matrices, which we summarize in Figures 6 and 8. The reader is referred to [13] for more detailed experimental results and their interpretations.

We conducted one set of experiments on sparse matrices associated with a 9-point difference operator on rectangular grids (Figure 6). Matrix GRIDixj in the figure refers to the sparse matrix obtained from an i × j grid. The standard nested dissection ordering [10] was used for these matrices. The purpose of these experiments was to compare their results with the scalability analysis in Section 4.3. From our experiments on the 2-D grids, we selected a few points of equal efficiency and plotted the W versus p curve, which is shown by the solid line in Figure 7 for E ≈ 0.31. The problem size W is measured in terms of the total number of floating point operations required for factorization. The dotted and dashed curves indicate the problem sizes that will yield the same efficiency for p = 128, 256, and 512 if the isoefficiency function is O(p^1.5) and O(p^1.5 (log p)^3), respectively.

Figure 7 shows that our experimental isoefficiency curve is considerably better than O(p^1.5 (log p)^3), which is a lower bound on the isoefficiency function of the previously best known (in terms of total communication volume) parallel algorithm [27] for sparse matrix factorization. However, it is worse than O(p^1.5), which is the asymptotic isoefficiency function derived in Section 4.3. There are two main reasons for this.

First, the O(p^1.5) isoefficiency function does not take load imbalance into account. It has been shown in [21] that even if a grid graph is perfectly partitioned in terms of the number of nodes, the work load associated with each partition varies. Another reason is that the efficiencies of the parallel implementation are computed with respect to a very efficient serial implementation. In our implementation, the computation associated with the subtrees below level log p in the supernodal elimination tree is handled by the serial code. However, the computation above this level is handled by a separate code. In our preliminary implementation, this part of the code is less efficient than the serial code (disregarding communication) due to additional bookkeeping, which has a potential for optimization. For example, the total time spent by all the processors participating in a parallel extend-add operation, besides message passing, is more than the time taken to perform extend-add on the same update matrices on a single processor. The same is true for the dense factorization steps too. However, despite these inefficiencies, our implementation is more scalable than a hypothetical ideal implementation (with perfect load balance) of the currently best known parallel algorithm for sparse Cholesky factorization.

In Figure 8 we summarize the results of factoring some matrices from the Harwell-Boeing collection of sparse matrices. The purpose of these experiments was to demonstrate that our algorithm can deliver good speedups on hundreds of processors for practical problems. Spectral nested dissection (SND) [24] was used to order the matrices in Figure 8.

From the experimental results in Figures 6 and 8, we can infer that our algorithm can deliver substantial speedups, even on moderate problem sizes. These speedups are computed with respect to a very efficient serial implementation of the multifrontal algorithm.

[Figure 7 plots the problem size W (in millions of floating point operations, 0 to 350) against the number of processors p (about 50 to 550), showing the experimental equal-efficiency curve together with the O(p^1.5) and O(p^1.5 (log p)^3) curves.]
Figure 7: Comparison of our experimental isoefficiency curve with the O(p^1.5) curve (the theoretical asymptotic isoefficiency function of our algorithm due to communication overhead on a hypercube) and with the O(p^1.5 (log p)^3) curve (the lower bound on the isoefficiency function of the best known parallel sparse factorization algorithm until now). The four data points on the curves correspond to the matrices GRID63x63, GRID103x95, GRID175x127, and GRID223x207.

To lend credibility to our speedup figures, we compared the run times of our program on a single processor with the single processor run times given for iPSC/2 in [25]. The nCUBE2 processors are about 2 to 3 times faster than iPSC/2 processors, and our serial implementation, with respect to which the speedups are computed, is 4 to 5 times faster than the one in [25]. Our single processor run times are four times less than the single processor run times on iPSC/2 reported in [2]. We also found that for some matrices (e.g., that from a 127 × 127 9-point finite difference grid), our implementation on eight nCUBE2 processors (8.9 seconds) is faster than the 16-processor iPSC/860 implementation (9.7 seconds) reported in [29], although the iPSC/860 has much higher computation speeds.

The factorization algorithm as described in this paper requires binary supernodal elimination trees that are fairly balanced. After obtaining the ordered matrix and the corresponding elimination tree, we run the elimination tree through a very fast tree balancing algorithm, which is described in [13]. Figure 8 also gives the upper bound on efficiency due to load imbalance for different values of p for the BCSSTK15, BCSSTK25, and BCSSTK29 matrices. It is evident from the load balance values given in Figure 8 that a combination of spectral nested dissection with our tree balancing algorithm results in very respectable load balances for up to 1024 processors. The number of nonzeros in the triangular factor (NNZ) and the number of floating point operations (FLOP) reported in Figure 8 are for the single processor case. As the number of processors is increased, the tree balancing algorithm is applied to more levels (log p) of the supernodal elimination tree, and consequently, the total NNZ and FLOP increase.

Thus, an efficiency of x% in Figure 8 indicates that there is a (100 - x)% loss, which includes three factors: communication, load imbalance, and extra work. For example, the efficiency for BCSSTK25 on 64 processors is 57%; i.e., there is a 43% overhead. However, only 5% overhead is due to communication and extra work. The remaining 38% overhead is due to load imbalance.

6 Concluding Remarks

In this paper, we analytically and experimentally demonstrate that scalable parallel implementations of direct methods for solving large sparse systems are possible. In [28], Schreiber concludes that it is not yet clear whether sparse direct solvers can be made competitive at all for highly (p ≥ 256) and massively (p ≥ 4096) parallel computers. We hope that, through this paper, we have given an affirmative answer to at least a part of the query.

Our algorithm is a significant improvement over other parallel algorithms for sparse matrix factorization. This improvement has been realized by reducing the order of the communication overhead. An efficient parallel sparse matrix factorization algorithm requires the following: (1) the matrix must be partitioned among the processors along both rows and columns, because pure column-based partitioning schemes have much higher communication overhead [28, 27]; (2) the distribution of rows and columns of the matrix among the processors must follow some sort of cyclic order to ensure proper load balance; and (3) communication must be localized among as few processors as possible at every stage of elimination by following a subtree-to-subcube type work distribution strategy [15, 11]. Ours is the only implementation we know of that satisfies all of the above conditions.

Matrix: BCSSTK15; N = 3948; NNZ = 488.8 thousand; FLOP = 85.55 million

    p                  1       2       4       8      16      32      64     128     256
    Time          103.73   52.63   26.66   14.88    8.29    4.98    3.20   2.156   1.530
    Speedup         1.00    1.97    3.89    6.97    12.5    20.8    32.4    48.1    67.8
    Efficiency    100.0%   98.5%   97.3%   87.1%   78.2%   65.1%   50.7%   37.6%   26.5%
    Load balance    100%     99%     98%     91%     91%     87%     87%     84%     84%

Matrix: BCSSTK25; N = 15439; NNZ = 1940.3 thousand; FLOP = 512.88 million

    p                  1       2       4       8      16      32      64     128     256     512
    Time           588.5  301.23  184.84   74.71   52.29   30.01   16.66   10.38    6.64    4.53
    Speedup         1.00    1.95    3.18    6.21    11.3    19.6    35.3    56.7    88.6   129.9
    Efficiency    100.0%   97.7%   79.6%   77.7%   70.3%   61.3%   57.0%   44.3%   34.6%   25.4%
    Load balance    100%     98%     80%     78%     71%     63%     62%     62%     62%     61%

Matrix: BCSSTK29; N = 13992; NNZ = 2174.46 thousand; FLOP = 609.08 million

    p                  1       2       4       8      16      32      64     128     256     512    1024
    Time           704.0   359.7   212.9  110.45   55.06   31.36   19.22   12.17   7.667   4.631   3.119
    Speedup         1.00    1.96    3.31    6.37    12.8    22.5    36.6    57.9    91.8   152.6   225.6
    Efficiency    100.0%   97.9%   82.7%   79.7%   79.9%   70.2%   57.2%   45.2%   35.9%   29.8%   22.0%
    Load balance    100%     98%     83%     82%     84%     82%     77%     72%     68%     72%     72%

Matrix: BCSSTK31; N = 35588; NNZ = 6458.34 thousand; FLOP = 2583.6 million

    p                  1       2       4       8      16      32      64     128     256     512    1024
    Time         3358.0*  1690.7   924.6   503.0   262.0   134.3   73.57   42.02   24.58  14.627   9.226
    Speedup         1.00    1.99    3.63    6.68    12.8    25.0    45.6    79.9   136.6   229.6   364.2
    Efficiency    100.0%   99.3%   90.8%   83.4%   80.1%   78.1%   71.3%   62.4%   53.4%   44.8%   35.6%
Figure 8: Experimental results for factoring some sparse symmetric positive definite matrices resulting from 3-D problems in structural engineering. All times are in seconds. The single processor run time suffixed by "*" was estimated by timing different parts of factorization on two processors. The percentage load balance in the last row of the first three tables gives an upper bound on efficiency due to load imbalance.

The analysis and a preliminary implementation on nCUBE2 presented in this paper demonstrate the feasibility of using highly parallel computers for numerical factorization of sparse matrices. In [16], we have applied our algorithm to obtain a parallel formulation of interior point algorithms and have observed significant speedups in solving linear programming problems. Although we have observed through our experiments (Section 5) that the upper bound on efficiency due to load imbalance does not fall below 60-70% for hundreds of processors, even this bound can be improved further. In [17], Karypis and Kumar relax the subtree-to-subcube mapping to a subforest-to-subcube mapping, which significantly reduces load imbalances at the cost of a little increase in communication. Their variation of the algorithm achieves 46 GFLOPS on a 256-processor Cray T3D for moderate size problems, all of which can be solved in less than 2 seconds on the 256-processor Cray T3D. Currently, an implementation of our parallel multifrontal algorithm is being developed for the IBM SP-1 and SP-2 parallel computers and a cluster of high performance SGI workstations. Compared to other parallel sparse matrix factorization algorithms, our algorithm is much better suited for such platforms because it has the least communication overhead, and communication is slow on these platforms relative to the speeds of their CPUs.

Acknowledgments

We wish to thank Dr. Alex Pothen for his guidance with spectral nested dissection ordering. George Karypis gave several useful comments and suggestions in the development of the parallel multifrontal algorithm.

We are also grateful to Prof. Sartaj Sahni and the systems staff at the CIS Department of the University of Florida for providing access to a 64-processor nCUBE2. Sandia National Labs provided access to their 1024-processor nCUBE2.

References

[1] Cleve Ashcraft. The fan-both family of column-based distributed Cholesky factorization algorithms. In A. George, John R. Gilbert, and J. W.-H. Liu, editors, Graph Theory and Sparse Matrix Computations. Springer-Verlag, New York, NY, 1993.
[2] Cleve Ashcraft, S. C. Eisenstat, and J. W.-H. Liu. A fan-in algorithm for distributed sparse numerical factorization. SIAM Journal on Scientific and Statistical Computing, 11:593-599, 1990.
[3] Cleve Ashcraft, S. C. Eisenstat, J. W.-H. Liu, and A. H. Sherman. A comparison of three column based distributed sparse factorization schemes. In Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991.
[4] Iain S. Duff, N. I. M. Gould, J. K. Reid, J. A. Scott, and K. Turner. The factorization of sparse symmetric indefinite matrices. Technical report, Central Computing Department, Atlas Center, Rutherford Appleton Laboratory, Oxon OX11 0QX, UK, August 1990.
[5] Iain S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software, 9:302-325, 1983.
[6] Iain S. Duff and J. K. Reid. The multifrontal solution of unsymmetric sets of linear equations. SIAM Journal on Scientific and Statistical Computing, 5(3):633-641, 1984.
[7] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54-135, March 1990.
[8] G. A. Geist and E. G.-Y. Ng. Task scheduling for parallel sparse Cholesky factorization. International Journal of Parallel Programming, 18(4):291-314, 1989.
[9] A. George, M. T. Heath, J. W.-H. Liu, and E. G.-Y. Ng. Sparse Cholesky factorization on a local memory multiprocessor. SIAM Journal on Scientific and Statistical Computing, 9:327-340, 1988.
[10] A. George and J. W.-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[11] A. George, J. W.-H. Liu, and E. G.-Y. Ng. Communication results for parallel sparse Cholesky factorization on a hypercube. Parallel Computing, 10(3):287-298, May 1989.
[12] Ananth Grama, Anshul Gupta, and Vipin Kumar. Isoefficiency: Measuring the scalability of parallel algorithms and architectures. IEEE Parallel and Distributed Technology, 1(3):12-21, August 1993.
[13] Anshul Gupta and Vipin Kumar. A scalable parallel algorithm for sparse matrix factorization. Technical Report 94-19, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994. Available in users/kumar at anonymous FTP site ftp.cs.umn.edu.
[14] M. T. Heath, E. G.-Y. Ng, and Barry W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33:420-460, 1991.

[15] Laurie Hulbert and Earl Zmijewski. Limiting communication in parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 12(5):1184-1197, September 1991.
[16] George Karypis, Anshul Gupta, and Vipin Kumar. Parallel formulation of interior point algorithms. Technical Report 94-20, Department of Computer Science, University of Minnesota, Minneapolis, MN, April 1994. A short version appears in Supercomputing '94 Proceedings.
[17] George Karypis and Vipin Kumar. A high performance sparse Cholesky factorization algorithm for scalable parallel computers. Technical Report TR 94-41, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.
[18] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA, 1994.
[19] R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, 36:177-189, 1979.
[20] J. W.-H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Review, 34:82-109, 1992.
[21] Robert F. Lucas, Tom Blank, and Jerome J. Tiemann. A parallel solution method for large sparse systems of equations. IEEE Transactions on Computer Aided Design, CAD-6(6):981-991, November 1987.
[22] Vijay K. Naik and M. Patrick. Data traffic reduction schemes for Cholesky factorization on asynchronous multiprocessor systems. In Supercomputing '89 Proceedings, 1989.
[23] Dianne P. O'Leary and G. W. Stewart. Assignment and scheduling in parallel matrix factorization. Linear Algebra and its Applications, 77:275-299, 1986.
[24] Alex Pothen, H. D. Simon, and Lie Wang. Spectral nested dissection. Technical Report 92-01, Computer Science Department, Pennsylvania State University, University Park, PA, 1992.
[25] Alex Pothen and Chunguang Sun. Distributed multifrontal factorization using clique trees. In Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 34-40, 1991.
[26] Edward Rothberg. Performance of panel and block approaches to sparse Cholesky factorization on the iPSC/860 and Paragon multicomputers. In Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.
[27] Edward Rothberg and Anoop Gupta. An efficient block-oriented approach to parallel sparse Cholesky factorization. In Supercomputing '92 Proceedings, 1992.
[28] Robert Schreiber. Scalability of sparse direct solvers. In A. George, John R. Gilbert, and J. W.-H. Liu, editors, Sparse Matrix Computations: Graph Theory Issues and Algorithms (An IMA Workshop Volume). Springer-Verlag, New York, NY, 1993.
[29] Sesh Venugopal and Vijay K. Naik. SHAPE: A parallelization tool for sparse matrix computations. Technical Report DCS-TR-290, Department of Computer Science, Rutgers University, New Brunswick, NJ, June 1992.