Scalable Solutions to Integral Equation and Finite Element Simulations

Tom Cwik, Daniel S. Katz,* and Jean Patterson
Jet Propulsion Laboratory, 4800 Oak Grove Drive
California Institute of Technology, Pasadena, California 91109
cwik@jpl.nasa.gov
*Cray Research Inc.

ABSTRACT
When developing numerical methods, or applying them to the simulation and design of engineering components, it inevitably becomes necessary to examine the scaling of the method with a problem's electrical size. The scaling results from the original mathematical development (for example, a dense system of equations in the solution of integral equations) as well as the specific numerical implementation. Scaling of the numerical implementation depends upon many factors (for example, direct or iterative methods for solution of the linear system) as well as the computer architecture used in the simulation. In this paper, scalability will be divided into two components: scalability of the numerical algorithm specifically on parallel computer systems, and algorithm or sequential scalability. The sequential implementation and scaling is initially presented with the parallel implementation following. This progression is meant to illustrate the differences in using current parallel platforms from sequential machines, and the resulting savings. Time to solution (wall clock time) along with problem size are the key parameters plotted or tabulated. Sequential and parallel scalability of time harmonic surface integral equation forms and the finite element solution to the partial differential equations are considered in detail.
The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.
1. INTRODUCTION
The application of advanced computer architecture and software to a broad range of electromagnetic problems has allowed more accurate simulations of electrically larger and more complex components and systems. Computational algorithms are used in the design and analysis of antenna components and arrays, waveguide components, semiconductor devices, and in the prediction of radar scattering from aircraft, sea surface or vegetation, and atmospheric particles, among many other applications. A scattering algorithm, for example, may be used in conjunction with measured data to accurately reconstruct geophysical data. When used in design, the goal of a numerical simulation is to limit the number of trial fabrication and measurement iterations needed. The algorithms are used over a wide range of frequencies, materials and shapes, and can be developed to be specific to a single geometry and application, or more generally applied to a class of problems. The algorithms are numerical solutions to a mathematical model developed in the time or frequency domain, and can be based on the integral equation or partial differential equation form of Maxwell's equations. A survey of the many forms of mathematical modeling and numerical solution can be found in [1, 2]. The goal of this paper is to examine the scalability of certain numerical solutions both generally as sequential algorithms, and then specifically as parallel algorithms using distributed memory computers. Parallel computers continue to evolve, surpassing traditional computer architectures in offering the largest memories and fastest computational rates to the user, and will continue to evolve for several more generations in the future [3].
This paper overviews solutions to Maxwell's equations implicitly defined through systems of linear equations. Sequential and parallel scalability of time harmonic surface integral equation forms and the finite element solution to the partial differential equations are considered. More general scalability of other sequential algorithms can be found in [4]. Initially, in Section 2, a short review of the scalability of parallel computers is presented. In Sections 3 and 4 respectively, specific implementations of the method of moments solution to integral equation modeling, and a finite element solution, will be discussed. The scalability of sequential solutions and related reduced memory methods will be considered, followed by parallel scalability and computer performance for these algorithms. The size of problems capable of being examined with current computer architectures will then be listed, with scalings to larger size problems also presented.
2. SCALABILITY OF PARALLEL COMPUTERS
Parallel algorithm scalability is examined differently from sequential algorithm scalability, or more precisely, scalability on shared memory (common address space) machines with no inter-processor communication overhead. Since a parallel computer is an ensemble of processors and memory linked by high performance communication networks, calculations that were performed on a single processor must be broken into pieces and spread over all processors in use. At intermediate points in the calculation, data needed in the next stage of the calculation must be communicated to other processors.

The central consideration in using a parallel computer is to decompose the discretized problem among processors so that the storage and computational load are balanced, and the amount of communication between processors is minimal [5,6]. When this is not handled properly, efficiency is lower than 100%, where 100% is the machine performance when all processors are performing independent calculations and no time is used for communication. If the problem is decomposed poorly, some processors will work while others stand idle, thereby lowering machine efficiency. Similarly, if calculations are load balanced but processors must wait to communicate data, the efficiency is lowered. Scalability and efficiency are defined to quantify the parallel performance of a machine. Scalability, also termed speedup, is the ratio of time to complete calculations sequentially on a single processor to that on P processors
S = T(1) / T(P).    (1)

The efficiency is then the ratio of scalability to the number of processors

E = S / P.    (2)
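As a minimal sketch of Eqs. (1) and (2), the speedup and efficiency for a run can be computed directly from wall clock times; the timing numbers used below are illustrative assumptions, not measurements from the paper.

```python
def speedup(t1: float, tp: float) -> float:
    """S = T(1) / T(P): ratio of single-processor time to P-processor time."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """E = S / P: fraction of ideal linear speedup actually achieved."""
    return speedup(t1, tp) / p

# Hypothetical run: 1000 s sequentially, 40 s on 32 processors.
s = speedup(1000.0, 40.0)
e = efficiency(1000.0, 40.0, 32)
print(f"speedup = {s:.1f}, efficiency = {e:.1%}")
```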
If an algorithm issues no communication calls, and there is no component of the calculation that is sequential and therefore redundantly repeated at each processor, the scalability is equal to the number of processors P and the efficiency is 100%. The scalability, as defined, must be further clarified if it is to be meaningful since the amount of storage, i.e., problem size, has not been included in the definition. Two regimes can be considered: fixed problem size and fixed grain size. The first, fixed problem size, refers to a problem that is small enough to fit into one or a few processors and is successively spread over a larger sized machine. The amount of data and calculation in each processor will decrease and the amount of communication will increase. The efficiency must therefore successively decrease, reaching a point where the run time is communication bound. The second, fixed grain size, refers to a problem size that is scaled to fill all the memory of the machine in use. The amount of data and calculation in each processor will be constant, and in general, much greater than the required amount of communication. Efficiency will remain high as successively larger problems are solved. Fixed grain problems ideally exhibit the scalability that is a key motivator for parallel processing: successively larger problems can be mapped onto successively larger machines without a loss of efficiency.
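The two regimes above can be contrasted with a toy cost model; the cost coefficients and the logarithmic communication term are assumptions chosen only to illustrate the trend, not a model from the paper.

```python
import math

def parallel_time(work: float, p: int, comm: float = 1.0) -> float:
    """Computation divides over p processors; communication grows like log2(p)."""
    if p == 1:
        return work
    return work / p + comm * math.log2(p)

def efficiency(work: float, p: int) -> float:
    return parallel_time(work, 1) / (p * parallel_time(work, p))

# Fixed problem size: the same work spread over more processors.
fixed = [efficiency(1024.0, p) for p in (1, 4, 16, 64)]

# Fixed grain size: work per processor held constant as the machine grows.
grain = [efficiency(1024.0 * p, p) for p in (1, 4, 16, 64)]

print("fixed size:", [f"{e:.2f}" for e in fixed])   # efficiency decays
print("fixed grain:", [f"{e:.2f}" for e in grain])  # efficiency stays high
```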
When developing numerical methods, or applying them to the simulation and design of engineering components, it inevitably becomes necessary to examine the scaling of the method with a problem's electrical size. The scaling results from the original mathematical development (for example, a dense system of equations in the solution of integral equations) as well as the specific numerical implementation. Scaling of the numerical implementation depends upon many factors (for example, direct or iterative methods for solution of the linear system) as well as the computer architecture used in the simulation. In the rest of this paper, scalability will be divided into two components: first, algorithmic scalability, and second, scalability of the numerical algorithm specifically on parallel computer systems. Algorithmic scalability refers to the amount of computer memory and time needed to complete an accurate solution as a function of the electrical size of the problem. Scalability on a parallel computer system refers to the ability of an algorithm
to achieve performance proportional to the number of processors being used, as outlined above. The sequential implementation and scaling is initially presented with the parallel implementation following. This progression is meant to illustrate the differences in using current parallel platforms from sequential machines, and the resulting savings. Clearly, different mathematical formulations can lead to alternate numerical implementations and different scalings. The objective of any numerical method is to provide an accurate simulation of the measurable quantities in an amount of time that is useful for engineering design. With this objective, time to solution (wall clock time) along with problem size are the key parameters plotted or tabulated.
3. INTEGRAL EQUATION FORMULATIONS
The method of moments is a traditional algorithm used for the solution of a surface integral equation [7,8]. This technique can be applied to impenetrable and homogeneous objects, objects where an impedance boundary condition accurately models inhomogeneous materials or coatings, and those inhomogeneous objects that allow the use of Green's functions specific to the geometry. A dense system of equations results from discretizing the surface basis functions in a piecewise continuous set, with this system being solved in various ways. The components of method of moments solutions that affect the algorithmic scalability are the matrix fill and solution. A system of equations

AX = B    (3)

results from the method of moments, where A generally is a non-symmetric complex-valued square matrix, and B and X are complex-valued rectangular matrices when multiple excitations and solution vectors are present. The solution of (3) is most conveniently found by an LU factorization

A = LU    (4)

where L and U are lower and upper triangular factors of A. The solution for X is computed by successive forward and back substitutions to solve the triangular systems

LY = B,  UX = Y.    (5)
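The solution path of Eqs. (3)-(5) can be sketched with standard library routines: factor once, then forward/back substitute for each excitation column. The small random complex system below is only a stand-in for a moment-method matrix.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n, n_rhs = 200, 3
# Stand-in for a dense complex-valued moment-method system (Eq. 3).
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n_rhs)) + 1j * rng.standard_normal((n, n_rhs))

lu, piv = lu_factor(A)        # A = LU with partial pivoting (Eq. 4), O(N^3), done once
X = lu_solve((lu, piv), B)    # forward/back substitution (Eq. 5), O(N^2) per column

assert np.allclose(A @ X, B)
```

Amortizing the single O(N^3) factorization over many O(N^2) right-hand-side solves is exactly what makes multiple excitations cheap once A is factored.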
Because the system in (3) is not generally positive definite, rows of A are permuted in the factorization, leading to a stable algorithm [9]. Table 1 is a listing of computer storage and time scalings for a range of problem sizes when using standard LU decomposition factorization, and forward and backward solution algorithms readily available [10]. The table is divided along columns into problem size, factorization and solution components. The first column fixes the memory size of the machine being used; the number of unknowns and surface area modeled (assuming 200 unknowns/λ²), based on this memory, are in the next columns. The factor and solve times are based on the performance of three classes of machine: current high-end workstation (0.1 Gigaflops), current supercomputer (10 Gigaflops), and next generation computer (1000 Gigaflops). Similarly, the rows are divided into the top four, which would typically correspond to current generation workstations, and the lower four that correspond to supercomputer class machines. Due to the nature of the dense matrix data structures [10] in the factorization algorithms, this component of the calculation can be highly efficient on general computers. A value of 80% efficiency of the peak machine performance is used for the time scalings. The backward and forward solution algorithms, though, operate sequentially on triangular matrix systems, resulting in reduced performance, and a 50% efficiency is used in these columns. This performance will increase when many excitations (right-hand sides) are involved in the calculation, resulting in performance closer to peak.
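A back-of-envelope model of the kind behind Table 1 can be sketched as follows. The 80% and 50% efficiencies come from the text; the flop constants are assumptions (roughly (8/3)N^3 real operations for a complex LU factorization and roughly 8N^2 per complex triangular solve pair).

```python
def factor_time(n: int, peak_gflops: float, eff: float = 0.80) -> float:
    """Estimated LU factorization time in seconds at eff * peak rate."""
    return (8.0 / 3.0) * n**3 / (eff * peak_gflops * 1e9)

def solve_time(n: int, peak_gflops: float, eff: float = 0.50) -> float:
    """Estimated forward + back substitution time per right-hand side."""
    return 8.0 * n**2 / (eff * peak_gflops * 1e9)

# Illustrative: a 32,768-unknown problem on a 10 Gflops machine.
print(f"factor: {factor_time(32768, 10.0) / 60:.0f} min, "
      f"solve: {solve_time(32768, 10.0):.2f} s")
```

The cubic growth of the factor time against the quadratic growth of the solve time is the scaling contrast the table tabulates.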
Table 1. Scaling of method of moments matrix factorization and solution algorithms.
The time for factorization, scaling as N³, can limit the time of the calculation for large problems, even if the matrix can be assembled and stored. An alternative to direct factorization methods is the use of iterative solvers of the dense system [11,12]. These methods are even more useful if multiple right hand sides are present, since the iterative solvers exploit information in these vectors [11]. If the convergence rate can be controlled, and the number of iterations required to complete a solution are limited, the solution time can be reduced as compared to the direct factorization methods.
The direct limitation of integral equation methods, though, is the memory needed to store the dense matrix. Larger problems can only be simulated by circumventing this bottleneck. One approach to this problem is to use higher-order parametric basis functions which reduce the number of unknowns needed to model sections of the surface [13]. However, it is clear that the growth in memory required for storage of the dense complex impedance matrix (N²) cannot be overcome by only slightly reducing the size of N or by increasing the number of computer processors applied to the problem. Alternative methods are then desirable.

One such method uses special types of basis functions to produce a sparse impedance matrix [14], reducing the storage from N² to αN, where α is a constant independent of N. The resulting sparse system can be solved using the methods described in later sections of this paper. Other classes of methods for generating and solving the impedance matrix while reducing storage are summarized in [15], while another is the fast multipole method [16,17]. This approach decomposes the impedance matrix in a manner that also reduces the needed storage to αN. The fast multipole method has been parallelized in [18], where it is suggested that an Intel Paragon system with 512 nodes each containing 32 Mbytes of memory could solve a problem with 250,000 basis functions.
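The gap between N² and αN storage can be made concrete with a small estimator; the 16-byte double-complex entry size and α = 100 are assumptions for illustration, while the 250,000-unknown problem size is the one quoted above from [18].

```python
def dense_bytes(n: int, elem: int = 16) -> int:
    """Dense impedance matrix: N^2 complex entries."""
    return elem * n * n

def sparse_bytes(n: int, alpha: int = 100, elem: int = 16) -> int:
    """Sparse/fast-method representation: alpha * N entries."""
    return elem * alpha * n

n = 250_000
print(f"dense:  {dense_bytes(n) / 2**30:.0f} GiB")
print(f"sparse: {sparse_bytes(n) / 2**30:.2f} GiB")
```

At this size the dense matrix needs roughly a terabyte while the αN representation fits in well under a gigabyte, which is why the dense storage wall cannot be beaten by modestly shrinking N.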
3.a Scalability on Parallel Computers

To specifically examine parallel scalability, the electric-field integral equation model developed in the PATCH code [19,20] is used. This code uses a triangularly faceted surface model of the object being modeled, and builds a complex dense matrix system (identical to that given in equation (3))

ZI = V    (6)

where the elements of Z are given by

(7)

and i and j are indices on the edges of the surface facets, G is the Green's function for an unbounded homogeneous space, r and r′ are arbitrary source and observation points, and {T_i, T_j} are current expansion and testing functions. The impedance matrix Z is factored by means of an LU factorization, and for each vector V a forward and backward substitution is performed to obtain I, the unknown currents on the edges of the surface facets. It is then a simple matter to compute radar cross section or other field quantities by a forward integral. The elements of the PATCH code considered in the parallelization are matrix fill, matrix factorization and solution of one or many right hand sides, and the calculation of field quantities. The computation cost of these three elements must be examined in relation to increasing problem size and increasing number of processors in order to understand the scalability of the PATCH code.
3.a.i Matrix Equation Fill
Since the impedance matrix is a complex dense matrix of size N, where N is the number of edges used in the faceted surface of the object being modeled, the matrix has N² elements, and filling this matrix scales as N². One method for reducing the amount of time spent in this operation when using a parallel computer is to spread the fixed number of elements to be computed over a large number of processors. Since the amount of computation is theoretically fixed (neglecting communication between processors), applying P processors to this task should provide a time reduction of P. However, the computations required by the PATCH code's basis functions involve calculations performed at the center of each patch that contribute to the matrix elements associated with the three current functions on the edges of that patch. The sequential PATCH code will loop over the surface patches and compute partial integrals for each of the three edges that make up that patch. This algorithm is quite efficient in a sequential code, but is not as appropriate for a parallel code where the three edges of both the source and testing patch (corresponding to the row and column indices of the matrix) will not generally be located in the same processor. The parallel version of PATCH currently only uses this calculation for those matrix elements that are local to the processor doing the computation, introducing an inefficiency compared to a sequential algorithm. This inefficiency is specific to the integration algorithm used, and can be removed, for example, by communicating partial results computed in one processor to the processors that need them. This step has not been taken at this time in the parallel PATCH code, due partially to the complexity it adds to the code.
3.a.ii Matrix Equation Factorization and Solution

When the impedance matrix is assembled, the solution is completed by an LU factorization, scaling as N³, and a forward and backward substitution used to solve for a given right hand side, scaling as N². The choice of the LU factorization algorithm determines the matrix decomposition scheme used on the set of processors.

One style of decomposition that is suitable for use with LINPACK [21] factorization routines is to partition the matrix
9
[1A,
A=”:
A (P.-1)
(8)
where Al E Cm”~ and mi = N / 1’. l~ach submatrix A i is then assigned to processor i, where }’ is
the number of processors. The 1.INPACK mcthcxl involves many BI.AS 1 (vector-vector)
operations [22], which perform at 25 (o 70 MN .01’S on the Cray 1311 (150 Ml:l .01’S peak
performance.) ‘IIesc type operations perform poorly on the “1’3D and other hierarchical memory
c.omputcrs because the amount of work that is performed on the data brought from memory to the
cache then to the processor is similar in six to the amount of that data.
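The work-to-data argument above can be quantified as arithmetic intensity (flops per element moved); the operation counts below are the standard ones for these kernels, not measurements from the T3D.

```python
def axpy_intensity(n: int) -> float:
    """BLAS 1 y = a*x + y: 2n flops over ~3n elements touched (read x, read y, write y)."""
    return 2 * n / (3 * n)

def gemm_intensity(n: int) -> float:
    """BLAS 3 C = A*B + C: 2n^3 flops over ~4n^2 elements touched (read A, B, C; write C)."""
    return 2 * n**3 / (4 * n**2)

print(f"axpy: {axpy_intensity(1000):.2f} flops/element (memory bound)")
print(f"gemm: {gemm_intensity(1000):.0f} flops/element (compute bound)")
```

The BLAS 1 ratio is a constant near one, so the processor waits on memory, while the BLAS 3 ratio grows with n, which is why the block decompositions discussed next favor matrix-matrix kernels.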
Another type of decomposition is to assume that the physical processors P_ij form a logical two-dimensional p_r × p_c array, where i and j refer to the row and column index of the processor. Then the simplest assignment of matrix elements to processors in a block method is to partition the matrix

A = [ A_00        ...  A_0(p_c-1)
      ...
      A_(p_r-1)0  ...  A_(p_r-1)(p_c-1) ]    (9)

where A_ij ∈ C^(m_i × n_j), m_i = N / p_r, and n_j = N / p_c. Submatrix A_ij is then assigned to processor P_ij.
This is an appropriate decomposition for use with LAPACK routines [10] that use BLAS 3 (matrix-matrix) operations [23] and can perform at a high rate, typically 100 to 120 MFLOPS on a T3D processor. These operations perform a large amount of work on the data that has been brought into the cache, compared with BLAS 1 operations. However, this simple decomposition doesn't provide good load balance. This can be overcome by blocking the matrix into much smaller k × k submatrices, and wrapping these in two dimensions onto the logical processor array. In other words, partition

A = [ A_00       ...  A_0(M-1)
      ...
      A_(M-1)0   ...  A_(M-1)(M-1) ]    (10)

where M = N / k, and all blocks are of size k × k. Then block A_ij is assigned to processor P_(i mod p_r)(j mod p_c). This is the partitioning strategy that is used by PATCH. (On the T3D, k = 32 has been found to provide the optimal balance between load balance, which demands the smallest possible k, and the performance of the BLAS 3 operations, which requires a large value for k.)
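The two-dimensional block-cyclic mapping of Eq. (10) can be sketched directly; the 4 × 4 processor grid and 16 × 16 block grid below are illustrative choices (the paper reports k = 32 as the block size that works well on the T3D).

```python
def owner(i: int, j: int, p_r: int, p_c: int) -> tuple[int, int]:
    """Grid coordinates of the processor owning block (i, j): (i mod p_r, j mod p_c)."""
    return (i % p_r, j % p_c)

def blocks_on(proc, n_blocks, p_r, p_c):
    """All block indices mapped to one processor (for checking load balance)."""
    return [(i, j) for i in range(n_blocks) for j in range(n_blocks)
            if owner(i, j, p_r, p_c) == proc]

# With a 4 x 4 processor grid and a 16 x 16 grid of blocks, every
# processor owns exactly 16 blocks -- the balance plain blocking (Eq. 9) lacks.
grid = [(r, c) for r in range(4) for c in range(4)]
counts = {p: len(blocks_on(p, 16, 4, 4)) for p in grid}
assert set(counts.values()) == {16}
```

Cyclic wrapping keeps every processor active as the factorization front sweeps across the matrix, since no processor's blocks are concentrated in the already-factored corner.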
The PATCH code uses a matrix equation solver, named BSOLVE [24], based on this decomposition. Figure 1 shows the total and per processor performance for BSOLVE for the largest matrix that can be solved on each size of machine. Total time scales at the same efficiency as total performance, and is 25 minutes for the 30,240 size matrix on 256 processors.
Figure 1. Performance for BSOLVE (includes factoring matrix, estimating condition number, and solving for one right hand side).
Each processor of the T3D (60 Mbytes usable memory) can hold in memory a matrix piece of size 1890 × 1890. The largest T3D has 1024 processors and can store a matrix of size 60,480 × 60,480, which can be factored in about 100 minutes. Problems in this class (those that are limited only by memory requirements and not by run-time requirements) are good candidates for out-of-core methods. A larger matrix equation is stored on disk, and portions of the matrix are loaded into memory and factored. For an out-of-core solution to be efficient, the work involved in loading part of a problem from disk must be overlapped with the work performed in solving the problem, so that storing the matrix on disk doesn't significantly increase the run time over what it would be on a machine with more memory. Specialized matrix equation solvers have been
developed to efficiently exploit the large disk memory available on many parallel machines. Figure 2 shows general results for dense solvers, both in core and out of core, for various machines.

Figure 2. CPU time vs. number of unknowns for 64 bit complex dense direct solvers with partial pivoting for various machines (Cray T3D, T90 and C90; Intel Paragon and Delta). Processing elements (PEs) used in calculation shown for parallel computers. In-core and out-of-core (OOC) results specified.
3.a.iii Computation of Observables

The observables of the PATCH code, such as radar cross section information or near or far field quantities, can be easily computed once the currents on the edges of the surface patches are known. These calculations involve forward integrals of the currents and the freespace Green's function. Due to the discretization of this current, this integration results in a summation of field components due to each current basis function. Since these currents are distributed over all the processors in a parallel simulation, partial sums can be performed on the individual processors, followed by a global summation of these partial results to find the total sum. This calculation scales quite well since the only overhead is the single global sum.
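The partial-sum-then-global-reduction pattern above can be sketched with synthetic data; the current coefficients and integration weights below are random stand-ins, and the eight "processors" are simulated by splitting the index range.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4096
currents = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # distributed unknowns
weights = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # stand-in integral weights

P = 8
local_chunks = np.array_split(np.arange(n), P)  # each chunk lives on one "processor"

# Per-processor partial sums: purely local, no communication.
partials = [np.sum(currents[idx] * weights[idx]) for idx in local_chunks]

# Single global reduction -- the only parallel overhead in this step.
total = np.sum(partials)

assert np.isclose(total, np.sum(currents * weights))
```

In an actual distributed run the final line would be a collective reduction (e.g. an MPI allreduce); its cost grows only logarithmically with P, which is why this stage scales so well.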
3.a.iv Total Performance

Figure 3 shows overall code timings for a scaled size problem. Each case involves a matrix that fills the same fraction of each processor's memory. As the problem size increases, more processors are used in the calculation. It is clear that the matrix fill is dominant in the code, and in fact, the crossover point between the O(N³) factorization and the O(N²) fill has not been reached. Also, the time involved in the radar cross section calculation is shown to decrease as the number of processors is increased, as was described in the previous paragraph. It may also be observed that the matrix solve time, also O(N²), parallels the fill time quite well.

Figure 3. Scalability for PATCH code for scaled size problems.
Figure 4 shows scalings for fixed size problems on the T3D. Each of the problems shown is run on the smallest set of processors needed to hold the matrix data, and then on larger numbers. The total code time initially decreases linearly with the number of processors, leveling off as the amount of communication time begins to become a larger fraction of the total time needed to complete the calculations.
Figure 4. Scalability for PATCH code for fixed size problems. The 16,200 unknown case was only run on 256 processors.
4. FINITE ELEMENT FORMULATIONS
Volumetric modeling by the use of an integral equation can also be used in simulations, though the available memory of current or planned technology greatly limits the size of problems that can be modeled. Because of this limitation in modeling three-dimensional space by integral equations, finite element solutions of the partial differential equations that lead to sparse systems of equations are commonly used [25]. A finite element model is natural when the problem contains inhomogeneous material regions that surface integral equation methods are either incapable of modeling, or are very costly to model. The problem domain is broken into a finite element basis function set used to discretize the fields. The resulting linear system of equations, rather than scaling as the N² storage of the method of moments, scales as mN where m is the average number of non-zero matrix equation elements per row of the sparse linear system. This value is dependent upon the order of the finite element used, but is typically between 10 and 100, and is independent of the size of the mesh. For a 6 unknown, vector edge-based tetrahedral finite element [26], m is typically 16.
Typically, the system of equations resulting from a finite element discretization is symmetric; the non-zero structure of a representative example is shown in Figure 5 (left). A symmetric factorization of this system (Cholesky factorization) leads to [27]

A = L̃DL̃^T = L̃D^(1/2)D^(1/2)L̃^T = LL^T    (11)

where the diagonal matrix is specifically shown distributed symmetrically between the symmetric factors L, an important consideration when symmetrically applying an incomplete Cholesky preconditioner in iterative methods. The factorization results in non-zero elements in L where non-zeros exist in A, as well as fill-in, or new non-zero entries generated during the factorization. Fill-in requires additional storage for L, as well as additional time to complete the factorization. To reduce the amount of fill-in, the system is reordered by applying a permutation

(PAP^T)PX = PB    (12)

where the permutation matrix P satisfies PP^T = I. In (12), PAP^T remains symmetric, and the forward and backward substitution phases become

LY = PB,  L^T Z = Y,  X = P^T Z    (13)

where L is the Cholesky factor of PAP^T. For a sparse factorization, the permutation matrix is chosen to minimize the amount of fill-in generated. Since there are n! possibilities for P to minimize fill-in, heuristic methods are used to achieve a practical minimization, the most common being the minimum degree algorithm [28]. Figure 5 (right) shows the non-zero structure of the representative system after reordering for a canonical scattering problem.
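The permuted solve of Eqs. (12)-(13) can be sketched end to end: permute A, factor PAP^T = LL^T, forward/back substitute, and un-permute. A small symmetric positive definite stand-in and a random permutation are assumed here; a real FEM system may be indefinite (see below) and would use a fill-reducing ordering such as minimum degree rather than a random one.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)       # SPD stand-in for the FEM matrix
b = rng.standard_normal(n)

perm = rng.permutation(n)         # stand-in for a minimum-degree ordering
PAPt = A[np.ix_(perm, perm)]      # P A P^T

L = cholesky(PAPt, lower=True)                 # P A P^T = L L^T   (Eq. 11)
y = solve_triangular(L, b[perm], lower=True)   # L Y = P B         (Eq. 13)
z = solve_triangular(L.T, y, lower=False)      # L^T Z = Y
x = np.empty(n)
x[perm] = z                                    # X = P^T Z

assert np.allclose(A @ x, b)
```

The factor's sparsity, and hence the storage and time costs in Table 2, depends entirely on how well the ordering limits fill-in; the algebra of the solve is unchanged by the choice of P.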
Table 2 lists scaling data for problem size when using a Cholesky factorization with the minimum degree reordering algorithm used to minimize storage. Based on the computer storage available, the number of edges in an edge-based tetrahedral mesh [18] along with the number of non-zeros in the factor L [29], and the volume that can be modeled, is shown. The volume is based on the use of 15,000 tetrahedra per cubic wavelength, corresponding to approximately 25 edges per linear wavelength. This number can vary depending on the physical geometry (curvature, edges, points) and the local nature of the fields.
Theoretically, the system in (12) generally is not positive definite, being symmetric indefinite, and a Cholesky factorization in this case is numerically unstable. For the practical solution of most problems, this has not been found to be an issue. Methods that preserve the symmetric sparsity, and allow pivoting to produce more stable algorithms, can be used [28].

Figure 5. Non-zero matrix structure of typical finite element simulation. Left, original structure; right, structure after reordering to minimize bandwidth.
Table 2. Scaling of typical factorization finite element matrix solution algorithms.
From Table 2 it is seen that even though the storage for the finite element method is linear in N, the fill-in due to the use of factorization algorithms causes the storage to grow as N^1.4. Linear storage can be maintained using an iterative solution. Sparse iterative algorithms for systems resulting from electromagnetic simulations recursively construct Krylov subspace basis vectors that are used to iteratively improve the solution to the linear system. The iterates are found from minimizing a norm of the residual

r = Ax − b    (14)

at each step of the algorithm. (A single solution vector for a single excitation is shown.) Since the system is complex-valued indefinite, methods appropriate for this class of system such as bi-conjugate gradient, generalized minimal residual, and the quasi-minimal residual algorithm are applied [30]. They all require a matrix-vector multiply, and a set of vector inner products, for the calculation. The iterative algorithms require the storage of the matrix and a few vectors of length N. When only the matrix and a few vectors need to be stored, problems of very large size can be handled, if the convergence rate is controlled. The number of iterations (with a sparse matrix-dense vector multiply accounting for over 90% of the time at each step of the iterative algorithm) determines the time to solution.
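The matvec-dominated structure of a Krylov iteration can be sketched with a minimal conjugate gradient loop; CG on a symmetric positive definite stand-in (a 1-D Laplacian) is used here for brevity, whereas the complex indefinite systems discussed above call for BiCG, GMRES, or QMR instead.

```python
import numpy as np
import scipy.sparse as sp

n = 100
# 1-D Laplacian: a sparse SPD test matrix with ~3 nonzeros per row.
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
rs = r @ r
matvecs = 0
for _ in range(10 * n):
    Ap = A @ p                    # the one sparse matvec: the dominant cost per step
    matvecs += 1
    alpha = rs / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap
    rs_new = r @ r
    if np.sqrt(rs_new) < 1e-10:
        break
    p = r + (rs_new / rs) * p     # next Krylov direction from a few vector ops
    rs = rs_new

assert np.linalg.norm(A @ x - b) < 1e-8
```

Each iteration costs one sparse matvec (about mN operations) plus a handful of length-N vector operations, so the iteration count multiplied by mN gives the total work, exactly the accounting used in the scaling argument that follows.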
18
l]cc.ause the matrix-vector multiply dominates the KIY1OV iterative mcthods, the algorithmic
scaling is found from this operation. A single. SpaI-SC matrix-dcrlsc v~tor multiply requires mAI
operations, and if there arc 1 [otal iterations required for convcrgcncc, the n(mltwr of floating point
operations nczded is I* mN. A typical solution of the systcm of equations, without the application
of a prcconditioncr, may require a number of iterations 1 = ~N, producing the squcntial algorithm
scaling
,,,N% (15)
for the solution of a sing,lc right-hand-side. The algorithmic scaling for a direct factorization
method is 0(m2N) [28, page 104]; therefore the factorization methods-—when memory is sufficient
to hold the fill-in entries---give considerable CPU time savings in the solution. Iiurthcr advantage
is also gained over iterative methods when using a direct factorization. Modern computer
architectures arc typically much ICSS cffrcicnt at performing the sparse operations required in the
sparse matrix-dense vector mutip]y of [hc iterative algorithm, as compared to operations on dense
matrix systems. Current direct facloriz.ation methods attempt to usc block algorithms, exploiting
dense matrix sub-structure in the sparse systcm, and therefore increasing the performance of the
factorization, further improving the performance from that given by the algorithmic scaling
differences.
It is seen that the number of iterations directly increases the time to solution in the iterative
methods. This number can be controlled to some degree by the use of preconditioning methods
that attempt to transform the matrix equation into one with more favorable properties for an iterative
solution. To control the convergence rate, the matrix A should first be scaled by a diagonal matrix that
produces ones along the diagonal; this scaling removes the dependence on different element sizes
in a mesh. A preconditioner M can then be symmetrically applied to transform the system, giving
M^{-1} A M^{-1} (Mx) = M^{-1} b.        (16)
The right hand side vector is initially transformed, and the system is then solved for the
intermediate vector y = Mx, multiplying this vector by M^{-1}, A and M^{-1} in succession at each
iterative step. When the solution has converged, x is recovered from y. The closer M is to L in
(11), the quicker the transformed system will converge to a solution. A common preconditioner is
an incomplete Cholesky factorization [31] where M is chosen as a piece of the factor L in (11). It
is computed to keep some fraction of the true factorization elements, the exact number and sparsity
locations of the elements depending on the exact algorithm used. A useful form of incomplete
factorization keeps the same number of elements in the incomplete factor as there are in A. This
requires three times the number of operations at each iterative step; therefore the time to solution
will be decreased only if the number of iterations falls below one-third of the unpreconditioned
count.
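The diagonal scaling step and the symmetric application in (16) can be illustrated directly. The sketch below (NumPy, with a made-up complex symmetric matrix standing in for the finite element system) uses the diagonal scaling itself as the transformation, the simplest case; an incomplete factor L would be applied the same way.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
# Complex symmetric (not Hermitian) stand-in for the FEM system, with a
# widely varying diagonal mimicking different element sizes in a mesh.
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = B + B.T + np.diag(np.logspace(1, 4, n)).astype(complex)
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Diagonal scaling: S A S has ones along the diagonal.
S = np.diag(1.0 / np.sqrt(np.diag(A)))
As = S @ A @ S

# Solve the transformed system, then recover x, mirroring the symmetric
# application of a preconditioner in (16).
y = np.linalg.solve(As, S @ b)
x = S @ y
```

The scaled system has unit diagonal regardless of how unevenly the mesh elements are sized, which is the point of the transformation.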
When the right hand side consists of a number of vectors, newly developed block methods
can be applied to the system, using the additional right hand sides to improve the convergence rate
[32,33].
4.a Scalability on Parallel Computers
In a finite element algorithm, the resultant sparse system of equations is stored within a data
structure that holds only the non-zero entries of the sparse system. This sparse system must
ultimately be distributed over the parallel computer, requiring special algorithms either to break the
original finite element mesh into specially formed contiguous pieces, or to distribute the
matrix entries themselves onto the processors of the computer. As in the dense method of
moments solution, the pieces are distributed in a manner that allows for an efficient solution of the
matrix equation system.
4.a.i The Finite Element Mesh and the Sparse Matrix Equation
The volumetric region (V) is enclosed by a surface (S); within the volume, a finite element
discretization of a weak form of the wave equation is used to model the geometry and fields
H is the magnetic field (the H-equation is used in this paper; a dual E-equation can also be
written), W is a testing function, the asterisk denotes conjugation, and H × n̂ is the tangential
component of H on the bounding surface. In (17), ε_r and μ_r are the relative permittivity and
permeability, respectively, and k_0 and η_0 are the free-space wave number and impedance, respectively.
A set of finite element basis functions, the tetrahedral vector edge elements (Whitney elements),
will be used to discretize (17),
W_mn(r) = λ_m(r)∇λ_n(r) - λ_n(r)∇λ_m(r),        (18)

where λ(r) are the tetrahedral shape functions and the indices (m, n) refer to the two nodal points of
each edge of the finite element mesh. These elements will be used for both expansion and testing
(Galerkin's method) in the finite element domain. Because of the local nature of (17) (sub-domain
basis functions and no Green's function involved in the integration of the fields), the system of
equations resulting from the integration contains non-zero entries only when the finite elements
overlap, or are contiguous at an edge. Because the mesh is unstructured, containing elements of
different size and orientation conforming to the geometry, the resultant matrix equation will have a
sparsity structure that is also unstructured.
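The overlap rule above (an entry (i, j) is non-zero only when edges i and j share a tetrahedron) is easy to demonstrate. The toy mesh below (two tetrahedra sharing a face; pure Python, invented for illustration) builds the resulting symmetric sparsity pattern.

```python
from itertools import combinations

# Toy mesh: two tetrahedra sharing the face (1, 2, 3).
tets = [(0, 1, 2, 3), (1, 2, 3, 4)]

# Number the unique edges (node pairs) of the mesh.
edges = {}
for tet in tets:
    for e in combinations(sorted(tet), 2):
        edges.setdefault(e, len(edges))

# Non-zero pattern: (i, j) appears only when edges i and j lie in a
# common tetrahedron, i.e. the element integrals overlap.
nz = set()
for tet in tets:
    tet_edges = [edges[e] for e in combinations(sorted(tet), 2)]
    for i in tet_edges:
        for j in tet_edges:
            nz.add((i, j))

n = len(edges)   # 9 unique edges for this two-element mesh
```

Even for this tiny mesh the pattern is not full: only edge pairs sharing an element produce entries, and the pattern is symmetric, as the text describes.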
The sparsity structure is further altered by the form of the Sommerfeld boundary condition
applied on the surface S. When local, symmetric absorbing conditions are applied on the boundary
[34], entering into the calculation through the surface integral in (17), a matrix with the structure
shown in Figure 5 (left) results. It is seen that the diagonal is entirely filled, corresponding to the
self-terms in the volume integral in (17), with the m non-zero entries scattered along the row (or
column) of the symmetric matrix. The location of these entries is completely dependent upon the
ordering of the edges of the tetrahedral elements used in the discretization. If a different shape or
order of the elements is used, the non-zero structure will differ slightly from the one shown.
When an integral equation method is used to truncate the mesh [35,36], a dense block of elements
will appear in the lower right of the system (when the edges of the finite element mesh on the
boundary are ordered last) as shown in Figure 6 (left). The integral equation approach to
truncating the mesh uses the finite element facets on the boundary as source fields in an integral
equation, resulting in a formulation for this piece of the calculation similar to that in Section 3, and
with an amount of storage needed for the dense matrix as a function of the electrical surface area
tabulated in Section 2. To circumvent the large dense storage needed with this application of a
global boundary condition, a surface of revolution can be used to truncate the mesh [37,38], using
a set of global basis functions to discretize the integral equation on this surface. This results in a
system similar to that in Figure 6 (right), containing very small diagonal blocks due to the orthogonal
global basis functions along the surface of revolution truncating the mesh, as well as a matrix
coupling the basis functions in the integral equation solution to the finite element basis functions on
the surface. These coupling terms are the banded thin rectangular matrices symmetric about the
diagonal of the matrix. Other forms of the integral equation solution, as outlined in Section 3, can
also be used to discretize the integral equation modeling fields on the mesh boundary, leading to
slight variations of the matrix systems shown in Figure 6.
Figure 6. Non-zero matrix sparsity structure for a system with a dense surface
integral equation boundary condition applied (left), and a surface of revolution
integral equation boundary condition (right). The mesh has 5,343 edges with
936 of those on the boundary.
The systems graphically represented in Figure 6 generally have the form

    [ K    C ] [ H ]   [ 0 ]
    [ C†   Z ] [ I ] = [ V ]        (19)
where K is the sparse, symmetric finite element matrix found from the volume integral in (17), C
denotes the coupling matrices that are interactions between the finite elements at the boundary
and the integral equation basis functions, and Z represents the integral equation, method of
moments matrix entries. The symbol † indicates the adjoint of a matrix. H is the vector of
magnetic field coefficients for each finite element, and I represents the equivalent current basis
functions on the boundary of the mesh. For a scattering problem formulation, the incident field
couples only to the integral equation boundary, and is represented as V. For radiation problems
the 0 and V vectors are interchanged since the impressed source is modeled in the mesh. Differing
formulations lead to variations in (19), but the general algebraic nature is preserved. To exploit the
sparsity of K in (19), the system is solved in two steps by initially substituting H = -K^{-1}CI from
the first equation in (19) into the second, producing

(Z - C† K^{-1} C) I = V.        (20)
This system has a size on the order of the number of basis functions in the integral equation model,
is dense, and can be solved by either direct factorization or iterative means as outlined in Section 3.
The intermediate calculation, KX = C, is the sparse system of equations to be solved, producing
X.
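The two-step reduction from (19) to (20) can be checked on a small stand-in system. In the sketch below (NumPy; the blocks are random dense matrices, not an actual FEM/MoM assembly, and the diagonal shifts exist only to keep the blocks well conditioned), the Schur-complement solution is verified against a direct solve of the full block system.

```python
import numpy as np

rng = np.random.default_rng(2)
nf, nb = 12, 4  # interior FE unknowns vs. boundary integral-equation unknowns

# Dense stand-ins for the blocks of (19).
K = rng.standard_normal((nf, nf)) + 1j * rng.standard_normal((nf, nf)) + 20 * np.eye(nf)
C = rng.standard_normal((nf, nb)) + 1j * rng.standard_normal((nf, nb))
Z = rng.standard_normal((nb, nb)) + 1j * rng.standard_normal((nb, nb)) + 20 * np.eye(nb)
V = rng.standard_normal(nb) + 1j * rng.standard_normal(nb)

# Direct solve of the full block system [[K, C], [C^H, Z]] [H; I] = [0; V].
full = np.block([[K, C], [C.conj().T, Z]])
sol = np.linalg.solve(full, np.concatenate([np.zeros(nf, dtype=complex), V]))
H_full, I_full = sol[:nf], sol[nf:]

# Two-step solution: X = K^{-1} C (the sparse solves KX = C), then the
# small dense reduced system (20), and finally H = -X I.
X = np.linalg.solve(K, C)
I_red = np.linalg.solve(Z - C.conj().T @ X, V)
H_red = -X @ I_red
```

The reduced system has only nb unknowns, which is why exploiting the sparsity of K this way pays off when the boundary is much smaller than the volume.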
The solution of this sparse system on a parallel computer requires it to be distributed.
Traditionally, the dependence between mesh data and the resultant sparse matrix data
is exploited in the development of mesh partitioning algorithms [39-42,55]. These algorithms
break the physical mesh or its graph into contiguous pieces that are then read into each processor of
a distributed memory machine. The mesh pieces are generated to have roughly the same number of
finite elements, and, to some measure, each piece has minimal surface area. Since the matrix
assembly routine (the volume integral in (17)) generates non-zero matrix entries that correspond to
the direct interconnection of finite elements, the mesh partitioning algorithm attempts to create a
load balance of the sparse system of equations. Processor communication in the algorithm that
solves the sparse system is limited by the ability to minimize the surface area of each mesh piece.
Mesh partitioning algorithms are generally divided into multilevel spectral partitioning,
geometric partitioning, and multilevel graph partitioning. Spectral partitioning methods [40,41]
create eigenvectors associated with the sparse matrix, and use this information to recursively
break the mesh into roughly equal pieces. They require the mesh connectivity information as input,
and return lists of finite elements for each processor. Geometric partitioning [39] is an intuitive
procedure that divides the finite element mesh into pieces based on the geometry (node x, y, z
coordinates) of the finite element mesh. This algorithm requires the mesh connectivity as well as
the node spatial coordinates, and returns lists of finite elements for each processor. Graph
partitioning [42] operates on the graph of the finite element mesh (mesh connectivity information)
to collapse (or coarsen) vertices and edges into a smaller graph. This smaller graph is partitioned
into pieces, and then uncoarsened and refined for the final partitions of finite elements for the
parallel processors. The input and output are identical to spectral methods. Multilevel algorithms
operate by performing multiple stages of the partitioning simultaneously, accelerating the
algorithm. Most of the algorithms and their offshoots perform similarly in practice, with the
spectral and graph partitioning algorithms being simpler to use since they do not need geometry
information. An alternative to these mesh partitioning algorithms is a method that divides the
matrix entries directly, without operating on the finite element mesh, and will be examined in
Section 4.a.iv.
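As a concrete (if much simplified) illustration of what these partitioners produce, the sketch below grows contiguous, equal-sized pieces of a graph by breadth-first search. The grid graph and the greedy strategy are invented for the example and are far cruder than spectral, geometric, or multilevel methods, but the inputs and outputs (connectivity in, element lists per processor out) are the same.

```python
from collections import deque

def bfs_partition(adj, nparts):
    """Grow contiguous pieces of roughly equal size by breadth-first search."""
    n = len(adj)
    target = (n + nparts - 1) // nparts
    part = [-1] * n
    p = 0
    for seed in range(n):
        if part[seed] != -1:
            continue
        q, count = deque([seed]), 0
        while q and count < target:
            v = q.popleft()
            if part[v] != -1:
                continue
            part[v] = p
            count += 1
            q.extend(u for u in adj[v] if part[u] == -1)
        p = min(p + 1, nparts - 1)   # last part absorbs any remainder
    return part

def grid_adj(rows, cols):
    """Adjacency of a grid graph standing in for a mesh dual graph."""
    adj = [[] for _ in range(rows * cols)]
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:
                adj[v].append(v + 1); adj[v + 1].append(v)
            if r + 1 < rows:
                adj[v].append(v + cols); adj[v + cols].append(v)
    return adj

adj = grid_adj(4, 4)
part = bfs_partition(adj, 4)
# Edges cut by the partition: the quantity a good partitioner minimizes,
# since cut edges become interprocessor communication.
cut = sum(part[u] != part[v] for u in range(len(adj)) for v in adj[u]) // 2
```

A real partitioner would also minimize the cut (the "surface area" of each piece); the sketch only balances the piece sizes.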
Different decompositions are used depending on whether direct factorization or iterative
methods are used in the solution. Decompositions for iterative solutions, as well as the iterative
methods themselves, have shown greater ease in parallelization than direct factorization methods.
Both approaches will now be considered.
4.a.ii Direct Sparse Factorization Methods
Direct factorization methods require a sequence of four steps: reordering of the sparse
system to minimize fill-in, a symbolic factorization stage to determine the structure and storage of
L in (13), the numeric factorization producing the complex-valued entries of L, and the triangular
forward and backward solutions. The fundamental difficulty in parallel sparse factorization is
the development of an efficient reordering algorithm that minimizes the fill-in and will scale well
on distributed memory machines, controlling the amount of communication necessary in the
computation. The minimum degree algorithm typically used in sequential packages is inherently
non-parallel, proceeding sequentially in the elimination of nodes in the graph representing the non-
zero structure of the matrix. Other algorithms for reordering, as well as the following symbolic
and numeric factorization steps that depend on this ordering, are under study [43]. Current
factorization algorithms [44,45,46] can exhibit fast parallel solution times on moderately large
sized problems, but are dependent on the relative structure of the mesh, whether the problem
is two- or three-dimensional, and the relative sparsity of the non-zero entries. For problems with
more structure and less sparsity, higher performance is found by these sparse factorization solvers.
4.a.iii Sparse Iterative Solution Methods
For a parallel implementation of the sparse iterative solvers introduced above, a
decomposition of the matrix onto the processors is required that minimizes communication of the overlapping
vector pieces in the parallel matrix-vector multiply of the iterative algorithm, reduces storage of the
resultant dense vector pieces on each processor, and allows for load balance in storage and
computation. Various parallel packages have been written that accomplish these goals
to some degree [47,48]. The mesh decompositions outlined previously can be used and integrated
with the parallel iterative algorithm to solve the system.
Alternatively, a relatively simple approach that divides the sparse matrix entries among the
distributed memory processors can be employed [49]. The matrix is decomposed in this
implementation into row slabs of the sparse reordered system. The reordering is chosen to
minimize and equalize the bandwidth of each row over the system [17,18] (as shown in Fig. 5,
right), since the amount of data communicated in the matrix-vector multiply depends upon the
combination of equalizing the row bandwidth as well as minimizing it. A row slab matrix
decomposition strikes a balance between near perfect data and computational load balance among
the processors, minimal but not perfectly optimal communication of data in the matrix-vector
multiply operation, and scalability of simulating larger sized problems on greater numbers of
processors. Since the right-hand-side vectors in the parallel sparse matrix equation (KX = C) are
the columns of C, these columns are distributed as required by the row distribution of K. When
setting up the row slab decomposition, K is split by attempting to equalize the number of non-
zeros in each processor's portion of K (composed of consecutive rows of K). The rows in a
given processor's portion of K determine the rows of C that processor will contain. As an
example, if the total number of non-zeros in K is nz, a loop over the rows of K is executed,
counting the number of non-zeros of K in the rows examined. When this number becomes
approximately nz / P (where P is the number of processors that will be used by the matrix
equation solver), the set of rows of K for a given processor has been determined, as has the set of
rows of C.
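The loop just described translates almost directly into code. Below is a sketch (plain Python; the per-row counts are invented) that splits consecutive rows of K into P slabs of approximately nz/P non-zeros each; the rows of C follow the same split.

```python
def row_slabs(row_nnz, nprocs):
    """Assign consecutive rows to processors, closing a slab once it
    holds roughly nz / P non-zeros."""
    nz = sum(row_nnz)
    target = nz / nprocs
    slabs, start, acc = [], 0, 0
    for i, count in enumerate(row_nnz):
        acc += count
        # Keep the final slab open so the last processor takes the remainder.
        if acc >= target and len(slabs) < nprocs - 1:
            slabs.append((start, i + 1))
            start, acc = i + 1, 0
    slabs.append((start, len(row_nnz)))
    return slabs

# Non-zeros per row of a small banded (reordered) system.
row_nnz = [5, 7, 9, 9, 8, 9, 9, 9, 7, 6, 9, 9, 8, 9, 9, 5]
slabs = row_slabs(row_nnz, 4)   # half-open (start, stop) row ranges
```

Each slab is a contiguous block of rows, so the columns of C distribute by simply taking the same row ranges.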
The matrix decomposition code used in this example consists of a number of subroutines:
initially, the potentially large mesh files are read (READ), then the connectivity structure of the
sparse matrix is generated and reordered (CONNECT), followed by the generation of the complex-
valued entries of K (FEM) and the building of the connectivity structure and filling of the C matrix
(COUPLING). Finally, the individual files containing the row slabs of K and the row slabs of C
must be written to disk (WRITE). For each processor that will be used in the matrix equation
solver, one file containing the appropriate parts of both the K and C matrices is written. Figure 7
shows the performance of these routines over varying numbers of processors for a problem
simulating scattering from a dielectric cylinder modeled by 43,791 edges. The parallel times on a
Cray T3D are compared against the code running sequentially on one processor of a T90. As
mentioned above, the reordering algorithm and the algorithm generating the matrix connectivity are
fundamentally sequential. These routines do not show high efficiency when using multiple
processors (the time for these algorithms is basically flat), whereas for routines that can be
parallelized (FEM, COUPLING and WRITE), doubling the number of processors reduces the
amount of time by a factor of approximately two. The time for reading the mesh is bound by the I/O
rates of the computer, and the time for writing the decomposed matrix data varies slightly for the
128 and 256 processor cases due to other users also doing I/O on the system. As will be shown in
the next result, a key point of this approach to matrix decomposition is that the total time needed
(less than 100 seconds on 8 processors) is substantially less than the time needed for solving the linear
system, and any inefficiencies here are less important than those in the iterative solver.
[Figure 7 chart: "Scaling of Sparse Slice Algorithm"; stacked bar times (seconds) for the READ, CONNECT, FEM, COUPLING and WRITE routines versus number of processors (one T90 processor, then 16 to 256 T3D processors).]
Figure 7. Computation time and scaling for a relatively small simulation
(dielectric cylinder with 43,791 edges, radius = 1 cm, height = 10 cm,
permittivity = 4.0 at 5.0 GHz). First column shows time for a single
processor T90. Times on the T90 for CONNECT and FEM have been
combined.
In this example, the quasi-minimal residual algorithm [52] is used to solve the sparse system
of equations KX = C. With the row slab decomposition used, the machine is logically considered
to be a linear array of processors, with each slab of data residing in one of the processors. Central
components of the quasi-minimal residual algorithm that are affected by the use of a distributed
memory machine are the parallel sparse matrix-dense vector multiply, and the dot products and norm
calculations that need vector data distributed over the machine. The dominant component is the
matrix-vector multiply, accounting for approximately 80% of the time required in a solution. The
parallel sparse matrix-dense vector multiply involves multiplying the K matrix that is distributed
[Figure 8 diagram: a row slab of K held by the local processor, with the local processor rows and the slab's column extent marked; the pieces of x outside the local rows are communicated from processors to the left and to the right.]
Figure 8. Local sparse matrix-dense vector multiply graphically displayed.
across the processors in row slabs, each containing a roughly equal number of non-zero elements,
and a dense vector x, that is also distributed over the processors, to form a product vector y,
distributed as is x (Figure 8). Since the K matrix has been reordered for minimum bandwidth, the
minimum and maximum column indices of the slab are known. If the piece of the dense vector x
local to this processor has indices within this extent of column indices, the multiply may be done
locally and the resultant vector y will be purely local. In general, the local row indices of the dense
vector x do not contain the range of column indices; therefore a communication step is required to
obtain the portions of the multiply vector x required by the column indices of the K matrix. This
communication step only requires data from a few processors to the left and right. The exact
number of processors communicating data is dependent on the row bandwidth of the local piece of
K, and the number of processors being used. In the simulations considered, the number of
processors communicating data is typically one or two in each direction on scaled problems.
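The communication pattern can be simulated without a parallel machine. The sketch below (NumPy; a synthetic banded K, four "processors" emulated in a loop) shows that each slab needs only the piece of x spanning its column extent; on a real machine, everything outside its own rows would be received from a neighbour or two in each direction.

```python
import numpy as np

rng = np.random.default_rng(3)
n, bw, P = 32, 4, 4       # unknowns, half bandwidth, emulated processors

# Banded complex K, as after a bandwidth-minimizing reordering.
K = np.zeros((n, n), dtype=complex)
for i in range(n):
    for j in range(max(0, i - bw), min(n, i + bw + 1)):
        K[i, j] = rng.standard_normal() + 1j * rng.standard_normal()
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

slab = n // P             # rows per emulated processor
y = np.zeros(n, dtype=complex)
halo = []                 # values of x each processor must receive
for p in range(P):
    r0, r1 = p * slab, (p + 1) * slab
    # Column extent of this slab; x-indices outside [r0, r1) are the
    # pieces communicated from the left and right neighbours.
    c0, c1 = max(0, r0 - bw), min(n, r1 + bw)
    halo.append((r0 - c0) + (c1 - r1))
    y[r0:r1] = K[r0:r1, c0:c1] @ x[c0:c1]
```

With the bandwidth (4) smaller than the slab height (8), each emulated processor needs data from at most one neighbour per side, matching the behaviour reported for scaled problems.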
Shown in Figure 9 are plots of time to convergence on different numbers of processors for
five different problems (fixed size problems). The number of unknowns in the finite element mesh
and the number of columns of C are indicated on the plots. The quasi-minimal residual
algorithm was stopped when the normalized residual was reduced three orders of magnitude for
each column of C. With an initial guess being the zero vector, this results in a normalized residual
[Figure 9 plot: "Iterative Solution Scaling for Fixed Size Problems"; total solver time versus number of processors (0 to 256) for meshes of 43,791; 168,489; 271,158; 417,359; and 579,993 edges.]
Figure 9. Time of convergence for five different problems. The time
shown is the total execution time for the solver on different numbers of
processors. The C matrix had 116 columns in each case.
of 0.1%, a value that is sufficient for this scattering problem. Given a fixed communication
percentage and a fixed rate for local work, doubling the number of processors for a given problem
would halve the total solution time. The curves in Figure 9 do not drop linearly at this rate for
increasing numbers of processors, because there is a decrease in the amount of work per processor
while the amount of data communicated increases, causing the curves to level off.
Another factor in the performance of the parallel matrix-vector multiply is the percentage of
communication. This is related to the number of processors to the left and right with which each processor
must communicate. It is clear that running a fixed size problem on an increasing number of
processors will generate a growing amount of communication. The amount of communication is a
function of how finely the K matrix is decomposed, since its maximum row bandwidth after
reordering is not a function of the number of processors used in the decomposition. If the
maximum row bandwidth is m and each processor in a given decomposition has approximately m
rows of K, then most processors will require one processor in each direction for communication.
If the number of processors used for the distribution of K is doubled, each processor will have
approximately m/2 rows of K. Since the row bandwidth doesn't change, each processor will now
require communication in each direction from two processors. But since the number of floating
point operations required hasn't changed, the communication percentage should roughly double.
This can be seen in Figure 10, which shows communication percentage versus number of
processors, for four problem sizes.
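This argument amounts to a small counting model: with fixed bandwidth m, a slab of r rows needs data from about ceil(m/r) neighbours on each side, while its local work shrinks in proportion to r. A few lines make the doubling explicit (the numbers, including the per-row non-zero count, are illustrative only):

```python
import math

def neighbors_each_side(m, rows_per_proc):
    """Processors whose rows overlap the column extent of a slab."""
    return math.ceil(m / rows_per_proc)

m = 64                                   # fixed row bandwidth after reordering
assert neighbors_each_side(m, 64) == 1   # ~m rows per processor: one neighbour
assert neighbors_each_side(m, 32) == 2   # processors doubled: two neighbours

def comm_fraction(m, rows_per_proc, nnz_per_row=9):
    """Communicated words stay ~2*m per processor while local work
    shrinks with the slab, so the communication share grows."""
    work = rows_per_proc * nnz_per_row
    return 2 * m / (2 * m + work)
```

Halving the rows per processor in this model nearly doubles the communication share, which is the levelling-off behaviour seen in the fixed-size curves.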
[Figure 10 plot: communication percentage (roughly 5 to 45%) versus number of processors (16 to 256) for meshes of 43,791; 166,489; 271,158; and 579,993 edges.]
Figure 10. Percentage of communication versus number of processors for the parallel
matrix-vector multiply, for four different size (number of edges) meshes of the
dielectric cylinder.
The row slab decomposition is a simple means for breaking the sparse matrix equation
among the processors, and while the mesh decomposition algorithms outlined above can also be
used, differences between the approaches in time to solution on a parallel computer were found to
be small. Two alternative mesh decomposition schemes have been compared to
the matrix partitioning algorithm, contrasting data load balance, communication load balance, the
total amount of communication, and the performance of the local processor matrix-vector
multiply resulting from the specific decomposition used. The first is an algorithm termed
JOSTLE [55] that uses various optimization methods to equalize the mesh partitions among the
processors. The second is a multi-level graph partitioning scheme termed METIS [42]. Among
the three approaches, no discernible difference was found in data and communication load balance,
or in the performance of the local processor matrix-vector multiply. A difference was found
in the total amount of communication needed in the solution of the sparse system of equations.
When normalizing the total amount of communication in the matrix partition algorithm to 1.0, the
JOSTLE algorithm reduced the communication overhead to 0.26, and the METIS algorithm
reduced it to 0.22. From Figure 10, it is noted that the percentage of communication time in the
complete solver is 8% for scaled-sized problems (those that fit into the minimal number of
processors needed to solve the problem). It is this fraction of the total CPU time that can be
reduced by the 0.26 and 0.22 factors found using the mesh decomposition algorithms; i.e., the
total time to solve the system would be reduced by just over 6% using the METIS algorithm for
mesh decomposition. It was found that the METIS and JOSTLE algorithms did produce less
communication overhead as the fixed size problem was solved on larger numbers of processors,
thereby further reducing total execution time. This savings over the matrix partitioning method is
offset, though, since the overall execution time decreases dramatically, as seen in Figure 9 for a
fixed size problem.
Krylov subspace methods different from the quasi-minimal residual algorithm can be
coupled with a mesh or matrix decomposition method and used for sparse matrix solution. In [47]
the conjugate gradient squared and generalized minimal residual methods are used with geometric
partitioning algorithms and then compared. The Krylov iterative method implementations are
necessarily similar since the dominant component of the solver is the matrix-vector multiply.
Parallel speedups for fixed sized problems are reported in [47] for the conjugate gradient squared
and the generalized minimal residual method. The speedups are very similar to those shown in
Figure 9.
A possible means to substantially shorten the solution time in an iterative solution is the use
of an effective preconditioner. The incomplete Cholesky preconditioners used in sequential
calculations are difficult to implement in a distributed memory parallel environment due to the need
for performing a forward and backward solution with each matrix multiply step in (16). On a
parallel machine, these are essentially sequential operations that can give greatly reduced
performance [53]. A promising alternative is to calculate an approximation to the inverse of the
system, rather than a factorization of the system as is done in the incomplete Cholesky
approximation. A sparse approximate inverse [54] produces a matrix with a controllable number
of non-zeros that approximates the inverse of K; rather than requiring forward and backward
solutions, it multiplies K at each step of the iterative algorithm. This sparse multiply can be
achieved with much higher performance than the forward and backward solutions used in the
incomplete factorization.
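A toy dense version of the sparse-approximate-inverse idea: for each column j of M, minimize ||A m_j - e_j|| over a prescribed small support. The matrix, support pattern, and sizes below are invented, and a real implementation such as [54] works column-by-column on sparse data, but the sketch shows why the approach needs only multiplies, never triangular solves, at iteration time.

```python
import numpy as np

def approx_inverse(A, pattern):
    """For each column j, least-squares fit M[:, j] supported on pattern[j]
    so that A @ M is as close to the identity as the pattern allows."""
    n = A.shape[0]
    M = np.zeros_like(A)
    for j in range(n):
        rows = pattern[j]
        e = np.zeros(n)
        e[j] = 1.0
        mj = np.linalg.lstsq(A[:, rows], e, rcond=None)[0]
        M[rows, j] = mj
    return M

rng = np.random.default_rng(4)
n = 20
A = 5 * np.eye(n) + 0.5 * rng.standard_normal((n, n))
pattern = [sorted({j, (j - 1) % n, (j + 1) % n}) for j in range(n)]  # tridiagonal-ish
M = approx_inverse(A, pattern)

# Applying this preconditioner is just a multiply by M -- no forward or
# backward triangular solves are needed at each iterative step.
D = np.diag(1.0 / np.diag(A))   # Jacobi (diagonal) approximate inverse
```

Because each column's support includes the diagonal, the least-squares fit can only improve on the simple Jacobi approximation, while the cost of applying it stays a sparse multiply.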
5. DISCUSSION
This paper presented an overview of solutions to surface integral equation and volumetric
finite element methods on sequential and distributed memory computer architectures. Both the
sequential algorithmic scalability as well as scalability on parallel computer systems were presented
for current computer technology, with extrapolation to next-generation technologies. A broad set
of references is given. When a uniform resource locator (URL) is also referenced, it points to
software which is freely available.
REFERENCES
[1] E. Miller, "A selective survey of computational electromagnetics," IEEE Trans. Antennas
Propag. AP-36, pp. 1281-1305, 1988.
[2] M. N. O. Sadiku, Numerical Techniques in Electromagnetics, Boca Raton: CRC Press, 1992.
[3] J. J. Dongarra, L. Grandinetti, G. R. Joubert, J. Kowalik, Editors, High Performance
Computing: Technology, Methods and Applications, Advances in Parallel Computing 10,
Amsterdam: Elsevier, 1995.
[4] E. Miller, "Solving bigger problems by decreasing the operation count and increasing the
computation bandwidth," Proc. IEEE, vol. 79, no. 10, pp. 1493-1504, 1991.
[5] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing,
Redwood City: The Benjamin/Cummings Publishing Company, 1994. See also
http://www.cs.umn.edu/~kumar.
[6] T. Cwik and J. Patterson, Editors, Computational Electromagnetics and Supercomputer
Architecture, Progress in Electromagnetics Research, Vol. 7, Cambridge: EMW Publishing, 1993.
[7] R. F. Harrington, Field Computation by Moment Methods. New York: Macmillan, 1968.
[8] A. J. Poggio and E. K. Miller, "Integral equation solutions of three-dimensional scattering
problems," in Computer Techniques for Electromagnetics, R. Mittra, Editor, Chapter 4, New
York: Hemisphere, 1973.
[9] G. Golub and C. Van Loan, Matrix Computations, The Johns Hopkins University Press,
1989.
[10] E. Anderson, Z. Bai, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling,
A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, Society for
Industrial and Applied Mathematics, Philadelphia, PA, 1992. See also
http://netlib2.cs.utk.edu/lapack/index.html.
[11] E. Yip and B. Dembart, "Monostatic calculations for a 3D MOM code with fast multipole
method," Progress in Electromagnetics Research Symposium Proceedings, p. 53, July 24-
28, 1995.
[12] J. Rahola, "Solution of dense systems of linear equations in the discrete-dipole
approximation," SIAM J. Sci. Comput., Vol. 17, No. 1, pp. 78-89, 1996.
[13] M. Sancer, R. McClary, K. Glover, "Electromagnetic computation using parametric
geometry," Electromagnetics, Vol. 10, No. 1-2, pp. 85-104, 1990.
[14] F. X. Canning, "The Impedance Matrix Localization (IML) Method for Moment-Method
Calculations," IEEE Antennas Propag. Mag., pp. 18-30, October 1990.
[15] W. Chew, C. Lu and Y. Wang, "Efficient computation of three-dimensional scattering of
vector electromagnetic waves," J. Opt. Soc. Am. A, Vol. 11, No. 4, pp. 1528-1537, 1994.
[16] V. Rokhlin, "Rapid solution of integral equations of scattering theory in two dimensions," J.
Comp. Physics 86, pp. 414-439, 1990.
[17] R. Coifman, V. Rokhlin and S. Wandzura, "The fast multipole method for the wave equation:
a pedestrian prescription," IEEE Ant. Propag. Mag. 35, pp. 7-12, 1993.
[18] M. A. Stalzer, "A parallel fast multipole method for the Helmholtz equation," Parallel
Processing Letters 5, pp. 263-274, 1995.
[19] S. M. Rao, D. R. Wilton, and A. W. Glisson, "Electromagnetic scattering by surfaces of
arbitrary shape," IEEE Trans. Antennas Propag. AP-30, pp. 409-418, 1982.
[20] W. Johnson, D. R. Wilton, and R. M. Sharpe, "Modeling scattering from and radiation by
arbitrary shaped objects with the electric field integral equation triangular surface patch code,"
Electromagnetics 10, pp. 41-64, 1990.
[21] J. J. Dongarra, J. Bunch, C. Moler, and G. W. Stewart, LINPACK Users' Guide, Society
for Industrial and Applied Mathematics, Philadelphia, PA, 1979. See also
http://netlib2.cs.utk.edu/linpack/index.html.
[22] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, "An extended set of
FORTRAN Basic Linear Algebra Subprograms," ACM Trans. Math. Soft., Vol. 14, pp. 1-
17, 1988. See also http://netlib2.cs.utk.edu/blas/index.html.
[23] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, "A set of Level 3 Basic Linear
Algebra Subprograms," ACM Trans. Math. Soft., Vol. 16, pp. 1-17, 1990. See also
http://netlib2.cs.utk.edu/blas/index.html.
[24] T. Cwik, R. van de Geijn, and J. Patterson, "Application of massively parallel computation to
integral equation models of electromagnetic scattering," J. Opt. Soc. Am. A 11, pp. 1538-1545,
1994. See also ftp://microwave.jpl.nasa.gov/pub/PARALLEL_COMPLEX.SOLVER.
[25] J. Jin, The Finite Element Method in Electromagnetics, John Wiley and Sons, Inc., New York, 1993.
[26] J. Lee and R. Mittra, "A note on the application of edge-elements for modeling 3-
dimensional inhomogeneously-filled cavities," IEEE Transactions on Microwave Theory
and Techniques, vol. 40, no. 9, pp. 1767-1773, Sept. 1992.
[27] S. Pissanetzky, Sparse Matrix Technology, London: Academic Press, 1984.
[28] I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices, New York: Oxford University Press, 1986.
[29] E. Ng and B. Peyton, "Block Sparse Cholesky Algorithms on Advanced Uniprocessor Computers," SIAM J. Sci. Comput., Vol. 14, No. 5, pp. 1034-1056, 1993.
[30] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1994. See also http://netlib2.cs.utk.edu/templates/index.html.
[31] M. Jones and P. Plassmann, "An Improved Incomplete Cholesky Factorization," ACM Trans. Math. Software, Vol. 21, No. 1, pp. 5-17, 1995. See also http://netlib2.cs.utk.edu/toms/740.
[32] D. O'Leary, "Parallel implementation of the block conjugate gradient algorithm," Parallel Comput., Vol. 5, pp. 127-139, 1987.
[33] R. Freund and M. Malhotra, "A block-QMR algorithm for non-Hermitian linear systems with multiple right-hand sides," preprint, 1996.
[34] R. Mittra, O. Ramahi, A. Khebir, R. Gordon, and A. Kouki, "A Review of Absorbing Boundary Conditions for Two- and Three-Dimensional Electromagnetic Scattering Problems," IEEE Trans. Mag., Vol. 25, No. 7, pp. 3034-3040, July 1989.
[35] J.-M. Jin and V. Liepa, "Application of Hybrid Finite Element Method to Electromagnetic Scattering from Coated Cylinders," IEEE Trans. Antennas and Propagation, Vol. AP-36, pp. 50-54, Jan. 1988.
[36] X. Yuan, D. Lynch, and J. Strohbehn, "Coupling of finite element and moment methods for electromagnetic scattering from inhomogeneous objects," IEEE Trans. Antennas Propagat., Vol. AP-38, pp. 386-394, Mar. 1990.
[37] W. Boyse and A. Seidl, "A Hybrid Finite Element Method for Near Bodies of Revolution," IEEE Trans. Mag., Vol. 27, pp. 3833-3836, Sept. 1991.
[38] T. Cwik, C. Zuffada, and V. Jamnejad, "Modeling Three-Dimensional Scatterers Using a Coupled Finite Element-Integral Equation Representation," IEEE Trans. Antennas Propag., Vol. AP-44, pp. 453-459, 1996.
[39] B. Nour-Omid, A. Raefsky, and G. Lyzenga, "Solving Finite Element Equations on Concurrent Computers," American Soc. Mech. Eng., A. Noor, Editor, pp. 291-307, 1986.
[40] A. Pothen, H. Simon and K. Liou, "Partitioning Sparse Matrices with Eigenvectors of Graphs," SIAM J. Matrix Anal. Appl., Vol. 11, pp. 430-452, 1990.
[41] B. Hendrickson and R. Leland, "An Improved Spectral Graph Partitioning Algorithm for Mapping Parallel Computations," SIAM J. Sci. Comput., Vol. 16, pp. 452-469, 1995.
[42] G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," Technical Report TR 95-035, Department of Computer Science, University of Minnesota, 1995. See also http://www.cs.umn.edu/~karypis/metis/metis.html.
[43] M. Heath, E. Ng, and B. Peyton, "Parallel algorithms for sparse linear systems," SIAM Review, Vol. 33, No. 3, pp. 420-460, 1991.
[44] E. Rothberg, "Alternatives for solving sparse triangular systems on distributed-memory multiprocessors," Parallel Comput., Vol. 21, pp. 1121-1136, 1995. See also http://www.ssd.intel.com/appsw/scs.html.
[45] A. Gupta, G. Karypis, and V. Kumar, "Highly scalable parallel algorithms for sparse matrix factorization," Technical Report TR 94-63, Department of Computer Science, University of Minnesota, 1994.
[46] G. Hennigan, W. Dearholt, S. Castillo, "The finite element solution of elliptic systems on a massively parallel computer using a direct solver," preprint.
[47] J. Shadid and R. Tuminaro, "Sparse iterative algorithm software for large-scale MIMD machines: an initial discussion and implementation," Concurrency: Practice and Experience, Vol. 4, pp. 481-497, 1992. See also http://www.cs.sandia.gov/HPCCIT/aztec.html.
[48] M. Jones and P. Plassmann, "Scalable iterative solutions of sparse linear systems," Parallel Comput., Vol. 20, pp. 753-773, 1994. See also http://www.mcs.anl.gov/home/freitag/SC94demo/software/blocksolve.html.
[49] T. Cwik, D. Katz, C. Zuffada, and V. Jamnejad, "The Application of Scalable Distributed Memory Computers to the Finite Element Modeling of Electromagnetic Scattering," submitted to Intl. J. Numer. Methods Engr., 1996.
[50] A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, New Jersey, 1981.
[51] J. Lewis, "Implementation of the Gibbs-Poole-Stockmeyer and Gibbs-King Algorithms," ACM Trans. on Math. Software, Vol. 8, pp. 180-189, 1982. See also http://netlib2.cs.utk.edu/toms/582.
[52] R. Freund, "Conjugate Gradient-Type Methods for Linear Systems with Complex Symmetric Coefficient Matrices," SIAM J. Sci. Stat. Comput., Vol. 13, No. 1, pp. 425-448, Jan. 1992. See also http://netlib2.cs.utk.edu/linalg/lalqmr.
[53] E. Rothberg and A. Gupta, "Parallel ICCG on a hierarchical memory multiprocessor: addressing the triangular solve bottleneck," Parallel Comput., Vol. 18, pp. 719-741, 1992.
[54] M. J. Grote and T. Huckle, "Parallel Preconditioning with Sparse Approximate Inverses," accepted for publication in SIAM J. of Scientific Computing, 1996. See also http://www-sccm.stanford.edu/Students/grote.html.
[55] C. H. Walshaw, M. Cross, and M. G. Everett, "A localized algorithm for optimizing unstructured mesh partitions," International Journal of Supercomputer Applications and High Performance Computing, Vol. 9, No. 4, pp. 280-295, 1995.
[Figure 1 plot: "Performance of Matrix Solution Algorithm"; time vs. number of processors on the Cray T3D; remaining plot data not recoverable from scan.]
Figure 1. Performance for BSOLVE (includes factoring matrix, estimating condition number, and solving for one right-hand side).
[Figure 2 plot: "Time for LU Factorization"; CPU time vs. number of unknowns, 64-bit complex arithmetic with partial pivoting; curves for T3D, Paragon, T90, C90, and Delta.]
Figure 2. CPU time vs. number of unknowns for 64-bit complex dense direct solvers with partial pivoting for various machines (Cray T3D, T90 and C90; Intel Paragon and Delta). Processing elements (PEs) used in the calculation are shown for the parallel computers. In-core and out-of-core (OOC) results are specified.
[Figure 3 plot: "PATCH Scaling for Scaled Size Problems"; time vs. matrix dimension (x 1000); remaining plot data not recoverable from scan.]
Figure 3. Scalability for PATCH code for scaled size problems.
[Figure 4 plot: "PATCH Scaling for Fixed Size Problems"; time vs. number of processors on the T3D for several problem sizes.]
Figure 4. Scalability for PATCH code for fixed size problems. The 16,200 unknown case was only run on 256 processors.
[Figure 5: matrix sparsity-pattern images omitted.]
Figure 5. Non-zero matrix structure of typical finite element simulation. Left, original structure; right, structure after reordering to minimize bandwidth.
[Figure 6: matrix sparsity-pattern images omitted.]
Figure 6. Non-zero matrix sparsity structure for system with dense surface integral equation boundary condition applied (left), and surface of revolution integral equation boundary condition (right). The mesh has 5,343 edges with 936 of those on the boundary.
[Figure 7 plot: "Scaling of Sparse Slice Algorithm"; stacked computation times (READ, CONNECT, FEM, COUPLING, WRITE) vs. number of processors (8 to 256), with a single-processor T90 column.]
Figure 7. Computation time and scaling for a relatively small simulation (dielectric cylinder with 43,791 edges, radius = 1 cm, height = 10 cm, permittivity = 4.0 at 5.0 GHz). First column shows time for single processor T90. Times on T90 for CONNECT and FEM have been combined.
[Figure 8 diagram: local processor rows and columns, with communication to the processors to the left and right.]
Figure 8. Local sparse matrix-dense vector multiply graphically displayed.
[Figure 9 plot: "Iterative Solution Scaling for Fixed Size Problems"; execution time vs. number of processors (to 256) for five mesh sizes: 43,791; 166,489; 271,158; 417,359; and 579,993 edges.]
Figure 9. Time of convergence for five different problems. The time shown is the total execution time for the solver on different numbers of processors. The C matrix had 116 columns in each case.
[Figure 10 plot: percentage of communication vs. number of processors (16 to 256) for four mesh sizes: 43,791; 166,489; 271,158; and 579,993 edges.]
Figure 10. Percentage of communication versus number of processors for parallel matrix-vector multiply, for four different size (number of edges) meshes of dielectric cylinder.