Fast optimal load balancing algorithms for 1D partitioning

J. Parallel Distrib. Comput. 64 (2004) 974–996

Ali Pınar a,1 and Cevdet Aykanat b,*

a Computational Research Division, Lawrence Berkeley National Laboratory, USA
b Department of Computer Engineering, Bilkent University, Ankara 06533, Turkey

Received 30 March 2000; revised 5 May 2004

This work was partially supported by The Scientific and Technical Research Council of Turkey under grant EEEAG-103E028.
* Corresponding author. Fax: +90-312-2664047. E-mail addresses: [email protected] (A. Pınar), [email protected] (C. Aykanat).
1 Supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the US Department of Energy under contract DE-AC03-76SF00098. One Cyclotron Road MS 50F, Berkeley, CA 94720.

0743-7315/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2004.05.003

Abstract

The one-dimensional decomposition of nonuniform workload arrays with optimal load balancing is investigated. The problem has been studied in the literature as the "chains-on-chains partitioning" problem. Despite the rich literature on exact algorithms, heuristics are still used in the parallel computing community with the "hope" of good decompositions and the "myth" that exact algorithms are hard to implement and not runtime efficient. We show that exact algorithms yield significant improvements in load balance over heuristics with negligible overhead. Detailed pseudocodes of the proposed algorithms are provided for reproducibility. We start with a literature review and propose improvements and efficient implementation tips for these algorithms. We also introduce novel algorithms that are asymptotically and runtime efficient. Our experiments on sparse matrix and direct volume rendering datasets verify that balance can be significantly improved by using exact algorithms. The proposed exact algorithms are 100 times faster than a single sparse-matrix vector multiplication for 64-way decompositions on average. We conclude that exact algorithms with the proposed efficient implementations can effectively replace heuristics.

© 2004 Elsevier Inc. All rights reserved.

Keywords: One-dimensional partitioning; Optimal load balancing; Chains-on-chains partitioning; Dynamic programming; Iterative refinement; Parametric search; Parallel sparse matrix vector multiplication; Image-space parallel volume rendering

1. Introduction

This article investigates block partitioning of possibly multi-dimensional nonuniform domains over one-dimensional (1D) workload arrays. Communication and synchronization overhead is assumed to be handled implicitly by proper selection of ordering and parallel computation schemes at the beginning, so that load balance is the only metric explicitly considered for decomposition. The load balancing problem in the partitioning can be modeled as the chains-on-chains partitioning (CCP) problem with nonnegative task weights and unweighted edges between successive tasks. The objective of the CCP problem is to find a sequence of P-1 separators to divide a chain of N tasks with associated computational weights into P consecutive parts so that the bottleneck value, i.e., the maximum load among all processors, is minimized.

The first polynomial-time algorithm for the CCP problem was proposed by Bokhari [4]. Bokhari's O(N^3 P)-time algorithm is based on finding a minimum path on a layered graph. Nicol and O'Hallaron [28] reduced the complexity to O(N^2 P) by decreasing the number of edges in the layered graph. Algorithm paradigms used in subsequent studies can be classified as dynamic programming (DP), iterative refinement, and parametric search. Anily and Federgruen [1] initiated the DP approach with an O(N^2 P)-time algorithm. Hansen and Lih [13] independently proposed an O(N^2 P)-time algorithm. Choi and Narahari [6], and Olstad and Manne [30], introduced asymptotically faster O(NP)-time and O((N-P)P)-time DP-based algorithms, respectively. The iterative refinement approach starts with a partition and iteratively tries to improve the solution. The O((N-P)P log P)-time



algorithm proposed by Manne and Sørevik [23] falls into this class. The parametric-search approach relies on repeatedly probing for a partition with a bottleneck value no greater than a given value. The complexity of probing is Θ(N), since each task has to be examined, but it can be reduced to O(P log N) through binary search after an initial Θ(N)-time prefix-sum operation on the task chain [18]. Later, the complexity was reduced to O(P log(N/P)) by Han et al. [12].

The parametric-search approach goes back to Iqbal's work [16], describing an ε-approximate algorithm which performs O(log(Wtot/ε)) probe calls. Here, Wtot denotes the total task weight and ε > 0 denotes the desired accuracy. Iqbal's algorithm exploits the observation that the bottleneck value is in the range [Wtot/P, Wtot], and performs binary search in this range by making O(log(Wtot/ε)) probes. This work was followed by several exact algorithms involving efficient schemes for searching over bottleneck values by considering only subchain weights. Nicol and O'Hallaron [28,29] proposed a search scheme that requires at most 4N probes. Iqbal and Bokhari [17] relaxed the restriction of this algorithm on bounded task weight and communication cost by proposing a condensation algorithm. Iqbal [15] and Nicol [27,29] concurrently proposed an efficient search scheme that finds an optimal partition after only O(P log N) probes. Asymptotically more efficient algorithms were proposed by Frederickson [7,8] and Han et al. [12]. Frederickson proposed an O(N)-time optimal algorithm; Han et al. proposed a recursive algorithm with complexity O(N + P^(1+ε)) for any small ε > 0. These two studies focused on asymptotic complexity, disregarding practice.

Despite these efforts, heuristics are still commonly used in the parallel computing community, and the design of efficient heuristics is still an active area of research [24]. The reasons for preferring heuristics are ease of implementation, efficiency, the expectation that heuristics yield good partitions, and the misconception that exact algorithms are not affordable as a preprocessing step for efficient parallelization. By contrast, our work proposes efficient exact CCP algorithms. Implementation details and pseudocodes for the proposed algorithms are presented for clarity and reproducibility. We also demonstrate, through experiments on a wide range of real-world problems, that the qualities of the decompositions obtained through heuristics differ substantially from those of optimal ones.

For the runtime efficiency of our algorithms, we use an effective heuristic as a preprocessing step to find a good upper bound on the optimal bottleneck value. Then we exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme is exploited in a static manner in the DP algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme is proposed for parametric-search algorithms, narrowing separator-index ranges after each probe. The upper bound on the optimal bottleneck value is also exploited to find a much better initial partition for the iterative-refinement algorithm proposed by Manne and Sørevik [23]. We also propose a different iterative-refinement technique, which is very fast for small-to-medium numbers of processors. The observations behind this algorithm are further used to incorporate the subchain-weight concept into Iqbal's [16] approximate bisection algorithm to make it an exact algorithm.

Two applications are investigated in our experiments: sparse matrix-vector multiplication (SpMxV), which is one of the most important kernels in scientific computing, and image-space parallel volume rendering, which is widely used for scientific visualization. The integer- and real-valued workload arrays arising in these two applications are their distinctive features. Furthermore, SpMxV, a fine-grain application, demonstrates the feasibility of using optimal load balancing algorithms even in sparse-matrix decomposition. Experiments with the proposed CCP algorithms on a wide range of sparse test matrices show that 64-way decompositions can be computed 100 times faster than a single SpMxV time, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on a volume rendering dataset show that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11 percent slower on average.

The remainder of this article is organized as follows. Table 1 displays the notation used in the paper. Section 2 defines the CCP problem. A survey of existing CCP algorithms is presented in Section 3. The proposed CCP algorithms are discussed in Section 4. Load-balancing applications used in our experiments are described in Section 5, and performance results are discussed in Section 6.

2. Chains-on-chains partitioning (CCP) problem

In the CCP problem, a computational problem decomposed into a chain T = ⟨t1, t2, ..., tN⟩ of N tasks (modules) with associated positive computational weights W = ⟨w1, w2, ..., wN⟩ is to be mapped onto a chain P = ⟨P1, P2, ..., PP⟩ of P homogeneous processors. It is worth noting that there are no precedence constraints among the tasks in the chain. A subchain of T is defined as a subset of contiguous tasks, and the subchain consisting of tasks ⟨ti, t(i+1), ..., tj⟩ is denoted as T_{i,j}. The computational load W_{i,j} of subchain T_{i,j} is W_{i,j} = Σ_{h=i..j} wh. From the contiguity constraint, a partition Π should map contiguous


Table 1
Summary of important abbreviations and symbols

Notation   Explanation
N          number of tasks
P          number of processors
P          processor chain
Pi         ith processor in the processor chain
T          task chain, i.e., T = ⟨t1, t2, ..., tN⟩
ti         ith task in the task chain
T_{i,j}    subchain of tasks starting from ti up to tj, i.e., T_{i,j} = ⟨ti, t(i+1), ..., tj⟩
T_i^p      subproblem of p-way partitioning of the first i tasks in the task chain T
wi         computational load of task ti
wmax       maximum computational load among all tasks
wavg       average computational load of all tasks
W_{i,j}    total computational load of task subchain T_{i,j}
Wtot       total computational load
W[i]       total weight of the first i tasks
Π_i^p      partition of the first i tasks in the task chain onto the first p processors in the processor chain
Lp         load of the pth processor in a partition
UB         upper bound on the value of an optimal solution
LB         lower bound on the value of an optimal solution
B*         ideal bottleneck value, achieved when all processors have equal load
B_i^p      optimal solution value for p-way partitioning of the first i tasks
sp         index of the last task assigned to the pth processor
SLp        lowest position for the pth separator index in an optimal solution
SHp        highest position for the pth separator index in an optimal solution


subchains to contiguous processors. Hence, a P-way chain-partition Π_N^P of a task chain T with N tasks onto a processor chain P with P processors is described by a sequence Π_N^P = ⟨s0, s1, s2, ..., sP⟩ of P+1 separator indices, where s0 = 0 ≤ s1 ≤ ... ≤ sP = N. Here, sp denotes the index of the last task of the pth part, so that processor Pp gets the subchain T_{s(p-1)+1, sp} with load Lp = W_{s(p-1)+1, sp}. The cost C(Π) of a partition Π is determined by the maximum processor execution time among all processors, i.e., C(Π) = B = max_{1≤p≤P} {Lp}. This B value of a partition is called its bottleneck value, and the processor (part) defining it is called the bottleneck processor (part). The CCP problem can be defined as finding a mapping Π_opt that minimizes the bottleneck value B_opt = C(Π_opt).
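To make the definitions concrete, the following small sketch (helper names are ours, not from the paper) evaluates the bottleneck value C(Π) = max_p Lp of a partition given by its separator indices:

```python
def partition_cost(w, seps):
    """Bottleneck value C(Pi) and per-processor loads for a partition of the
    task-weight chain `w`, where seps = [s0=0, s1, ..., sP=len(w)] and
    processor p gets tasks w[s_{p-1}:s_p]."""
    loads = [sum(w[seps[p - 1]:seps[p]]) for p in range(1, len(seps))]
    return max(loads), loads

# Example: N = 6 tasks onto P = 3 processors.
w = [2, 1, 3, 2, 2, 4]
B, loads = partition_cost(w, [0, 3, 5, 6])  # parts {t1..t3}, {t4,t5}, {t6}
# loads = [6, 4, 4], bottleneck B = 6
```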

3. Previous work

Each CCP algorithm discussed in this section and Section 4 involves an initial prefix-sum operation on the task-weight array W to enhance the efficiency of subsequent subchain-weight computations. The prefix-sum operation replaces the ith entry W[i] with the sum of the first i entries (Σ_{h=1..i} wh), so that the computational load W_{i,j} of a subchain T_{i,j} can be efficiently determined as W[j] - W[i-1] in O(1) time. In our discussions, W is used to refer to the prefix-summed W-array, and the Θ(N) cost of this initial prefix-sum operation is included in the complexity analysis. The presentations focus only on finding the bottleneck value B_opt, because a corresponding optimal mapping can be constructed easily by making a PROBE(B_opt) call, as discussed in Section 3.4.
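The prefix-sum preprocessing and the resulting O(1) subchain-weight lookup can be sketched as follows (a minimal illustration; function names are ours):

```python
from itertools import accumulate

def prefix_sum(w):
    """Return W with W[i] = w_1 + ... + w_i (and W[0] = 0), so that the load
    W_{i,j} of subchain T_{i,j} is W[j] - W[i-1], an O(1) lookup."""
    return [0] + list(accumulate(w))

def subchain_weight(W, i, j):
    # Load of tasks t_i..t_j (1-based, inclusive) from the prefix-summed array.
    return W[j] - W[i - 1]

w = [2, 1, 3, 2, 2, 4]
W = prefix_sum(w)            # [0, 2, 3, 6, 8, 10, 14]
# W_{2,5} = w2 + w3 + w4 + w5 = 1 + 3 + 2 + 2 = 8
```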

3.1. Heuristics

The most commonly used partitioning heuristic is based on recursive bisection (RB). RB achieves P-way partitioning through log P bisection levels, where P is a power of 2. At each bisection step in a level, the current chain is divided evenly into two subchains. Although an optimal division can be easily achieved at every bisection step, the sequence of optimal bisections may lead to poor load balancing. RB can be efficiently implemented in O(N + P log N) time by first performing a prefix-sum operation on the workload array W, with complexity O(N), and then making P-1 binary searches in the prefix-summed W-array, each with complexity O(log N).

Miguet and Pierson [24] proposed two other heuristics. The first heuristic (H1) computes the separator values such that sp is the largest index for which W_{1,sp} ≤ pB*, where B* = Wtot/P is the ideal bottleneck value and Wtot = Σ_{i=1..N} wi denotes the sum of all task weights. The second heuristic (H2) further refines the separator indices by incrementing each sp value found in H1 if (W_{1,sp+1} - pB*) < (pB* - W_{1,sp}). These two heuristics can also be implemented in O(N + P log N) time by performing P-1 binary searches in the prefix-summed W-array. Miguet and Pierson [24] have already proved the upper bounds on the bottleneck values of the partitions found by H1 and H2 as B_H1, B_H2 < B* + wmax, where wmax = max_{1≤i≤N} {wi} denotes the maximum task weight. The following lemma establishes a similar bound for the RB heuristic.

Lemma 3.1. Let Π_RB = ⟨s0, s1, ..., sP⟩ be an RB solution to a CCP problem (W, N, P). Then B_RB = C(Π_RB) satisfies B_RB ≤ B* + wmax(P-1)/P.

Proof. Consider the first bisection step. There exists a pivot index 1 ≤ i1 ≤ N such that both sides weigh less than Wtot/2 without the i1th task, and more than Wtot/2 with it. That is,

W_{1,i1-1}, W_{i1+1,N} ≤ Wtot/2 ≤ W_{1,i1}, W_{i1,N}.

The worst case for RB occurs when w_{i1} = wmax and W_{1,i1-1} = W_{i1+1,N} = (Wtot - wmax)/2. Without loss of generality, assume that t_{i1} is assigned to the left part, so that s_{P/2} = i1 and W_{1,s(P/2)} = Wtot/2 + wmax/2. In a similar worst-case bisection of T_{1,s(P/2)}, there exists an index i2 such that w_{i2} = wmax and W_{1,i2-1} = W_{i2+1,s(P/2)} = (Wtot - wmax)/4, and t_{i2} is assigned to the left part, so that s_{P/4} = i2 and W_{1,s(P/4)} = (Wtot - wmax)/4 + wmax = Wtot/4 + (3/4)wmax. For a sequence of log P such worst-case bisection steps on the left parts, processor P1 will be the bottleneck processor, with load B_RB = W_{1,s1} = Wtot/P + wmax(P-1)/P. □

Fig. 1. The O((N-P)P)-time dynamic-programming algorithm proposed by Choi and Narahari [6] and Olstad and Manne [30].

3.2. Dynamic programming

The overlapping subproblem space can be defined as T_i^p, for p = 1, 2, ..., P and i = p, p+1, ..., N-P+p, where T_i^p denotes a p-way CCP of the prefix task-subchain T_{1,i} = ⟨t1, t2, ..., ti⟩ onto the prefix processor-subchain P_{1,p} = ⟨P1, P2, ..., Pp⟩. Notice that index i is restricted to the range p ≤ i ≤ N-P+p because there is no merit in leaving a processor empty. From this subproblem-space definition, the optimal substructure property of the CCP problem can be shown by considering an optimal mapping Π_i^p = ⟨s0, s1, ..., sp = i⟩ with bottleneck value B_i^p for the CCP subproblem T_i^p. If the last processor is not the bottleneck processor in Π_i^p, then Π_{s(p-1)}^{p-1} = ⟨s0, s1, ..., s(p-1)⟩ should be an optimal mapping for the subproblem T_{s(p-1)}^{p-1}. Hence, the recursive definition for the bottleneck value of an optimal mapping is

B_i^p = min_{p-1 ≤ j < i} { max { B_j^{p-1}, W_{j+1,i} } }.   (1)

In (1), searching for index j corresponds to searching for separator s(p-1), so that the remaining subchain T_{j+1,i} is assigned to the last processor Pp in an optimal mapping Π_i^p of T_i^p. The bottleneck value B_N^P of an optimal mapping can be computed using (1) in a bottom-up fashion, starting from B_i^1 = W_{1,i} for i = 1, 2, ..., N. An initial prefix-sum on the workload array W enables constant-time computation of subchain weights of the form W_{j+1,i} through W_{j+1,i} = W[i] - W[j]. Computing B_i^p using (1) takes O(N-p) time for each i and p, and thus the algorithm takes O((N-P)^2 P) time, since the number of distinct subproblems is equal to (N-P+1)P.

Choi and Narahari [6], and Olstad and Manne [30], reduced the complexity of this scheme to O(NP) and O((N-P)P), respectively, by exploiting the following observations, which hold for positive task weights. For a fixed p in (1), the minimum index value j_i^p defining B_i^p cannot occur at a value less than the minimum index value j_{i-1}^p defining B_{i-1}^p; i.e., j_{i-1}^p ≤ j_i^p ≤ (i-1). Hence, the search for the optimal j_i^p can start from j_{i-1}^p. In (1), B_j^{p-1} for a fixed p is a nondecreasing function of j, and W_{j+1,i} for a fixed i is a decreasing function of j, reducing to 0 at j = i. Thus, two cases occur in the semi-closed interval [j_{i-1}^p, i) for j. If W_{j+1,i} > B_j^{p-1} initially, then these two functions intersect in [j_{i-1}^p, i). In this case, the search for j_i^p continues until W_{j*+1,i} ≤ B_{j*}^{p-1}, and then only j* and j*-1 are considered for setting j_i^p, with j_i^p = j* if B_{j*}^{p-1} ≤ W_{j*,i} and j_i^p = j*-1 otherwise. Note that this scheme automatically detects j_i^p = i-1 if W_{j+1,i} and B_j^{p-1} intersect in the open interval (i-1, i). However, if W_{j+1,i} ≤ B_j^{p-1} initially, then B_j^{p-1} lies above W_{j+1,i} in the closed interval [j_{i-1}^p, i]. In this case, the minimum occurs at the first value of j; i.e., j_i^p = j_{i-1}^p. These improvements lead to an O((N-P)P)-time algorithm, since the computation of all B_i^p values for a fixed p makes O(N-P) references to already computed B_j^{p-1} values.

Fig. 1 displays a run-time efficient implementation of this O((N-P)P)-time DP algorithm, which avoids the explicit min-max operation required in (1). In Fig. 1, B_i^p values are stored in a table whose entries are computed in row-major order.
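A direct evaluation of recurrence (1) can be sketched as follows. For clarity this uses the plain O(N^2 P) scan over j rather than the pointer-based O((N-P)P) refinement described above (function name ours):

```python
def ccp_dp(w, P):
    """Plain bottom-up evaluation of recurrence (1):
    B[p][i] = min_{p-1 <= j < i} max(B[p-1][j], W_{j+1,i}).
    This is the simple O(N^2 P) form; starting the j-search at j_{i-1}^p,
    as described in the text, brings it down to O((N-P)P)."""
    N = len(w)
    W = [0]
    for x in w:
        W.append(W[-1] + x)                 # prefix sums for O(1) W_{j+1,i}
    INF = float("inf")
    B = [[INF] * (N + 1) for _ in range(P + 1)]
    for i in range(1, N + 1):
        B[1][i] = W[i]                      # one processor takes t_1..t_i
    for p in range(2, P + 1):
        for i in range(p, N - P + p + 1):   # no processor is left empty
            best = INF
            for j in range(p - 1, i):
                cand = max(B[p - 1][j], W[i] - W[j])
                if cand < best:
                    best = cand
            B[p][i] = best
    return B[P][N]                          # optimal bottleneck value B_opt
```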

3.3. Iterative refinement

The algorithm proposed by Manne and Sørevik [23], referred to here as the MS algorithm, finds a sequence of nonoptimal partitions such that there is only one way each partition can be improved. For this purpose, they introduce the leftist partition (LP). Consider a partition Π such that Pp is the leftmost processor containing at least two tasks. Π is defined as an LP if increasing the load of any processor Pc that lies to the right of Pp, by augmenting the last task of P(c-1) to Pc, makes Pc a bottleneck processor with load ≥ C(Π). Let Π be an LP with bottleneck processor Pb and bottleneck value B. If Pb contains only one task, then Π is optimal. On the other hand, assume that Pb contains at least two tasks. The refinement step, which is shown by the inner while-loop in Fig. 2, tries to find a new LP of lower cost by successively removing the first task of Pp and augmenting it to P(p-1), for p = b, b-1, ..., until Lp < B. Refinement fails when the while-loop proceeds until p = 1 with Lp ≥ B. Manne and Sørevik proved that a successful refinement of an LP gives a new LP, and that the LP is optimal if the refinement fails. They proposed using an initial LP in which the P-1 leftmost processors each has only one task and the last processor contains the remaining tasks. The MS algorithm moves each separator index at most N-P times, so that the total number of separator-index moves is O(P(N-P)). A max-heap is maintained for the processor loads to find a bottleneck processor at the beginning of each repeat-loop iteration. The cost of each separator-index move is O(log P), since it necessitates one decrease-key and one increase-key operation. Thus, the complexity of the MS algorithm is O(P(N-P) log P).
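The refinement loop above can be sketched as follows. This is a simplified rendering of the MS scheme, not the paper's pseudocode: loads are rescanned each round with an O(P) pass instead of the max-heap, and helper names are ours.

```python
def ms_refine(w, P):
    """Simplified sketch of the Manne-Sorevik leftist-partition refinement.
    seps[p] is the index of the last task of processor p (seps[0] = 0,
    seps[P] = N).  Starts from the leftist partition in which the P-1
    leftmost processors hold one task each; each round pulls first tasks
    leftward, starting at the bottleneck, until some load drops below B."""
    N = len(w)
    W = [0]
    for x in w:
        W.append(W[-1] + x)

    def part_load(s, p):                 # load of processor p under separators s
        return W[s[p]] - W[s[p - 1]]

    seps = list(range(P)) + [N]          # initial leftist partition
    while True:
        loads = [part_load(seps, p) for p in range(1, P + 1)]
        B = max(loads)
        b = loads.index(B) + 1           # leftmost bottleneck processor
        if seps[b] - seps[b - 1] <= 1:
            return B                     # bottleneck holds one task: optimal
        trial, p, improved = seps[:], b, False
        while p > 1:
            trial[p - 1] += 1            # first task of P_p moves to P_{p-1}
            p -= 1
            if part_load(trial, p) < B:
                improved = True          # successful refinement: new LP
                break
        if not improved:
            return B                     # refinement failed: B is optimal
        seps = trial
```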

3.4. Parametric search

Fig. 2. Iterative refinement algorithm proposed by Manne and Sørevik [23].

The parametric-search approach relies on repeatedly probing for a partition Π with a bottleneck value no greater than a given B value. Probe algorithms exploit the greedy-choice property for the existence and construction of Π. The greedy choice here is to minimize the remaining work after loading processor Pp, subject to Lp ≤ B, for p = 1, ..., P-1 in order. The PROBE(B) functions given in Fig. 3 exploit this greedy property as follows. PROBE finds the largest index s1 so that W_{1,s1} ≤ B, and assigns subchain T_{1,s1} to processor P1 with load L1 = W_{1,s1}. Hence, the first task in the second processor is t_{s1+1}. PROBE then similarly finds the largest index s2 so that W_{s1+1,s2} ≤ B, and assigns the subchain T_{s1+1,s2} to processor P2. This process continues until either all tasks are assigned or all processors are exhausted. The former and latter cases indicate feasibility and infeasibility of B, respectively.

Fig. 3. (a) Standard probe algorithm with O(P log N) complexity. (b) The O(P log(N/P))-time probe algorithm proposed by Han et al. [12].

Fig. 3(a) illustrates the standard probe algorithm. The indices s1, s2, ..., s(P-1) are efficiently found through binary search (BINSRCH) on the prefix-summed W-array. In this figure, BINSRCH(W, i, N, Bsum) searches W in the index range [i, N] to compute the index i ≤ j ≤ N such that W[j] ≤ Bsum and W[j+1] > Bsum. The complexity of the standard probe algorithm is O(P log N). Han et al. [12] proposed an O(P log(N/P))-time probe algorithm (see Fig. 3(b)), exploiting the P repeated binary searches on the same W-array with increasing search values. Their algorithm divides the chain into P subchains of equal length. At each probe, a linear search is performed on the weights of the last tasks of these P subchains to find out in which subchain the search value could be, and then binary search is performed on the respective subchain of length N/P. Note that since the probe search values always increase, the linear search can be performed incrementally; that is, the search continues to the right from the last subchain that was searched, with O(P) total cost. This gives a total cost of O(P log(N/P)) for the P binary searches, and thus for the probe function.
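The standard O(P log N) probe of Fig. 3(a) can be sketched as follows (our rendering; `bisect_right` plays the role of BINSRCH on the prefix-summed array):

```python
import bisect

def probe(W, P, B):
    """Greedy feasibility test (standard O(P log N) probe, cf. Fig. 3(a)).
    W is the prefix-summed weight array (W[0] = 0); returns True iff the
    chain can be split into at most P consecutive parts of load <= B."""
    N = len(W) - 1
    s, Bsum = 0, B
    for _ in range(P):
        # largest j with W[j] <= Bsum (binary search on prefix sums)
        s = bisect.bisect_right(W, Bsum, s) - 1
        if s == N:
            return True                  # all tasks assigned: B is feasible
        Bsum = W[s] + B                  # next processor fills up to W[s] + B
    return False                         # processors exhausted before tasks

W = [0, 2, 3, 6, 8, 10, 14]              # prefix sums of [2, 1, 3, 2, 2, 4]
# probe(W, 3, 6) is True; probe(W, 3, 5) is False
```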


3.4.1. Bisection as an approximation algorithm

Let f(B) be the binary-valued function where f(B) = 1 if PROBE(B) is true and f(B) = 0 if PROBE(B) is false. Clearly, f(B) is nondecreasing in B, and B_opt lies between LB = B* = Wtot/P and UB = Wtot. These observations are exploited in the bisection algorithm, leading to an efficient ε-approximate algorithm, where ε is the desired precision. The interval [Wtot/P, Wtot] is conceptually discretized into (Wtot - Wtot/P)/ε bottleneck values, and binary search is used in this range to find the minimum feasible bottleneck value B_opt. The bisection algorithm, as illustrated in Fig. 4, performs O(log(Wtot/ε)) PROBE calls, and each PROBE call costs O(P log(N/P)). Hence, the bisection algorithm runs in O(N + P log(N/P) log(Wtot/ε)) time, where the O(N) cost comes from the initial prefix-sum operation on W. The performance of this algorithm deteriorates when log(Wtot/ε) is comparable with N.
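The ε-approximate bisection can be sketched as follows (our rendering with the standard probe inlined; names are ours):

```python
import bisect

def bisection_partition(w, P, eps=1e-9):
    """Epsilon-approximate bisection over bottleneck values (cf. Fig. 4):
    binary search on B in [Wtot/P, Wtot], probing feasibility at each step."""
    W = [0]
    for x in w:
        W.append(W[-1] + x)
    N = len(w)

    def probe(B):                        # greedy feasibility test
        s, limit = 0, B
        for _ in range(P):
            s = bisect.bisect_right(W, limit, s) - 1
            if s == N:
                return True
            limit = W[s] + B
        return False

    lo, hi = W[N] / P, float(W[N])       # Bopt lies in [Wtot/P, Wtot]
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(mid):
            hi = mid                     # feasible: tighten the upper bound
        else:
            lo = mid                     # infeasible: raise the lower bound
    return hi                            # feasible value within eps of Bopt
```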

3.4.2. Nicol's algorithm

Nicol's algorithm [27] exploits the fact that any candidate B value is equal to the weight W_{i,j} of some subchain. A naive solution is to generate all subchain weights of the form W_{i,j}, sort them, and then use binary search to find the minimum W_{a,b} value for which PROBE(W_{a,b}) = TRUE. Nicol's algorithm efficiently searches for the earliest range W_{a,b} for which B_opt = W_{a,b}, by considering each processor in order as a candidate bottleneck processor in an optimal mapping. Let Π_opt be the optimal mapping constructed by greedy PROBE(B_opt), and let processor Pb be the first bottleneck processor, with load B_opt, in Π_opt = ⟨s0, s1, ..., sb, ..., sP⟩. Under these assumptions, this greedy construction of Π_opt ensures that each processor Pp preceding Pb is loaded as much as possible, with Lp < B_opt, for p = 1, 2, ..., b-1 in Π_opt. Here, PROBE(Lp) = FALSE since Lp < B_opt, and PROBE(Lp + w_{sp+1}) = TRUE since adding one more task to processor Pp increases its load to Lp + w_{sp+1} ≥ B_opt. Hence, if b = 1, then s1 is equal to the smallest index i1 such that PROBE(W_{1,i1}) = TRUE, and B_opt = B1 = W_{1,s1}. However, if b > 1, then because of the greedy-choice property, P1 should be loaded as much as possible without exceeding B_opt = Bb < B1, which implies that s1 = i1 - 1 and hence L1 = W_{1,i1-1}. If b = 2, then s2 is equal to the smallest index i2 such that PROBE(W_{i1,i2}) = TRUE, and B_opt = B2 = W_{i1,i2}. If b > 2, then s2 = i2 - 1. We iterate for b = 1, 2, ..., P-1, computing ib as the smallest index for which PROBE(W_{i(b-1),ib}) = TRUE, and save Bb = W_{i(b-1),ib}, with iP = N. Finally, the optimal bottleneck value is selected as B_opt = min_{1≤b≤P} Bb.

Fig. 4. Bisection as an ε-approximation algorithm.

Fig. 5 illustrates Nicol's algorithm. As seen in this figure, given i(b-1), ib is found by performing a binary search over all subchain weights of the form W_{i(b-1),j}, for i(b-1) ≤ j ≤ N, in the bth iteration of the for-loop. Hence, Nicol's algorithm performs O(log N) PROBE calls to find ib at iteration b, and each probe call costs O(P log(N/P)). Thus, the cost of computing an individual Bb value is O(P log N log(N/P)). Since P-1 such Bb values are computed, the overall complexity of Nicol's algorithm is O(N + P^2 log N log(N/P)), where the O(N) cost comes from the initial prefix-sum operation on W.

Two possible implementations of Nicol's algorithm are presented in Fig. 5. Fig. 5(a) illustrates a straightforward implementation, whereas Fig. 5(b) illustrates a careful implementation, which maintains and uses the information from previous probes to answer queries without calling the PROBE function. As seen in Fig. 5(b), this information is efficiently maintained as an undetermined bottleneck-value range (LB, UB), which is dynamically refined after each probe. Any bottleneck value encountered outside the current range is immediately accepted or rejected. Although this simple scheme does not improve the asymptotic complexity of the algorithm, it drastically reduces the number of probes, as discussed in Section 6.
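The search just described, including the dynamic (LB, UB) bounding of Fig. 5(b), can be sketched as follows (our rendering, not the paper's pseudocode; names are ours):

```python
import bisect

def nicol(w, P):
    """Sketch of Nicol's parametric search: for b = 1..P-1, find the smallest
    i_b with PROBE(W_{i_{b-1}, i_b}) true via binary search over candidate
    subchain weights; B_opt is the smallest saved candidate.  The dynamic
    (LB, UB) range short-circuits probe calls, as in Fig. 5(b)."""
    N = len(w)
    W = [0]
    for x in w:
        W.append(W[-1] + x)

    def probe(B):                        # greedy feasibility test, O(P log N)
        s, limit = 0, B
        for _ in range(P):
            s = bisect.bisect_right(W, limit, s) - 1
            if s == N:
                return True
            limit = W[s] + B
        return False

    LB, UB = W[N] / P, float(W[N])       # undetermined bottleneck-value range
    start, best = 1, float("inf")        # start = i_{b-1} (1-based task index)
    for b in range(1, P):
        lo, hi = start, N                # binary search for the smallest i_b
        while lo < hi:
            mid = (lo + hi) // 2
            B = W[mid] - W[start - 1]    # candidate weight W_{i_{b-1}, mid}
            if B <= LB:
                feasible = False         # below the proven lower bound
            elif B >= UB:
                feasible = True          # above the proven upper bound
            elif probe(B):
                feasible, UB = True, B   # refine the undetermined range
            else:
                feasible, LB = False, B
            if feasible:
                hi = mid
            else:
                lo = mid + 1
        best = min(best, W[lo] - W[start - 1])   # save B_b
        start = lo                       # s_b = i_b - 1; next part starts at i_b
    best = min(best, W[N] - W[start - 1])        # last processor takes the rest
    return best
```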

4. Proposed CCP algorithms

In this section, we present the proposed methods. First, we describe how to bound the separator indices of an optimal solution to reduce the search space. Then we show how this technique can be used to improve the performance of the dynamic-programming algorithm. We continue with our discussion on improving the MS algorithm, and propose a novel refinement algorithm, which we call the bidding algorithm. Finally, we discuss parametric-search methods, proposing improvements for the bisection and Nicol's algorithms.

4.1. Restricting the search space

Fig. 5. Nicol's [27] algorithm: (a) straightforward implementation; (b) careful implementation with dynamic bottleneck-value bounding.

Our proposed CCP algorithms exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator values as a preprocessing step. Natural lower and upper bounds for the optimal bottleneck value B_opt of a given CCP problem instance (W, N, P) are LB = max{B*, w_max} and UB = B* + w_max, respectively, where B* = W_tot/P. Since w_max < B* in coarse-grain parallelization of most real-world applications, our presentation assumes w_max < B* = LB, even though all results remain valid when B* is replaced with max{B*, w_max}. The following lemma describes how to use these natural bounds on B_opt to restrict the search space for the separator values.

Lemma 4.1. For a given CCP problem instance (W, N, P), if B_f is a feasible bottleneck value in the range [B*, B* + w_max], then there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B_f with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p and SH_p are, respectively, the smallest and largest indices such that

W_{1,SL_p} ≥ p(B* − w_max(P − p)/P) and
W_{1,SH_p} ≤ p(B* + w_max(P − p)/P).

Proof. Let B_f = B* + w, where 0 ≤ w < w_max. Partition Π can be constructed by PROBE(B_f), which loads the first p processors as much as possible, subject to L_q ≤ B_f for q = 1, 2, …, p. In the worst case, w_{s_p+1} = w_max for each of the first p processors. Thus we have W_{1,s_p} ≥ f(w) = p(B* + w − w_max) for p = 1, 2, …, P − 1. However, it should be possible to divide the remaining subchain T_{s_p+1,N} into P − p parts without exceeding B_f, i.e., W_{s_p+1,N} ≤ (P − p)(B* + w). Thus we also have W_{1,s_p} ≥ g(w) = W_tot − (P − p)(B* + w). Note that f(w) is an increasing function of w, whereas g(w) is a decreasing function of w. The minimum of max{f(w), g(w)} is at the intersection of f(w) and g(w), so that W_{1,s_p} ≥ p(B* − w_max(P − p)/P).

To prove the upper bounds, we can start with W_{1,s_p} ≤ f(w) = p(B* + w), which holds when L_q = B* + w for q = 1, …, p. The condition W_{s_p+1,N} ≥ (P − p)(B* + w − w_max), however, ensures the feasibility of B_f = B* + w, since PROBE(B_f) can always load each of the remaining (P − p) processors with B* + w − w_max. Thus we also have W_{1,s_p} ≤ g(w) = W_tot − (P − p)(B* + w − w_max). Here, f(w) is an increasing function of w, whereas g(w) is a decreasing function of w, which yields W_{1,s_p} ≤ p(B* + w_max(P − p)/P). □

Corollary 4.2. The separator range weights are ΔW_p = Σ_{i=SL_p}^{SH_p} w_i = W_{1,SH_p} − W_{1,SL_p} = 2 w_max p(P − p)/P, with a maximum value of P w_max/2 at p = P/2.

Applying this corollary requires finding w_max, which entails an overhead equivalent to that of the prefix-sum operation and hence should be avoided. In this work, we adopt a practical scheme to construct the bounds on the separator indices. We run the RB heuristic to find a hopefully good bottleneck value B_RB and use B_RB as an upper bound on bottleneck values, i.e., UB = B_RB. Then we run LR-PROBE(B_RB) and RL-PROBE(B_RB) to construct two mappings Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨l2_0, l2_1, …, l2_P⟩ with C(Π1), C(Π2) ≤ B_RB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered as the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left; that is, RL-PROBE assigns subchains from the right end towards the left end of the task chain to processors in the order P_P, P_{P−1}, …, P_1. From these two mappings, lower and upper bound values for the s_p separator indices are constructed as SL_p = l2_p and SH_p = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings Π3 = ⟨l3_0, l3_1, …, l3_P⟩ and Π4 = ⟨h4_0, h4_1, …, h4_P⟩, and then defining SL_p = max{SL_p, l3_p} and SH_p = min{SH_p, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.
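This bound-construction scheme can be sketched as follows (a Python sketch under our own naming and 0-based prefix-indexing conventions; Bf stands for any feasible bottleneck value, e.g., B_RB from the RB heuristic):

```python
from bisect import bisect_left, bisect_right

def lr_probe(prefix, P, B):
    """Left-to-right probe: returns separators [s_0..s_P], loading each
    processor maximally without exceeding B (feasible iff s_P == N)."""
    seps, s = [0], 0
    for _ in range(P):
        s = bisect_right(prefix, prefix[s] + B) - 1
        seps.append(s)
    return seps

def rl_probe(prefix, P, B):
    """Right-to-left dual: assigns maximal subchains from the right end to
    processors P_P, ..., P_1 (feasible iff the returned s_0 == 0)."""
    N = len(prefix) - 1
    seps, s = [N], N
    for _ in range(P):
        s = bisect_left(prefix, prefix[s] - B)   # smallest s' with load <= B
        seps.append(s)
    return seps[::-1]

def separator_bounds(w, P, Bf):
    """Index bounds SL_p <= s_p <= SH_p in the spirit of Corollary 4.5,
    from probes at a feasible Bf and at the ideal bottleneck B* = W_tot/P."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    Bstar = prefix[-1] / P
    h1, l2 = lr_probe(prefix, P, Bf), rl_probe(prefix, P, Bf)
    l3, h4 = lr_probe(prefix, P, Bstar), rl_probe(prefix, P, Bstar)
    SL = [max(l2[p], l3[p]) for p in range(P + 1)]
    SH = [min(h1[p], h4[p]) for p in range(P + 1)]
    return SL, SH
```

For w = ⟨2, 1, 1, 2, 1, 1⟩, P = 3, and Bf = 3, both bound arrays collapse to the unique optimal separator vector ⟨0, 2, 4, 6⟩.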

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨l2_0, l2_1, …, l2_P⟩ be the partitions constructed by LR-PROBE(B_f) and RL-PROBE(B_f), respectively. Then any partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) = B ≤ B_f satisfies l2_p ≤ s_p ≤ h1_p.

Proof. By the property of LR-PROBE(B_f), h1_p is the largest index such that T_{1,h1_p} can be partitioned into p parts without exceeding B_f. If s_p > h1_p, then the bottleneck value will exceed B_f, and thus B. By the property of RL-PROBE(B_f), l2_p is the smallest index such that T_{l2_p+1,N} can be partitioned into P − p parts without exceeding B_f. If s_p < l2_p, then the bottleneck value will exceed B_f, and thus B. □

Lemma 4.4. For a given CCP problem instance (W, N, P), let Π3 = ⟨l3_0, l3_1, …, l3_P⟩ and Π4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B*) and RL-PROBE(B*), respectively. Then for any feasible bottleneck value B_f, there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B_f that satisfies l3_p ≤ s_p ≤ h4_p.

Proof. Consider the partition Π = ⟨s_0, s_1, …, s_P⟩ constructed by LR-PROBE(B_f). It is clear that this partition already satisfies the lower bounds, i.e., s_p ≥ l3_p. Assume s_p > h4_p; then the partition Π′ obtained by moving s_p back to h4_p also yields a partition with cost C(Π′) ≤ B_f, since T_{h4_p+1,N} can be partitioned into P − p parts without exceeding B*. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions with cost ≤ B_f within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π1 = ⟨h1_0, h1_1, …, h1_P⟩, Π2 = ⟨l2_0, l2_1, …, l2_P⟩, Π3 = ⟨l3_0, l3_1, …, l3_P⟩, and Π4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B_f), RL-PROBE(B_f), LR-PROBE(B*), and RL-PROBE(B*), respectively. Then for any feasible bottleneck value B in the range [B*, B_f], there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p = max{l2_p, l3_p} and SH_p = min{h1_p, h4_p}.

Corollary 4.6. The separator range weights become ΔW_p = 2 min{p, (P − p)} w_max in the worst case, with a maximum value of P w_max at p = P/2.

Lemma 4.1 and Corollary 4.5 imply the following theorem, since B* ≤ B_opt ≤ B* + w_max.

Theorem 4.7. For a given CCP problem instance (W, N, P) and SL_p and SH_p index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition Π_opt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the separator range weights produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however, and the practical scheme normally finds much better bounds, since order in the chain usually prevents the worst-case behavior and B_RB < B* + w_max. Experimental results in Section 6 justify this expectation.

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size ΔS_p = SH_p − SL_p + 1 denotes the number of tasks within the pth range [SL_p, SH_p]. Miguet and Pierson [24] propose the model w_i = Θ(w_avg) for i = 1, 2, …, N to prove that their H1 and H2 heuristics allocate Θ(N/P) tasks to each processor, where w_avg = W_tot/N is the average task weight. This assumption means that the weight of each task is not too far from the average task weight. Using Corollaries 4.2 and 4.6, this model induces ΔS_p = O(P w_max/w_avg). Moreover, this model can be exploited to induce the optimistic bound ΔS_p = O(P). However, we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from w_avg. Here, we establish a looser and more realistic model on task weights, such that for any subchain T_{i,j} with weight W_{i,j} sufficiently larger than w_max, the average task weight within the subchain T_{i,j} is Θ(w_avg); that is, Δ_{i,j} = j − i + 1 = O(W_{i,j}/w_avg). This model, referred to here as model M, directly induces ΔS_p = O(P w_max/w_avg), since ΔW_p ≤ ΔW_{P/2} = P w_max/2 for p = 1, 2, …, P − 1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index-bound arrays, each of size P, computed according to Corollary 4.5 with B_f = B_RB. Note that SL_P = SH_P = N, since only B_{P,N} needs to be computed in the last row. As seen in Fig. 6, only the B_{p,j} values for j = SL_p, SL_p + 1, …, SH_p are computed at each row p, by exploiting Corollary 4.5, which ensures the existence of an optimal partition Π_opt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p. Thus these B_{p,j} values suffice for correct computation of the B_{p+1,i} values for i = SL_{p+1}, SL_{p+1} + 1, …, SH_{p+1} at the next row p + 1.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SH_p to SH_p + 1 within the repeat–until-loop while computing B_{p+1,i} with SL_{p+1} ≤ i ≤ SH_{p+1} in two cases. In both cases, the functions W_{j+1,i} and B_{p,j} intersect in the open interval (SH_p, SH_p + 1), so that B_{p,SH_p} < W_{SH_p+1,i} and B_{p,SH_p+1} ≥ W_{SH_p+2,i}. In the first case, i = SH_p + 1, so that W_{j+1,i} and B_{p,j} intersect in (i − 1, i), which implies that B_{p+1,i} = W_{i−1,i} with j^{p+1}_i = SL_p, since W_{i−1,i} < B_{p,i}, as mentioned in Section 3.2. In the second case, i > SH_p + 1, for which Corollary 4.5 guarantees that B_{p+1,i} = W_{SL_p+1,i} ≤ B_{p,SH_p+1}, and thus we can safely select j^{p+1}_i = SL_p. Note that W_{SL_p+1,i} = B_{p,SH_p+1} may correspond to a case leading to another optimal partition with j^{p+1}_i = s_{p+1} = SH_p + 1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ to B_{p,SH_p+1} as a sentinel. Hence, in such cases, the condition W_{SH_p+1,i} < B_{p,SH_p+1} = ∞ in the if–then statement following the repeat–until-loop is always true, so that the j-index automatically moves back to SH_p. The scheme of computing B_{p,SH_p+1} for each row p, which seems to be a natural solution, does not work, since correct computation of B_{p+1,SH_{p+1}+1} may necessitate more than one B_{p,j} value beyond the SH_p index bound.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a P × N matrix to store the minimum j^p_i index values defining the B_{p,i} values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space for at least one optimal solution. For this purpose, the index bounds can instead be computed according to Lemma 4.3, since the search space restricted by Lemma 4.3 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔS_p). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔS_p = O(P w_max/w_avg), and hence the complexity is O(N + P log N + P² w_max/w_avg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition w_max = O(2W_tot/P²).
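The effect of separator-index bounding on the DP recurrence can be illustrated with a simplified sketch (ours): a textbook DP that merely skips the B_{p,i} entries outside [SL_p, SH_p], omitting the incremental j-pointer and the sentinel machinery of Fig. 6:

```python
def dp_ccp(w, P, SL=None, SH=None):
    """Textbook DP for CCP: B[p][i] = min_j max(B[p-1][j], W_{j+1,i}).
    If SL/SH are given (per Corollary 4.5), row p only computes i in
    [SL_p, SH_p]; this restriction is what makes DP+ fast, while the naive
    inner loop is kept here for clarity."""
    N = len(w)
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    if SL is None:
        SL, SH = [0] * (P + 1), [N] * (P + 1)
    B = {0: {0: 0}}                    # row 0: zero tasks on zero processors
    for p in range(1, P + 1):
        row = {}
        lo, hi = (SL[p], SH[p]) if p < P else (N, N)   # only B[P][N] matters last
        for i in range(lo, hi + 1):
            row[i] = min(max(bprev, prefix[i] - prefix[j])
                         for j, bprev in B[p - 1].items() if j <= i)
        B[p] = row
    return B[P][N]
```

With valid bounds, restricting the rows does not change the answer; e.g., dp_ccp([1, 2, 3, 4, 5], 3) and the call with bounds SL = [0, 1, 3, 5], SH = [0, 3, 4, 5] (which contain the optimal separators) both return 6.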

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is runtime efficient for small-to-medium numbers of processors. The main difference between the MS and the bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value encountered becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist-partition constraint, but it leads to very poor runtime performance. Here, we propose using the partition generated by PROBE(B*) as an initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the runtime performance of the algorithm. Also, using a heap as a priority queue does not give better runtime performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

Fig. 7. Bidding algorithm.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Π_t = ⟨s_0, s_1, …, s_P⟩ constructed by PROBE(B_t) for an infeasible B_t. After detecting the infeasibility of this B_t value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > B_t will never be to the left of the respective separator indices of Π_t. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current B_t value (i.e., L_P > B_t). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P}. Here, we call the value L_p + w_{s_p+1} the bid of processor P_p, which refers to the load of P_p if the first task t_{s_p+1} of the next processor is augmented to P_p. The bid of the last processor P_P is equal to the load of the remaining tasks. If the smallest bid B comes from processor P_b, probing with the new B is performed only for the remaining processors ⟨P_b, P_{b+1}, …, P_P⟩ in the suffix W_{s_{b−1}+1,N} of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while-loop implements a linear probing scheme, such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index s_p is set for processor P_p during linear probing, the repeat–until-loop terminates if it is not possible to partition the remaining subchain T_{s_p+1,N} into P − p parts without exceeding the current B value, i.e., rbid = L_r/(P − p) > B, where L_r denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value of the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P w_max + P²(w_max/w_avg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial setting of the separators through binary search. The B value is increased at most B_opt − B* < w_max times, and each time the next B value can be computed in O(P) time, which induces the O(P w_max) cost. The total area scanned by the separators is at most O(P²(w_max/w_avg)). For noninteger task weights, the complexity can reach O(P log N + P³(w_max/w_avg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
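The bidding idea can be sketched as follows (our simplified Python rendering: it re-runs the probe from scratch each round instead of using the incremental linear probing and the BIDS array, so it illustrates correctness rather than the stated complexity):

```python
from bisect import bisect_right

def bidding(w, P):
    """Bidding algorithm sketch: start from the ideal bottleneck B* = W_tot/P
    and raise B to the smallest processor 'bid' until a probe succeeds; the
    first feasible B encountered is optimal."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    N, Wtot = len(w), prefix[-1]
    B = Wtot / P                       # ideal bottleneck value B*
    while True:
        seps, s = [0], 0               # greedy probe at the current B
        for _ in range(P):
            s = bisect_right(prefix, prefix[s] + B) - 1
            seps.append(s)
        if seps[-1] == N:
            return B                   # feasible, hence optimal
        # next candidate: minimum over bids L_p + w_{s_p+1} and the last load L_P
        bids = [prefix[seps[p]] - prefix[seps[p - 1]] + w[seps[p]]
                for p in range(1, P) if seps[p] < N]
        bids.append(Wtot - prefix[seps[P - 1]])   # remaining load for P_P
        B = min(bids)
```

Every bid strictly exceeds the current infeasible B (the greedy probe loaded each processor maximally), so B moves through an increasing sequence of realizable values and stops at the first feasible one.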


4.4. Parametric search algorithms

In this work, we apply the theoretical findings of Section 4.1 to obtain an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits bounds computed according to Corollary 4.5 (with B_f = B_RB) to restrict the search space for the s_p separator values during the binary searches on the W array. That is, BINSRCH(W, SL_p, SH_p, Bsum) in RPROBE searches W in the index range [SL_p, SH_p] to find the index SL_p ≤ s_p ≤ SH_p such that W[s_p] ≤ Bsum and W[s_p + 1] > Bsum, via binary search. This scheme and Corollaries 4.2 and 4.6 reduce the complexity of an individual probe to Σ_{p=1}^{P} Θ(log ΔS_p) = O(P log P + P log(w_max/w_avg)). Note that this complexity reduces to O(P log P) for sufficiently large P, where P = Ω(w_max/w_avg). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, B_RB], as opposed to [B*, W_tot]. In this algorithm, if PROBE(B_t) = TRUE, then the search space is restricted to the values B ≤ B_t, and if PROBE(B_t) = FALSE, then the search space is restricted to the values B > B_t. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator index-bounds depending on the success or failure of the probes. Let Π_t = ⟨t_0, t_1, …, t_P⟩ be the partition constructed by PROBE(B_t). Any future PROBE(B) call with B ≤ B_t will set the s_p indices with s_p ≤ t_p; thus the search space for s_p can be restricted to the indices ≤ t_p. Similarly, any future PROBE(B) call with B ≥ B_t will set the s_p indices ≥ t_p; thus the search space for s_p can be restricted to the indices ≥ t_p.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for-loop over the SL or SH arrays, depending on the failure or success, respectively, of RPROBE(B_t). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π_t or SH ← Π_t, depending on the failure or success of RPROBE(B_t).
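The dynamic-bounding loop can be sketched as follows (our Python rendering; rprobe restricts each greedy step to [SL_p, SH_p], and the SL/SH arrays are replaced wholesale by the separators of the last failed or successful probe):

```python
from bisect import bisect_right

def eps_bisect(w, P, eps=1e-9):
    """epsilon-approximate bisection over [B*, B* + w_max] with dynamic
    separator-index bounding: a successful probe's separators become the new
    SH array, a failed probe's separators become the new SL array."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    N = len(w)
    Bstar = prefix[-1] / P

    def rprobe(B, SL, SH):
        seps, s = [0], 0
        for p in range(1, P + 1):
            lo = max(SL[p], s)
            # largest index in [lo, SH[p]] whose prefix sum is <= prefix[s] + B
            s = bisect_right(prefix, prefix[s] + B, lo, SH[p] + 1) - 1
            seps.append(s)
        return seps                    # feasible iff seps[-1] == N

    SL, SH = [0] * (P + 1), [N] * (P + 1)
    LB, UB = Bstar, Bstar + max(w)     # natural bounds on B_opt (Section 4.1)
    while UB - LB > eps:
        Bt = (LB + UB) / 2
        seps = rprobe(Bt, SL, SH)
        if seps[-1] == N:              # feasible: future probes use B <= Bt
            UB, SH = Bt, seps
        else:                          # infeasible: future probes use B > Bt
            LB, SL = Bt, seps
    return UB
```

Because the bisection only ever probes values strictly between the last failure and the last success, the narrowed [SL_p, SH_p] ranges always contain the separators the unrestricted probe would have chosen.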


Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, B_RB < B* + w_max. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log w_max, and thus the overall complexity is O(N + P log N + log(w_max)(P log P + P log(w_max/w_avg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm into an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value (the total weight of a subchain of W). This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps to find the optimal bottleneck value.

After a probe RPROBE(B_t), the current upper bound value UB is modified if RPROBE(B_t) succeeds. Note that RPROBE(B_t) not only determines the feasibility of B_t, but also constructs a partition Π_t with cost(Π_t) ≤ B_t. Instead of reducing the upper bound UB to B_t, we can further reduce UB to the bottleneck value B = cost(Π_t) ≤ B_t of the partition Π_t constructed by RPROBE(B_t). Similarly, the current lower bound LB is modified when RPROBE(B_t) fails. In this case, instead of increasing the bound LB to B_t, we can exploit the partition Π_t constructed by RPROBE(B_t) to increase LB further, to the smallest realizable bottleneck value B greater than B_t. Our bidding algorithm already describes how to compute

B = min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P},

where L_p denotes the load of processor P_p in Π_t. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of

ARTICLE IN PRESSA Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996986

the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(w_max/w_avg)),

which has the solution O(P log P log N + P log N log(w_max/w_avg)). Here, O(P log P + P log(w_max/w_avg)) is the cost of a probe operation, and log N is the expected number of probes. Thus the overall complexity becomes O(N + P log P log N + P log N log(w_max/w_avg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
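The realizable-value bisection can be sketched as follows (ours; it uses the unrestricted probe for brevity where Fig. 9 would use RPROBE with dynamic bounds):

```python
from bisect import bisect_right

def exact_bisect(w, P):
    """Bisection made exact: every successful probe pulls UB down to the cost
    of the partition it built; every failed probe pushes LB up to the smallest
    realizable value above Bt (the minimum 'bid'). Both bounds land on
    subchain weights, so the loop ends with LB == UB == B_opt."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    N, Wtot = len(w), prefix[-1]

    def probe(B):                      # greedy partition for candidate B
        seps, s = [0], 0
        for _ in range(P):
            s = bisect_right(prefix, prefix[s] + B) - 1
            seps.append(s)
        return seps

    LB, UB = Wtot / P, Wtot            # B* and a trivially feasible value
    while LB < UB:
        Bt = (LB + UB) / 2
        seps = probe(Bt)
        if seps[-1] == N:              # feasible: UB <- cost of this partition
            UB = max(prefix[seps[p + 1]] - prefix[seps[p]] for p in range(P))
        else:                          # infeasible: LB <- smallest bid > Bt
            bids = [prefix[seps[p]] - prefix[seps[p - 1]] + w[seps[p]]
                    for p in range(1, P) if seps[p] < N]
            bids.append(Wtot - prefix[seps[P - 1]])
            LB = min(bids)
    return UB
```

On integer workloads the bounds snap onto subchain weights immediately, so only a handful of probes is needed; e.g., for w = ⟨1, 2, 3, 4, 5⟩ and P = 3 the loop returns 6.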

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

Theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = W_tot), the success and failure of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π_t = ⟨0, t_1, t_2, …, t_{P−1}, N⟩ to be constructed by any PROBE(B_t) with LB_1 < B_t = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < B_t < W_{1,i_1}. Hence, probes with LB_1 < B_t < UB_1 for processor P_2 can be restricted to be performed in W^{i_1,N}, where W^{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T^{P_p}_i denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T_{i,N} = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P_{p,P} = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P, and let C(T^{P_p}_i) denote the cost of an optimal solution of this subproblem. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T_{i_1,N}; that is, B_opt = min{B_1 = W_{1,i_1}, C(T^{P_1}_{i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, then B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(T^{P_p}_{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i_{b−1}, i_b is found in the inner while-loop by conducting probes on W^{i_{b−1},N} to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted; that is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1}+1,j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + w_max(P log P + P log(w_max/w_avg))) for integer task weights, because the number of probes cannot exceed w_max, since there are at most w_max distinct bound values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + w_max(P log P)² + w_max P² log P log(w_max/w_avg)), since the algorithm makes O(w_max P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A^r_p of A and is responsible for computing y_p = A^r_p x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A^c_p of A and is responsible for computing y^p = A^c_p x_p, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing the communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements, due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = W_tot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{i,j} can be efficiently computed as W_{i,j} = ROW[j + 1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS, with rows and columns interchanged; thus, W_{i,j} is computed as W_{i,j} = COL[j + 1] − COL[i].
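In code, the constant-time subchain-weight trick looks like this (a sketch under our own convention: rows are 1-based, and ROW[k] holds the number of nonzeros in the first k rows, so ROW is a ready-made prefix-sum array):

```python
def crs_row_pointers(A):
    """Build the CRS row-pointer array for a dense matrix A: ROW[k] is the
    number of nonzeros in the first k rows."""
    ROW = [0]
    for row in A:
        ROW.append(ROW[-1] + sum(1 for v in row if v != 0))
    return ROW

def subchain_weight(ROW, i, j):
    """W_{i,j}: number of nonzeros in rows i..j (1-based, inclusive),
    computed in O(1) time with no separate prefix-sum pass."""
    return ROW[j] - ROW[i - 1]
```

For the 3 × 3 pattern with rows of 2, 1, and 3 nonzeros, ROW = [0, 2, 3, 6], and any row-range weight is a single subtraction.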

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{i,j} as W_{i,j} = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels, from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workloads, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. With this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the w_avg, w_min, and w_max columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, which represent the results of computational fluid dynamics simulations and are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. Triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256×256, 512×512, and 1024×1024 superimposed on a screen of resolution 1024×1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because the zero-weight tasks are compressed to retain only nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC− and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in

Table 2
Properties of the test set

Sparse-matrix (SpM) dataset:

Name     No. of     Workload (no. of nonzeros)            SpMxV ex.
         tasks N    Total Wtot   w_avg   w_min   w_max    time (ms)
NL         7039      105 089     14.93     1      361       22.55
cre-d      8926      372 266     41.71     1      845       72.20
CQ9        9278      221 590     23.88     1      702       45.90
ken-11    14 694      82 454      5.61     2      243       19.65
mod2      34 774     604 910     17.40     1      941      124.05
world     34 506     582 064     16.87     1      972      119.45

Direct volume rendering (DVR) dataset:

Name        No. of     Workload
            tasks N    Total Wtot   w_avg   w_min    w_max
blunt256     17 303     303 K       17.54   0.020   1590.97
blunt512     93 231     314 K        3.36   0.004    661.50
blunt1024   372 824     352 K        0.94   0.001    411.04
post256      19 653     495 K       25.19   0.077   3245.50
post512     134 950     569 K        4.22   0.015   1092.00
post1024    539 994     802 K        1.49   0.004   1546.78

Figs. 5(a) and (b), respectively. Abbreviations ending with '+' denote our improved versions of these algorithms; that is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms on the SpM dataset with ε = 1, since the workload arrays in the SpM dataset are integer valued. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms for the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of the heuristics and exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained on both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics widens with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges, normalized with respect to N, i.e., (1/N) Σ_{p=1}^{P−1} ΔS_p, where ΔS_p = S_p^H − S_p^L + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for the separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding degrades with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range size below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = w_min and then improving the resulting partition using the BID algorithm. Column εBS+&BID refers to this scheme, and the values after '+' denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns for the SpM dataset shows that using B_RB instead of Wtot

Table 3
Percent load-imbalance values

Sparse-matrix dataset:

Name     P      H1      H2      RB     OPT
NL       16     2.60    2.44    1.20    0.35
         32     5.02    5.75    3.44    0.95
         64     8.95    9.01    5.60    2.37
        128    33.13   27.16   22.78    4.99
        256    69.55   69.55   60.78   14.25
cre-d    16     2.27    0.98    0.53    0.45
         32     4.19    4.42    3.74    1.03
         64     7.12    4.92    4.34    1.73
        128    25.57   18.73   16.70    2.88
        256    37.54   26.81   35.20   10.95
CQ9      16     1.85    1.85    0.58    0.58
         32     5.65    2.88    2.24    0.90
         64    13.25   11.49    7.64    1.43
        128    33.96   32.22   22.34    3.51
        256    58.62   58.62   58.62   14.72
ken-11   16     3.74    2.01    0.98    0.21
         32     3.74    4.67    3.74    1.18
         64    13.17   13.17   13.17    1.29
        128    13.17   16.89   13.17    6.80
        256    50.99   50.99   50.99    7.11
mod2     16     0.06    0.06    0.06    0.03
         32     0.19    0.19    0.19    0.07
         64     7.42    2.72    2.18    0.18
        128    16.15    6.29    2.46    0.41
        256    19.47   19.47   18.92    1.23
world    16     0.27    0.18    0.09    0.04
         32     1.03    0.37    0.27    0.08
         64     4.73    4.73    4.73    0.28
        128     6.37    6.37    6.37    0.76
        256    27.99   27.41   27.41    1.11

DVR dataset:

Name       P      H1      H2      RB     OPT
blunt256   16     4.60    1.20    0.49    0.34
           32     6.93    2.61    1.94    1.12
           64    14.52    9.44    9.44    2.31
          128    38.25   24.39   16.67    4.82
          256    96.72   37.03   34.21   34.21
blunt512   16     0.95    0.98    0.98    0.16
           32     1.38    1.38    1.18    0.33
           64     2.87    2.69    1.66    0.53
          128     5.62    8.45    4.62    0.97
          256    14.18   14.33    9.34    2.28
blunt1024  16     0.94    0.57    0.95    0.10
           32     1.89    1.21    0.97    0.14
           64     4.99    2.16    1.44    0.26
          128    10.06    4.25    2.47    0.57
          256    19.65    9.68    9.68    0.98
post256    16     1.10    1.43    0.76    0.56
           32     3.23    3.98    3.23    1.11
           64    17.28   11.04   10.90    3.10
          128    45.35   29.09   29.09    8.29
          256    67.86   67.86   67.86   67.86
post512    16     0.49    1.25    0.33    0.33
           32     0.94    1.61    0.90    0.58
           64     4.85    5.33    4.85    0.94
          128    18.03   14.55   10.15    1.72
          256    30.03   25.29   25.29    3.73
post1024   16     0.70    0.70    0.53    0.20
           32     1.49    1.49    1.41    0.54
           64     2.85    2.85    1.49    0.91
          128    13.15   11.79    9.10    1.11
          256    40.50   13.37   14.19    2.54

Averages over P:

P      SpM:  H1      H2      RB     OPT     DVR:  H1      H2      RB     OPT
16           1.76    1.15    0.74    0.36         1.44    1.01    1.04    0.28
32           3.99    3.39    2.88    0.76         2.64    2.05    1.61    0.64
64           9.01    6.78    5.69    1.43         7.89    5.58    4.96    1.31
128         19.97   17.66   14.09    3.36        21.74   15.42   12.02    2.91
256         43.89   38.91   36.04    9.18        44.82   27.93   26.76   18.60

for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only a minor improvement on the SpM dataset because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in ms) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification, so that the improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as

Table 4
Sizes of separator-index ranges, normalized with respect to N

Sparse-matrix dataset:

Name     P    (N−P)P      L1       L3       C5
NL       16     15.97     0.32     0.13     0.036
         32     31.86     1.23     0.58     0.198
         64     63.43     4.97     2.24     0.964
        128    125.69    19.81    19.60     4.305
        256    246.73    77.96    82.83    22.942
cre-d    16     15.97     0.16     0.04     0.019
         32     31.89     0.76     0.90     0.137
         64     63.55     2.89     1.85     0.610
        128    126.18    11.59    15.12     2.833
        256    248.69    46.66    57.54    16.510
CQ9      16     15.97     0.30     0.03     0.023
         32     31.89     1.21     0.34     0.187
         64     63.57     4.84     2.96     0.737
        128    126.25    19.20    18.66     3.121
        256    248.96    74.84    86.80    23.432
ken-11   16     15.98     0.28     0.11     0.024
         32     31.93     1.10     1.13     0.262
         64     63.73     4.42     7.34     0.655
        128    126.89    17.60    14.56     4.865
        256    251.56    69.18    85.44    13.076
mod2     16     15.99     0.16     0.00     0.002
         32     31.97     0.55     0.04     0.013
         64     63.88     2.14     1.22     0.109
        128    127.53     8.62     2.59     0.411
        256    254.12    34.49    39.02     2.938
world    16     15.99     0.15     0.01     0.004
         32     31.97     0.59     0.06     0.020
         64     63.88     2.30     2.81     0.158
        128    127.53     9.23     7.19     0.850
        256    254.11    36.97    53.15     2.635

DVR dataset:

Name       P    (N−P)P      L1       L3       C5
blunt256   16     15.99     0.38     0.07     0.003
           32     31.94     1.22     0.30     0.144
           64     63.77     5.10     3.81     0.791
          128    127.06    21.66    12.88     2.784
          256    252.23   101.68    50.48    10.098
blunt512   16     16.00     0.09     0.17     0.006
           32     31.99     0.35     0.26     0.043
           64     63.96     1.64     0.69     0.144
          128    127.83     6.65     4.59     0.571
          256    255.30    28.05    16.71     2.262
blunt1024  16     16.00     0.05     0.09     0.010
           32     32.00     0.18     0.18     0.018
           64     63.99     0.77     0.74     0.068
          128    127.96     3.43     2.53     0.300
          256    255.82    13.92    20.36     1.648
post256    16     15.99     0.35     0.04     0.021
           32     31.95     1.50     0.52     0.162
           64     63.79     6.12     4.26     0.932
          128    127.17    26.48    23.68     4.279
          256    252.68   124.76    90.48    16.485
post512    16     16.00     0.13     0.02     0.003
           32     31.99     0.56     0.14     0.063
           64     63.97     2.33     2.41     0.413
          128    127.88     9.33    10.15     1.714
          256    255.52    37.08    46.28     6.515
post1024   16     16.00     0.17     0.07     0.020
           32     32.00     0.54     0.27     0.087
           64     63.99     2.19     0.61     0.210
          128    127.97     8.77     9.48     0.918
          256    255.88    34.97    27.28     4.479

Averages over P:

P      SpM: (N−P)P     L1      L3      C5      DVR: (N−P)P     L1      L3      C5
16            15.97    0.21    0.06    0.018          16.00    0.20    0.08    0.011
32            31.88    0.86    0.68    0.136          31.98    0.73    0.28    0.096
64            63.52    3.43    2.63    0.516          63.91    3.03    2.09    0.426
128          126.07   13.68   12.74    2.731         127.65   12.72   10.55    1.428
256          248.26   54.28   57.55   14.089         254.57   56.74   41.93    6.915

L1, L3, and C5 denote the separator-index ranges obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively.

an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 on both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 on both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS on both the SpM and DVR datasets, and that εBS is considerably faster than NC on the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-refinement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that

Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset:

Name     P     NC−    NC    NC+    εBS    εBS+   EBS
NL       16    177    21     7     17      6      7
         32    352    19     7     17      7      7
         64    720    27     5     16      7      6
        128   1450    40    14     17      8      8
        256   2824    51    14     16      8      8
cre-d    16    190    20     6     19      7      5
         32    393    25    11     19      9      8
         64    790    24    10     19      8      6
        128   1579    27    10     19      9      9
        256   3183    37    13     18      9      9
CQ9      16    182    19     3     18      6      5
         32    373    22     8     18      8      7
         64    740    29    12     18      8      8
        128   1492    40    12     17      8      8
        256   2971    50    14     18      9      9
ken-11   16    185    21     6     17      6      6
         32    364    20     6     17      6      6
         64    721    48     9     17      7      7
        128   1402    61     8     16      7      7
        256   2783    96     7     17      8      8
mod2     16    210    20     6     20      5      4
         32    432    26     8     19      5      5
         64    867    24     6     19      8      8
        128   1727    38     7     20      7      7
        256   3444    47     9     20      9      9
world    16    211    22     6     19      5      4
         32    424    26     5     19      6      6
         64    865    30    12     19      9      9
        128   1730    44    10     19      8      8
        256   3441    44    11     19     10     10

DVR dataset:

Name       P     NC−    NC    NC+    εBS+&BID   EBS
blunt256   16    202    18     6      16 + 1     9
           32    407    19     7      19 + 1    10
           64    835    18    10      21 + 1    12
          128   1686    25    15      22 + 1    13
          256   3556   203    62      23 + 1    11
blunt512   16    245    20     8      17 + 1     9
           32    502    24    13      18 + 1    10
           64   1003    23    11      18 + 1    12
          128   2023    26    12      20 + 1    13
          256   4051    23    12      21 + 1    13
blunt1024  16    271    21    10      16 + 1    13
           32    557    24    12      18 + 1    12
           64   1127    21    11      18 + 1    12
          128   2276    28    13      19 + 1    13
          256   4564    27    16      21 + 1    15
post256    16    194    22     9      18 + 1     7
           32    409    19     8      20 + 1    12
           64    823    17    11      22 + 1    12
          128   1653    19    13      22 + 1    14
          256   3345   191   102      23 + 1    13
post512    16    235    24     9      16 + 1    10
           32    485    23    12      18 + 1    11
           64    975    22    13      20 + 1    14
          128   1975    25    17      21 + 1    14
          256   3947    29    20      22 + 1    15
post1024   16    261    27    10      17 + 1    13
           32    538    23    13      19 + 1    14
           64   1090    22    12      17 + 1    14
          128   2201    25    15      21 + 1    16
          256   4428    28    16      22 + 1    16

Averages over P:

P      SpM:  NC−    NC     NC+    εBS    εBS+   EBS    DVR:  NC−    NC     NC+    εBS+&BID    EBS
16           193    20.5    5.7   18.5    5.8    5.2         235    22.0    8.7    16.7 + 1   10.2
32           390    23.0    7.5   18.3    6.8    6.5         483    22.0   10.8    18.7 + 1   11.5
64           784    30.3    9.0   18.0    7.8    7.3         976    20.5   11.3    19.3 + 1   12.7
128         1564    41.7   10.2   18.0    7.8    7.8        1969    24.7   14.2    20.8 + 1   13.8
256         3108    54.2   11.3   18.0    8.8    8.8        3982    83.5   38.0    22.0 + 1   13.8

DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of separator-index bounding seen in Table 4. In the iterative-refinement approach, MS+ is one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, on the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the

Table 6
Partitioning times for the sparse-matrix dataset, as percents of SpMxV times

Name     P     H1     H2     RB  |   DP     MS     εBS     NC    |   DP+     MS+     εBS+    NC+   |  BID
NL       16    0.09   0.09   0.09     93    119    0.93    1.20      0.40    0.44    0.40    0.40     0.22
         32    0.18   0.18   0.13    177    194    1.77    2.17      1.73    1.73    0.80    0.80     0.44
         64    0.35   0.35   0.31    367    307    3.15    5.85      5.81    4.12    1.86    1.77     1.82
        128    0.89   0.84   0.71    748    485    6.30   17.12     19.87   26.92    4.26    7.32     4.21
        256    1.51   1.55   1.37   1461    757   11.09   40.31     91.80   96.41    9.80   15.70    21.06
cre-d    16    0.03   0.01   0.01     33     36    0.33    0.37      0.12    0.06    0.10    0.11     0.03
         32    0.04   0.06   0.06     71     61    0.65    0.93      0.47    1.20    0.29    0.33     0.10
         64    0.12   0.12   0.08    147     98    1.22    1.69      1.66    1.83    0.61    0.71     0.21
        128    0.28   0.29   0.14    283    150    2.41    3.59      5.47   10.94    1.68    1.68     0.61
        256    0.53   0.54   0.25    571    221    4.20   10.22     27.05   26.02    3.30    4.46     1.20
CQ9      16    0.04   0.04   0.04     59     73    0.48    0.59      0.20    0.13    0.15    0.13     0.17
         32    0.09   0.09   0.07    127    120    0.94    1.31      0.94    0.72    0.41    0.41     0.28
         64    0.20   0.17   0.15    248    195    1.85    3.18      3.01    4.47    1.00    1.44     0.68
        128    0.44   0.44   0.31    485    303    3.18    8.52      9.65   18.82    2.44    3.05     2.51
        256    0.92   0.92   0.63    982    469    6.51   20.33     69.56   72.20    5.32    7.45    14.92
ken-11   16    0.10   0.10   0.10    238    344    1.27    1.88      0.56    0.61    0.46    0.41     0.25
         32    0.20   0.20   0.15    501    522    2.29    3.10      3.82    4.78    1.22    1.17     1.63
         64    0.46   0.46   0.36    993    778    4.48   13.08      9.41   24.78    3.10    3.46     2.29
        128    1.17   1.17   0.71   1876   1139    9.21   39.59     50.13   50.03    6.41    6.62    17.40
        256    2.14   2.14   1.42   3827   1706   17.76  119.13    129.01  241.48   14.91   14.50    29.11
mod2     16    0.02   0.02   0.01     92    119    0.27    0.34      0.07    0.05    0.06    0.06     0.03
         32    0.04   0.04   0.02    188    192    0.51    0.78      0.19    0.18    0.16    0.15     0.07
         64    0.11   0.10   0.06    378    307    1.04    1.91      0.86    2.18    0.51    0.55     0.23
        128    0.24   0.23   0.11    751    482    2.28    5.92      2.61    3.72    1.06    1.23     0.53
        256    0.48   0.48   0.23   1486    756    4.26   11.14     12.11   50.04    2.91    3.43     3.66
world    16    0.02   0.02   0.02     99    122    0.29    0.33      0.08    0.05    0.07    0.08     0.03
         32    0.04   0.04   0.04    198    196    0.59    0.88      0.26    0.21    0.18    0.19     0.08
         64    0.12   0.11   0.09    385    306    1.16    1.70      1.09    4.84    0.64    0.54     0.35
        128    0.24   0.23   0.19    769    496    2.19    5.32      4.48   10.15    1.31    1.23     1.77
        256    0.47   0.47   0.36   1513    764    4.06   11.83     11.54   69.99    3.46    3.50     2.48

Averages over P:

P      H1     H2     RB  |   DP     MS     εBS     NC    |   DP+     MS+     εBS+    NC+   |  BID
16     0.05   0.05   0.05    102    136    0.60    0.78      0.24    0.22    0.21    0.20     0.12
32     0.10   0.10   0.08    210    214    1.12    1.53      1.23    1.47    0.51    0.51     0.43
64     0.23   0.22   0.18    420    332    2.15    4.57      3.64    7.04    1.29    1.41     0.93
128    0.54   0.53   0.36    819    509    4.26   13.34     15.37   20.10    2.86    3.52     4.51
256    1.01   1.01   0.71   1640    779    7.98   35.49     56.84   92.69    6.62    8.17    12.07

SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. The overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small-to-moderate P values (i.e., P ≤ 64) on the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time

Table 7
Partitioning times (in ms) for the DVR dataset

Name       P     H1     H2     RB  |    DP      MS      NC    |   DP+      MS+      NC+     EBS   |   BID
blunt256   16    0.02   0.02   0.03      68      49     0.36      0.21     0.53     0.10    0.14      0.03
(prefix    32    0.05   0.05   0.05     141      77     0.77      0.76     0.76     0.21    0.27      0.24
sum        64    0.11   0.12   0.09     286     134     1.39      3.86     8.72     0.64    0.71      0.78
1.95)     128    0.24   0.25   0.20     581     206     3.66     12.37    29.95     1.78    1.84      2.33
          256    0.47   0.50   0.37    1139     296    52.53     42.89     0.78    13.96    3.11     56.69
blunt512   16    0.03   0.03   0.03     356     200     0.55      0.25     0.91     0.14    0.16      0.09
(prefix    32    0.07   0.07   0.07     792     353     1.31      1.23     2.13     0.44    0.43      0.22
sum        64    0.18   0.19   0.14    1688     593     2.65      3.68     9.28     0.94    1.10      0.45
13.45)    128    0.39   0.40   0.28    3469     979     5.84     15.01    46.72     2.29    2.63      1.65
          256    0.74   0.75   0.54    7040    1637     9.70     57.50   164.49     5.59    5.95     10.72
blunt1024  16    0.03   0.03   0.03    1455     780     0.75      0.93     4.77     0.25    0.28      0.19
(prefix    32    0.12   0.11   0.08    3251    1432     1.81      2.02     8.92     0.60    0.62      0.31
sum        64    0.27   0.27   0.19    6976    2353     3.50      7.35    22.28     1.27    1.37      1.39
59.12)    128    0.50   0.51   0.37   14 337    3911     9.14     33.01    69.38     3.36    3.56     16.21
          256    0.97   1.00   0.68   29 417    6567    16.59    180.09   538.10     8.48    8.98     16.29
post256    16    0.02   0.03   0.03      79      61     0.46      0.18     0.13     0.10    0.11      0.24
(prefix    32    0.05   0.05   0.05     157     100     0.77      1.05     1.74     0.26    0.31      0.76
sum        64    0.10   0.11   0.10     336     167     1.37      5.26     9.72     0.61    0.70      4.67
2.16)     128    0.25   0.26   0.18     672     278     2.90     22.94    55.10     1.65    1.77     23.96
          256    0.47   0.48   0.37    1371     338    45.16     84.78     0.77    13.04    3.62    299.32
post512    16    0.02   0.02   0.03     612     454     0.63      0.20     0.44     0.14    0.16      1.21
(prefix    32    0.07   0.07   0.07    1254     699     1.30      2.49     1.95     0.38    0.42      3.36
sum        64    0.18   0.18   0.13    2559    1139     2.58     16.93    38.73     0.97    1.07      8.53
20.55)    128    0.38   0.39   0.28    5221    1936     5.84     68.25   156.15     2.27    2.44     31.54
          256    0.70   0.73   0.53   10 683    3354    12.67    256.48   726.54     5.44    5.39    140.88
post1024   16    0.03   0.04   0.03    2446    1863     0.91      2.88     5.73     0.17    0.21      2.63
(prefix    32    0.09   0.08   0.08    5056    2917     1.61     13.57    17.37     0.51    0.58     12.73
sum        64    0.25   0.26   0.17   10 340    4626     3.56     35.05    33.25     1.35    1.47     32.53
69.09)    128    0.51   0.53   0.36   21 102    7838     7.90    156.34   546.92     2.95    3.28     76.96
          256    0.95   0.98   0.67   43 652   13 770    16.30    764.59  1451.87     7.55    7.02    300.70

Averages normalized with respect to RB times (prefix-sum time not included):

P      H1     H2     RB  |    DP       MS      NC    |   DP+      MS+      NC+     EBS   |   BID
16     0.83   0.94   1.00   27 863   18 919   20.33     25.83    69.50    5.00    5.89     24.39
32     1.10   1.06   1.00   23 169   12 156   18.47     47.37    72.82    5.83    6.46     39.02
64     1.30   1.35   1.00   22 636    9 291   17.88     82.81   145.19    7.00    7.81     53.81
128    1.35   1.39   1.00   22 506    7 553   20.46    168.36   481.19    8.60    9.31     86.81
256    1.35   1.39   1.00   24 731    6 880   59.10    390.25   772.99   19.55   10.51    286.77

Averages normalized with respect to RB times (prefix-sum time included):

P      H1     H2     RB  |    DP      MS      NC    |   DP+     MS+     NC+     EBS   |   BID
16     1.00   1.00   1.00     32      23     1.08      1.04    1.09    1.01    1.02      1.03
32     1.00   1.00   1.00     66      36     1.15      1.21    1.29    1.04    1.05      1.13
64     1.00   1.00   1.00    135      58     1.27      1.97    2.90    1.11    1.12      1.55
128    1.01   1.01   1.00    275      94     1.62      4.75   10.53    1.28    1.30      3.25
256    1.02   1.02   1.00    528     142     7.98     14.64   13.71    2.95    1.55     26.73

for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and image-space decomposition for sort-first parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse matrix-vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that the exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] U.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Comput. 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with space-filling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the

conjugate gradient method SIAM J Sci Statist Comput 6 (5)

(1985) 865ndash881

[33] MU Ujaldon EL Zapata BM Chapman HP Zima Vienna-

Fortranhpf extensions for sparse and irregular problems and

their compilation IEEE Trans Parallel Distributed Systems 8

(10) (1997) 1068ndash1083

[34] MU Ujaldon EL Zapata SD Sharma J Saltz Parallelization

techniques for sparse matrix applications J Parallel Distributed

Computing 38 (1996) 256ndash266

[35] MU Ujaldon EL Zapata SD Sharma J Saltz Experimental

evaluation of efficient sparse matrix distributions in Proceedings

of the ACM International Conference of Supercomputing

Philadelphia PA (1996) pp 78ndash86

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has been recently appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


algorithm proposed by Manne and Sørevik [23] falls into this class. The parametric-search approach relies on repeatedly probing for a partition with a bottleneck value no greater than a given value. The complexity of probing is Θ(N), since each task has to be examined, but it can be reduced to O(P log N) through binary search after an initial Θ(N)-time prefix-sum operation on the task chain [18]. Later, the complexity was reduced to O(P log(N/P)) by Han et al. [12].

The parametric-search approach goes back to Iqbal's work [16], describing an ε-approximate algorithm which performs O(log(W_tot/ε)) probe calls. Here, W_tot denotes the total task weight, and ε > 0 denotes the desired accuracy. Iqbal's algorithm exploits the observation that the bottleneck value is in the range [W_tot/P, W_tot], and performs binary search in this range by making O(log(W_tot/ε)) probes. This work was followed by several exact algorithms involving efficient schemes for searching over bottleneck values by considering only subchain weights. Nicol and O'Hallaron [28,29] proposed a search scheme that requires at most 4N probes. Iqbal and Bokhari [17] relaxed the restriction of this algorithm on bounded task weight and communication cost by proposing a condensation algorithm. Iqbal [15] and Nicol [27,29] concurrently proposed an efficient search scheme that finds an optimal partition after only O(P log N) probes. Asymptotically more efficient algorithms were proposed by Frederickson [7,8] and Han et al. [12]. Frederickson proposed an O(N)-time optimal algorithm. Han et al. proposed a recursive algorithm with complexity O(N + P^(1+ε)) for any small ε > 0. These two studies have focused on asymptotic complexity, disregarding practice.

Despite these efforts, heuristics are still commonly

used in the parallel computing community, and the design of efficient heuristics is still an active area of research [24]. The reasons for preferring heuristics are ease of implementation, efficiency, the expectation that heuristics yield good partitions, and the misconception that exact algorithms are not affordable as a preprocessing step for efficient parallelization. By contrast, our work proposes efficient exact CCP algorithms. Implementation details and pseudocodes for the proposed algorithms are presented for clarity and reproducibility. We also demonstrate, through experiments on a wide range of real-world problems, that the qualities of the decompositions obtained through heuristics differ substantially from those of optimal decompositions.

For the runtime efficiency of our algorithms, we use an effective heuristic as a preprocessing step to find a good upper bound on the optimal bottleneck value. Then, we exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme is exploited in a static manner in the DP algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme is proposed for parametric-search algorithms, narrowing the separator-index ranges after each probe. The upper bound on the optimal bottleneck value is also exploited to find a much better initial partition for the iterative-refinement algorithm proposed by Manne and Sørevik [23]. We also propose a different iterative-refinement technique, which is very fast for small-to-medium numbers of processors. The observations behind this algorithm are further used to incorporate the subchain-weight concept into Iqbal's [16] approximate bisection algorithm to make it an exact algorithm.

Two applications are investigated in our experiments: sparse matrix-vector multiplication (SpMxV), which is one of the most important kernels in scientific computing, and image-space parallel volume rendering, which is widely used for scientific visualization. A distinctive feature of these two applications is that they involve integer-valued and real-valued workload arrays, respectively. Furthermore, SpMxV, a fine-grain application, demonstrates the feasibility of using optimal load balancing algorithms even in sparse-matrix decomposition. Experiments with the proposed CCP algorithms on a wide range of sparse test matrices show that 64-way decompositions can be computed 100 times faster than a single SpMxV time, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset show that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11 percent slower on average.

The remainder of this article is organized as follows.

Table 1 displays the notation used in the paper. Section 2 defines the CCP problem. A survey of existing CCP algorithms is presented in Section 3. The proposed CCP algorithms are discussed in Section 4. The load-balancing applications used in our experiments are described in Section 5, and performance results are discussed in Section 6.

2. Chains-on-chains partitioning (CCP) problem

In the CCP problem, a computational problem decomposed into a chain T = ⟨t_1, t_2, …, t_N⟩ of N tasks/modules with associated positive computational weights W = ⟨w_1, w_2, …, w_N⟩ is to be mapped onto a chain P = ⟨P_1, P_2, …, P_P⟩ of P homogeneous processors. It is worth noting that there are no precedence constraints among the tasks in the chain. A subchain of T is defined as a subset of contiguous tasks, and the subchain consisting of tasks ⟨t_i, t_{i+1}, …, t_j⟩ is denoted as T_{i,j}. The computational load W_{i,j} of subchain T_{i,j} is W_{i,j} = Σ_{h=i}^{j} w_h. From the contiguity constraint, a partition Π should map contiguous


Table 1
The summary of important abbreviations and symbols

N        number of tasks
P        number of processors
P        processor chain
P_i      ith processor in the processor chain
T        task chain, i.e., T = ⟨t_1, t_2, …, t_N⟩
t_i      ith task in the task chain
T_{i,j}  subchain of tasks starting from t_i up to t_j, i.e., T_{i,j} = ⟨t_i, t_{i+1}, …, t_j⟩
T^p_i    subproblem of p-way partitioning of the first i tasks in the task chain T
w_i      computational load of task t_i
w_max    maximum computational load among all tasks
w_avg    average computational load of all tasks
W_{i,j}  total computational load of task subchain T_{i,j}
W_tot    total computational load
W[i]     total weight of the first i tasks
Π^p_i    partition of the first i tasks in the task chain onto the first p processors in the processor chain
L_p      load of the pth processor in a partition
UB       upper bound on the value of an optimal solution
LB       lower bound on the value of an optimal solution
B*       ideal bottleneck value, achieved when all processors have equal load
B^p_i    optimal solution value for p-way partitioning of the first i tasks
s_p      index of the last task assigned to the pth processor
SL_p     lowest position for the pth separator index in an optimal solution
SH_p     highest position for the pth separator index in an optimal solution

subchains to contiguous processors. Hence, a P-way chain-partition Π^P_N of a task chain T with N tasks onto a processor chain P with P processors is described by a sequence Π^P_N = ⟨s_0, s_1, s_2, …, s_P⟩ of P+1 separator indices, where s_0 = 0 ≤ s_1 ≤ ⋯ ≤ s_P = N. Here, s_p denotes the index of the last task of the pth part, so that P_p gets the subchain T_{s_{p-1}+1, s_p} with load L_p = W_{s_{p-1}+1, s_p}. The cost C(Π) of a partition Π is determined by the maximum processor execution time among all processors, i.e., C(Π) = B = max_{1≤p≤P} {L_p}. This B value of a partition is called its bottleneck value, and the processor/part defining it is called the bottleneck processor/part. The CCP problem can be defined as finding a mapping Π_opt that minimizes the bottleneck value B_opt = C(Π_opt).

3. Previous work

Each CCP algorithm discussed in this section and in Section 4 involves an initial prefix-sum operation on the task-weight array W to enhance the efficiency of subsequent subchain-weight computations. The prefix-sum operation replaces the ith entry W[i] with the sum of the first i entries (Σ_{h=1}^{i} w_h), so that the computational load W_{i,j} of a subchain T_{i,j} can be efficiently determined as W[j] - W[i-1] in O(1) time. In our discussions, W is used to refer to the prefix-summed W-array, and the Θ(N) cost of this initial prefix-sum operation is considered in the complexity analysis. The presentations focus only on finding the bottleneck value B_opt, because a corresponding optimal mapping can be constructed easily by making a PROBE(B_opt) call, as discussed in Section 3.4.
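To make this preprocessing concrete, a minimal Python sketch of the prefix-sum operation and the O(1) subchain-weight query (function names are ours, not the paper's):

```python
# Prefix-sum preprocessing: after one O(N) pass, any subchain load W_{i,j}
# is an O(1) difference of two prefix entries.
def prefix_sum(w):
    """Return W with W[i] = w_1 + ... + w_i and W[0] = 0."""
    W = [0] * (len(w) + 1)
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi
    return W

def subchain_weight(W, i, j):
    """Load W_{i,j} of subchain T_{i,j} = <t_i, ..., t_j> (1-based, inclusive)."""
    return W[j] - W[i - 1]

W = prefix_sum([4, 1, 3, 2, 5])    # [0, 4, 5, 8, 10, 15]
print(subchain_weight(W, 2, 4))    # w_2 + w_3 + w_4 = 6
```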

3.1. Heuristics

The most commonly used partitioning heuristic is based on recursive bisection (RB). RB achieves P-way partitioning through log P bisection levels, where P is a power of 2. At each bisection step in a level, the current chain is divided evenly into two subchains. Although optimal division can be easily achieved at every bisection step, the sequence of optimal bisections may lead to poor load balancing. RB can be efficiently implemented in O(N + P log N) time by first performing a prefix-sum operation on the workload array W, with complexity O(N), and then making P-1 binary searches in the prefix-summed W-array, each with complexity O(log N).

Miguet and Pierson [24] proposed two other heuristics. The first heuristic (H1) computes the separator values such that s_p is the largest index with W_{1,s_p} ≤ pB*, where B* = W_tot/P is the ideal bottleneck value and W_tot = Σ_{i=1}^{N} w_i denotes the sum of all task weights. The second heuristic (H2) further refines the separator indices by incrementing each s_p value found in H1 if (W_{1,s_p+1} - pB*) < (pB* - W_{1,s_p}). These two heuristics can also be implemented in O(N + P log N) time by performing P-1 binary searches in the prefix-summed W-array. Miguet and Pierson [24] have already proved the upper bounds on the bottleneck values of the partitions found by H1 and H2 as B_H1, B_H2 < B* + w_max, where w_max = max_{1≤i≤N} {w_i} denotes the maximum task weight. The following lemma establishes a similar bound for the RB heuristic.
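The H1/H2 binary searches can be sketched in Python as follows (our naming; `refine=True` selects the H2 adjustment); this is a sketch of the scheme described above, not the authors' implementation:

```python
import bisect

def h1_h2(w, P, refine=False):
    """Sketch of the Miguet-Pierson H1 heuristic (H2 when refine=True).
    Returns separator indices s_0..s_P over the prefix-summed array."""
    N = len(w)
    W = [0] * (N + 1)
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi
    B_star = W[N] / P                 # ideal bottleneck value W_tot / P
    s = [0] * (P + 1)
    s[P] = N
    for p in range(1, P):
        sp = bisect.bisect_right(W, p * B_star) - 1   # largest sp: W[sp] <= p*B*
        # H2: move the separator one task right if that lands closer to p*B*
        if refine and sp < N and (W[sp + 1] - p * B_star) < (p * B_star - W[sp]):
            sp += 1
        s[p] = max(s[p - 1], min(sp, N))
    return s

def bottleneck(w, s):
    """Cost C(Pi) = maximum processor load of the partition with separators s."""
    W = [0] * (len(w) + 1)
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi
    return max(W[s[p]] - W[s[p - 1]] for p in range(1, len(s)))
```

For example, `h1_h2([4, 1, 3, 2, 5, 1], 3)` yields separators `[0, 2, 4, 6]` with bottleneck value 6.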

Lemma 3.1. Let Π_RB = ⟨s_0, s_1, …, s_P⟩ be an RB solution to a CCP problem (W, N, P). Then B_RB = C(Π_RB) satisfies B_RB ≤ B* + w_max(P-1)/P.

Proof. Consider the first bisection step. There exists a pivot index 1 ≤ i_1 ≤ N such that both sides weigh less than W_tot/2 without the i_1th task and more than W_tot/2 with it. That is,

W_{1,i_1-1}, W_{i_1+1,N} ≤ W_tot/2 ≤ W_{1,i_1}, W_{i_1,N}.

The worst case for RB occurs when w_{i_1} = w_max and W_{1,i_1-1} = W_{i_1+1,N} = (W_tot - w_max)/2. Without loss of generality, assume that t_{i_1} is assigned to the left part, so that s_{P/2} = i_1 and W_{1,s_{P/2}} = W_tot/2 + w_max/2. In a similar worst-case bisection of T_{1,s_{P/2}}, there exists an

Fig. 1. O((N-P)P)-time dynamic-programming algorithm proposed by Choi and Narahari [6] and Olstad and Manne [30].

index i_2 such that w_{i_2} = w_max and W_{1,i_2-1} = W_{i_2+1,s_{P/2}} = (W_tot - w_max)/4, and t_{i_2} is assigned to the left part, so that s_{P/4} = i_2 and W_{1,s_{P/4}} = (W_tot - w_max)/4 + w_max = W_tot/4 + (3/4)w_max. For a sequence of log P such worst-case bisection steps on the left parts, processor P_1 will be the bottleneck processor with load B_RB = W_{1,s_1} = W_tot/P + w_max(P-1)/P. □

3.2. Dynamic programming

The overlapping subproblem space can be defined as T^p_i, for p = 1, 2, …, P and i = p, p+1, …, N-P+p, where T^p_i denotes a p-way CCP of the prefix task-subchain T_{1,i} = ⟨t_1, t_2, …, t_i⟩ onto the prefix processor-subchain P_{1,p} = ⟨P_1, P_2, …, P_p⟩. Notice that index i is restricted to the range p ≤ i ≤ N-P+p because there is no merit in leaving a processor empty. From this subproblem-space definition, the optimal substructure property of the CCP problem can be shown by considering an optimal mapping Π^p_i = ⟨s_0, s_1, …, s_p = i⟩ with a bottleneck value B^p_i for the CCP subproblem T^p_i. If the last processor is not the bottleneck processor in Π^p_i, then Π^{p-1}_{s_{p-1}} = ⟨s_0, s_1, …, s_{p-1}⟩ should be an optimal mapping for the subproblem T^{p-1}_{s_{p-1}}. Hence, the recursive definition for the bottleneck value of an optimal mapping is

B^p_i = min_{p-1 ≤ j < i} { max { B^{p-1}_j, W_{j+1,i} } }.   (1)

In (1), searching for index j corresponds to searching for separator s_{p-1}, so that the remaining subchain T_{j+1,i} is assigned to the last processor P_p in an optimal mapping Π^p_i of T^p_i. The bottleneck value B^P_N of an optimal mapping can be computed using (1) in a bottom-up fashion, starting from B^1_i = W_{1,i} for i = 1, 2, …, N. An initial prefix-sum on the workload array W enables constant-time computation of subchain weights of the form W_{j+1,i} through W_{j+1,i} = W[i] - W[j]. Computing B^p_i using (1) takes O(N-p) time for each i and p, and thus the algorithm takes O((N-P)^2 P) time, since the number of distinct subproblems is equal to (N-P+1)P.

Choi and Narahari [6] and Olstad and Manne [30]

thus the algorithm takes OethethN PTHORN2PTHORN time since thenumber of distinct subproblems is equal to ethN P thorn 1THORNPChoi and Narahari [6] and Olstad and Manne [30]

reduced the complexity of this scheme to OethNPTHORN andOethethN PTHORNPTHORN respectively by exploiting the followingobservations that hold for positive task weights For afixed p in (1) the minimum index value j

pi defining B

pi

cannot occur at a value less than the minimum indexvalue j

pi1 defining B

pi1 ie j

pi1pj

pi pethi 1THORN Hence the

search for the optimal jpi can start from j

pi1 In (1) B

p1j

for a fixed p is a nondecreasing function of j and Wjthorn1ifor a fixed i is a decreasing function of j reducing to 0 atj frac14 i Thus two cases occur in a semi-closed intervalfrac12 j

pi1 iTHORN for j If Wjthorn1i4B

p1j initially then these two

functions intersect in frac12 jpi1 iTHORN In this case the search for

jpi continues until Wj13thorn1ipB

p1j and then only j13 and

j13 1 are considered for setting jpi with j

pi frac14 j13 if

Bp1j13 pWji and j

pi frac14 j13 1 otherwise Note that this

scheme automatically detects jpi frac14 i 1 if Wjthorn1i and

Bp1j intersect in the open interval ethi 1 iTHORN However if

Wjthorn1ipBp1j initially then B

p1j lies above Wjthorn1i in the

closed interval frac12 jpi1 i In this case the minimum value

occurs at the first value of j ie jpi frac14 j

pi1 These

improvements lead to an OethethN PTHORNPTHORN-time algorithmsince computation of all B

pi values for a fixed p makes

OethN PTHORN references to already computed Bp1j values

Fig 1 displays a run-time efficient implementation ofthis OethethN PTHORNPTHORN-time DP algorithm which avoids theexplicit minndashmax operation required in (1) In Fig 1 B

pi

values are stored in a table whose entries are computedin row-major order
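Recurrence (1) together with the monotonicity of the optimal separator index can be sketched in Python as follows. This simplified version (our naming) advances j while moving the separator right does not worsen the max, rather than reproducing the exact j*/j*-1 case analysis of Fig. 1, but it rests on the same observations:

```python
def dp_partition(w, P):
    """Sketch of the DP of recurrence (1) with the monotone-separator
    speedup; returns the optimal bottleneck value B^P_N."""
    N = len(w)
    W = [0] * (N + 1)                 # prefix-summed workload array
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi
    INF = float("inf")
    # B[p][i]: optimal bottleneck for p-way partitioning of the first i tasks
    B = [[INF] * (N + 1) for _ in range(P + 1)]
    for i in range(N + 1):
        B[1][i] = W[i]
    for p in range(2, P + 1):
        j = p - 1                     # optimal separator is nondecreasing in i
        for i in range(p, N - P + p + 1):
            # advance j while the max of the two cost terms does not increase
            while j < i - 1 and max(B[p - 1][j + 1], W[i] - W[j + 1]) <= \
                    max(B[p - 1][j], W[i] - W[j]):
                j += 1
            B[p][i] = max(B[p - 1][j], W[i] - W[j])
    return B[P][N]
```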

3.3. Iterative refinement

The algorithm proposed by Manne and Sørevik [23], referred to here as the MS algorithm, finds a sequence of nonoptimal partitions such that there is only one way each partition can be improved. For this purpose, they introduce the leftist partition (LP). Consider a partition Π such that P_p is the leftmost processor containing at least two tasks. Π is defined as an LP if increasing the load of any processor P_c that lies to the right of P_p, by augmenting the last task of P_{c-1} to P_c, makes P_c a bottleneck processor with a load ≥ C(Π). Let Π be an LP with bottleneck processor P_b and bottleneck value B. If P_b contains only one task, then Π is optimal. On the other hand, assume that P_b contains at least two tasks. The refinement step, which is shown by the inner while-loop in Fig. 2, tries to find a new LP of lower cost by successively removing the first task of P_p and augmenting it to P_{p-1}, for p = b, b-1, …, until L_p < B. Refinement fails when the while-loop proceeds until p = 1 with L_p ≥ B. Manne and Sørevik proved that a successful refinement of an LP gives a new LP, and that the LP is optimal if the refinement fails. They proposed using an initial LP in which the P-1 leftmost processors each has only one task and the last processor contains the remaining tasks. The MS algorithm moves each separator index at most N-P times, so that the total number of separator-index moves is O(P(N-P)). A max-heap is maintained for the processor loads to find a bottleneck processor at the beginning of each repeat-loop iteration. The cost of each separator-index move is O(log P), since it necessitates one decrease-key and one increase-key operation. Thus, the complexity of the MS algorithm is O(P(N-P) log P).
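The refinement loop can be sketched as follows; this simplified version (our naming) finds the bottleneck by a linear scan instead of the max-heap, so it does not attain the O(P(N-P) log P) bound, but it follows the same leftist-partition refinement:

```python
def ms_refine(w, P):
    """Simplified sketch of the MS leftist-partition refinement; returns the
    optimal bottleneck value (assumes N >= P and positive weights)."""
    N = len(w)
    W = [0] * (N + 1)
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi
    # initial LP: the P-1 leftmost processors get one task each
    s = list(range(P)) + [N]                # separator indices s_0 .. s_P
    load = lambda p: W[s[p]] - W[s[p - 1]]
    while True:
        b = max(range(1, P + 1), key=load)  # a bottleneck processor
        B = load(b)
        if s[b] - s[b - 1] == 1:
            return B                        # single-task bottleneck: optimal
        p = b
        while p > 1:
            s[p - 1] += 1                   # move first task of P_p to P_{p-1}
            if load(p - 1) < B:
                break                       # successful refinement: new LP
            p -= 1
        else:
            return B                        # refinement failed: LP is optimal
```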

3.4. Parametric search

The parametric-search approach relies on repeated probing for a partition Π with a bottleneck value no

Fig. 2. Iterative refinement algorithm proposed by Manne and Sørevik [23].

Fig. 3. (a) Standard probe algorithm with O(P log N) complexity. (b) O(P log(N/P))-time probe algorithm proposed by Han et al. [12].

with load L1 frac14 W1s1 Hence the first task in the secondprocessor is ts1thorn1 PROBE then similarly finds thelargest index s2 so that Ws1thorn1s2pB and assigns thesubchain T s1thorn1s2 to processor P2 This process continuesuntil either all tasks are assigned or all processors areexhausted The former and latter cases indicate feasi-bility and infeasibility of B respectivelyFig 3(a) illustrates the standard probe algorithm The

indices s1 s2y sP1 are efficiently found throughbinary search (BINSRCH) on the prefix-summed W-array In this figure BINSRCHethW iNBsumTHORN searchesW in the index range frac12iN to compute the index ipjpN

such that Wfrac12 jpBsum and Wfrac12 j thorn 14Bsum Thecomplexity of the standard probe algorithm isOethP log NTHORN Han et al [12] proposed an OethP log N=PTHORN-time probe algorithm (see Fig 3(b)) exploiting P

repeated binary searches on the same W arraywith increasing search values Their algorithm dividesthe chain into P subchains of equal length Ateach probe a linear search is performed on theweights of the last tasks of these P subchains to findout in which subchain the search value could beand then binary search is performed on the respectivesubchain of length N=P Note that since the probesearch values always increase linear search can beperformed incrementally that is search continuesfrom the last subchain that was searched to the rightwith OethPTHORN total cost This gives a total cost ofOethP logethN=PTHORNTHORN for P binary searches thus for the probefunction

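The standard probe of Fig. 3(a) can be sketched in Python as follows (our naming; the early exit corresponds to all tasks being assigned before the processors are exhausted):

```python
import bisect

def probe(W, P, B):
    """Greedy probe sketch: W is the prefix-summed workload array; returns
    True iff some P-way partition has bottleneck value at most B (task
    weights assumed positive)."""
    N = len(W) - 1
    Bsum = B
    for _ in range(P - 1):
        # largest index sp with W[sp] <= Bsum (binary search on prefix sums)
        sp = bisect.bisect_right(W, Bsum) - 1
        if sp == N:                  # all tasks assigned: B is feasible
            return True
        Bsum = W[sp] + B             # next processor may be loaded up to W[sp] + B
    return W[N] <= Bsum
```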


3.4.1. Bisection as an approximation algorithm

Let f(B) be the binary-valued function, where f(B) = 1 if PROBE(B) is true and f(B) = 0 if PROBE(B) is false. Clearly, f(B) is nondecreasing in B, and B_opt lies between LB = B* = W_tot/P and UB = W_tot. These observations are exploited in the bisection algorithm, leading to an efficient ε-approximate algorithm, where ε is the desired precision. The interval [W_tot/P, W_tot] is conceptually discretized into (W_tot - W_tot/P)/ε bottleneck values, and binary search is used in this range to find the minimum feasible bottleneck value B_opt. The bisection algorithm, as illustrated in Fig. 4, performs O(log(W_tot/ε)) PROBE calls, and each PROBE call costs O(P log(N/P)). Hence, the bisection algorithm runs in O(N + P log(N/P) log(W_tot/ε)) time, where the O(N) cost comes from the initial prefix-sum operation on W. The performance of this algorithm deteriorates when log(W_tot/ε) is comparable with N.
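A sketch of the bisection algorithm (our naming; the loop maintains the invariant that `hi` is always a feasible bottleneck value):

```python
import bisect

def bisect_bottleneck(w, P, eps=1e-7):
    """Sketch of the epsilon-approximate bisection: binary search on B over
    [W_tot/P, W_tot], with one greedy probe per step."""
    W = [0] * (len(w) + 1)
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi

    def probe(B):                    # standard greedy probe on prefix sums
        Bsum = B
        for _ in range(P - 1):
            sp = bisect.bisect_right(W, Bsum) - 1
            Bsum = W[sp] + B
        return W[-1] <= Bsum

    lo, hi = W[-1] / P, float(W[-1])  # B_opt lies in [W_tot/P, W_tot]
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(mid):
            hi = mid                 # feasible: tighten the upper bound
        else:
            lo = mid                 # infeasible: raise the lower bound
    return hi
```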

3.4.2. Nicol's algorithm

Nicol's algorithm [27] exploits the fact that any candidate B value is equal to the weight W_{i,j} of some subchain. A naive solution is to generate all subchain weights of the form W_{i,j}, sort them, and then use binary search to find the minimum W_{a,b} value for which PROBE(W_{a,b}) = TRUE. Nicol's algorithm efficiently searches for the earliest range W_{a,b} for which B_opt = W_{a,b}, by considering each processor in order as a candidate bottleneck processor in an optimal mapping. Let Π_opt be the optimal mapping constructed by greedy PROBE(B_opt), and let processor P_b be the first bottleneck processor, with load B_opt, in Π_opt = ⟨s_0, s_1, …, s_b, …, s_P⟩. Under these assumptions, this greedy construction of Π_opt ensures that each processor P_p preceding P_b is loaded as much as possible, with L_p < B_opt, for p = 1, 2, …, b-1 in Π_opt. Here, PROBE(L_p) = FALSE since L_p < B_opt, and PROBE(L_p + w_{s_p+1}) = TRUE since adding one more task to processor P_p increases its load to L_p + w_{s_p+1} > B_opt. Hence, if b = 1, then s_1 is equal to the smallest index i_1 such that PROBE(W_{1,i_1}) = TRUE, and

Fig. 4. Bisection as an ε-approximation algorithm.

B_opt = B_1 = W_{1,s_1}. However, if b > 1, then because of the greedy-choice property, P_1 should be loaded as much as possible without exceeding B_opt = B_b < B_1, which implies that s_1 = i_1 - 1, and hence L_1 = W_{1,i_1-1}. If b = 2, then s_2 is equal to the smallest index i_2 such that PROBE(W_{i_1,i_2}) = TRUE, and B_opt = B_2 = W_{i_1,i_2}. If b > 2, then s_2 = i_2 - 1. We iterate for b = 1, 2, …, P-1, computing i_b as the smallest index for which PROBE(W_{i_{b-1},i_b}) = TRUE, and save B_b = W_{i_{b-1},i_b}, with i_P = N. Finally, the optimal bottleneck value is selected as B_opt = min_{1≤b≤P} B_b.

Fig. 5 illustrates Nicol's algorithm. As seen in this figure, given i_{b-1}, i_b is found by performing a binary search over all subchain weights of the form W_{i_{b-1},j}, for i_{b-1} ≤ j ≤ N, in the bth iteration of the for-loop. Hence, Nicol's algorithm performs O(log N) PROBE calls to find i_b at iteration b, and each probe call costs O(P log(N/P)). Thus, the cost of computing an individual B_b value is O(P log N log(N/P)). Since P-1 such B_b values are computed, the overall complexity of Nicol's algorithm is O(N + P^2 log N log(N/P)), where the O(N) cost comes from the initial prefix-sum operation on W.

Two possible implementations of Nicol's algorithm are presented in Fig. 5. Fig. 5(a) illustrates a straightforward implementation, whereas Fig. 5(b) illustrates a careful implementation, which maintains and uses the information from previous probes to answer without calling the PROBE function. As seen in Fig. 5(b), this information is efficiently maintained as an undetermined bottleneck-value range (LB, UB), which is dynamically refined after each probe. Any bottleneck value encountered outside the current range is immediately accepted or rejected. Although this simple scheme does not improve the asymptotic complexity of the algorithm, it drastically reduces the number of probes, as discussed in Section 6.
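The straightforward implementation in the style of Fig. 5(a) can be sketched as follows (our naming; the dynamic (LB, UB) bounding of Fig. 5(b) is omitted for brevity):

```python
import bisect

def nicol(w, P):
    """Sketch of Nicol's search: for each candidate bottleneck processor b,
    binary-search for the smallest i_b whose subchain weight is feasible;
    the answer is the minimum candidate B_b."""
    N = len(w)
    W = [0] * (N + 1)
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi

    def probe(B):                       # standard greedy probe
        Bsum = B
        for _ in range(P - 1):
            sp = bisect.bisect_right(W, Bsum) - 1
            Bsum = W[sp] + B
        return W[N] <= Bsum

    B_opt = float("inf")
    prev = 0                            # prev = i_{b-1} - 1: tasks 1..prev settled
    for b in range(1, P):
        # smallest i_b with PROBE(weight of tasks prev+1 .. i_b) == TRUE
        lo, hi = prev, N                # probe false at lo (weight 0), true at hi
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if probe(W[mid] - W[prev]):
                hi = mid
            else:
                lo = mid
        B_opt = min(B_opt, W[hi] - W[prev])   # candidate B_b
        prev = hi - 1                   # s_b = i_b - 1 when P_b is not the bottleneck
    return min(B_opt, W[N] - W[prev])   # last processor as candidate bottleneck
```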

4. Proposed CCP algorithms

In this section, we present the proposed methods. First, we describe how to bound the separator indices for an optimal solution to reduce the search space. Then, we show how this technique can be used to improve the performance of the dynamic-programming algorithm. We continue with our discussion on improving the MS algorithm, and propose a novel refinement algorithm, which we call the bidding algorithm. Finally, we discuss parametric-search methods, proposing improvements for the bisection and Nicol's algorithms.

4.1. Restricting the search space

Our proposed CCP algorithms exploit lower and upper bounds on the optimal bottleneck value to restrict

Fig. 5. Nicol's [27] algorithm: (a) straightforward implementation; (b) careful implementation with dynamic bottleneck-value bounding.

the search space for separator values as a preprocessing step. Natural lower and upper bounds for the optimal bottleneck value B_opt of a given CCP problem instance (W, N, P) are LB = max{B*, w_max} and UB = B* + w_max, respectively, where B* = W_tot/P. Since w_max < B* in coarse-grain parallelization of most real-world applications, our presentation will be for w_max < B* = LB, even though all results are valid when B* is replaced with max{B*, w_max}. The following lemma describes how to use these natural bounds on B_opt to restrict the search space for the separator values.

Lemma 4.1. For a given CCP problem instance (W, N, P), if B_f is a feasible bottleneck value in the range [B*, B* + w_max], then there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B_f with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p and SH_p are, respectively, the smallest and largest indices such that

W_{1,SL_p} ≥ p(B* - w_max(P-p)/P)  and
W_{1,SH_p} ≤ p(B* + w_max(P-p)/P).

Proof. Let B_f = B* + w, where 0 ≤ w < w_max. Partition Π can be constructed by PROBE(B_f), which loads the first p processors as much as possible, subject to L_q ≤ B_f for q = 1, 2, …, p. In the worst case, w_{s_p+1} = w_max for each of the first p processors. Thus, we have W_{1,s_p} ≥ f(w) = p(B* + w - w_max) for p = 1, 2, …, P-1. However, it should be possible to divide the remaining subchain T_{s_p+1,N} into P-p parts without exceeding B_f, i.e., W_{s_p+1,N} ≤ (P-p)(B* + w). Thus, we also have W_{1,s_p} ≥ g(w) = W_tot - (P-p)(B* + w). Note that f(w) is an increasing function of w, whereas g(w) is a decreasing function of w. The minimum of max{f(w), g(w)} is at the intersection of f(w) and g(w), so that W_{1,s_p} ≥ p(B* - w_max(P-p)/P).

To prove the upper bounds, we can start with W_{1,s_p} ≤ f(w) = p(B* + w), which holds when L_q = B* + w for q = 1, …, p. The condition W_{s_p+1,N} ≥ (P-p)(B* + w - w_max), however, ensures the feasibility of B_f = B* + w, since PROBE(B_f) can always load each of the remaining (P-p) processors with B* + w - w_max. Thus, we also have W_{1,s_p} ≤ g(w) = W_tot - (P-p)(B* + w - w_max). Here, f(w) is an increasing function of w, whereas g(w) is a decreasing function of w, which yields W_{1,s_p} ≤ p(B* + w_max(P-p)/P). □

Corollary 4.2. The separator range weights are ΔW_p = Σ_{i=SL_p}^{SH_p} w_i = W_{1,SH_p} - W_{1,SL_p} = 2 w_max p(P-p)/P, with a maximum value of P w_max/2 at p = P/2.

Applying this corollary requires finding wmax, which entails an overhead equivalent to that of the prefix-sum operation and hence should be avoided. In this work, we adopt a practical scheme to construct the bounds on separator indices. We run the RB heuristic to find a hopefully good bottleneck value BRB and use BRB as an upper bound on bottleneck values, i.e., UB = BRB. Then we run LR-PROBE(BRB) and RL-PROBE(BRB) to construct two mappings Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨ℓ2_0, ℓ2_1, …, ℓ2_P⟩ with C(Π1), C(Π2) ≤ BRB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered as the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left; that is, RL-PROBE assigns subchains from the right end towards the left end of the task chain to processors in the order PP, PP−1, …, P1. From these two mappings, lower and upper bound values for the sp separator indices are constructed as SLp = ℓ2_p and SHp = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings Π3 = ⟨ℓ3_0, ℓ3_1, …, ℓ3_P⟩ and Π4 = ⟨h4_0, h4_1, …, h4_P⟩, and then defining SLp = max{SLp, ℓ3_p} and SHp = min{SHp, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.
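As a concrete illustration, here is a hypothetical C sketch of the two greedy probes (not the paper's Fig. 3 pseudocode) on a prefix-summed workload array W, with W[0] = 0 and W[i] = w_1 + … + w_i. lr_probe assigns maximal prefixes left to right and yields the high indices h_p; rl_probe assigns maximal suffixes right to left and yields the low indices ℓ_p; running both with B = BRB gives SLp = ℓ_p and SHp = h_p. Function names and the array convention are ours.

```c
#include <stddef.h>

/* Left-to-right greedy probe: each processor takes the longest prefix of the
 * remaining chain whose load stays within B.  Separators go to s[0..P];
 * the return value reports feasibility of B. */
int lr_probe(const double *W, size_t N, int P, double B, size_t *s)
{
    s[0] = 0;
    for (int p = 1; p <= P; p++) {
        size_t lo = s[p - 1], hi = N;
        double limit = W[s[p - 1]] + B;      /* processor p may load up to B */
        while (lo < hi) {                    /* largest j with W[j] <= limit */
            size_t mid = (lo + hi + 1) / 2;
            if (W[mid] <= limit) lo = mid; else hi = mid - 1;
        }
        s[p] = lo;
    }
    return s[P] == N;                        /* feasible iff no task is left */
}

/* Right-to-left dual: maximal suffixes, assigned to P_P, ..., P_1. */
int rl_probe(const double *W, size_t N, int P, double B, size_t *s)
{
    s[P] = N;
    for (int p = P; p >= 1; p--) {
        size_t lo = 0, hi = s[p];
        double limit = W[s[p]] - B;          /* smallest j with W[j] >= limit */
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (W[mid] >= limit) hi = mid; else lo = mid + 1;
        }
        s[p - 1] = lo;
    }
    return s[0] == 0;                        /* feasible iff chain is covered */
}
```

For the four-task chain with weights ⟨2, 3, 4, 1⟩ and P = 2, both probes at B = 5 place the single separator at index 2, so SL1 = SH1 = 2.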

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value Bf, let Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨ℓ2_0, ℓ2_1, …, ℓ2_P⟩ be the partitions constructed by LR-PROBE(Bf) and RL-PROBE(Bf), respectively. Then any partition Π = ⟨s0, s1, …, sP⟩ of cost C(Π) = B ≤ Bf satisfies ℓ2_p ≤ sp ≤ h1_p.

Proof. By the property of LR-PROBE(Bf), h1_p is the largest index such that T(1, h1_p) can be partitioned into p parts without exceeding Bf. If sp > h1_p, then the bottleneck value will exceed Bf, and thus B. By the property of RL-PROBE(Bf), ℓ2_p is the smallest index such that T(ℓ2_p+1, N) can be partitioned into P−p parts without exceeding Bf. If sp < ℓ2_p, then the bottleneck value will exceed Bf, and thus B. □

Lemma 4.4. For a given CCP problem instance (W, N, P), let Π3 = ⟨ℓ3_0, ℓ3_1, …, ℓ3_P⟩ and Π4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B*) and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value Bf, there exists a partition Π = ⟨s0, s1, …, sP⟩ of cost C(Π) ≤ Bf that satisfies ℓ3_p ≤ sp ≤ h4_p.

Proof. Consider the partition Π = ⟨s0, s1, …, sP⟩ constructed by LR-PROBE(Bf). It is clear that this partition already satisfies the lower bounds, i.e., sp ≥ ℓ3_p. Assume sp > h4_p; then the partition Π′ obtained by moving sp back to h4_p also yields a partition with cost C(Π′) ≤ Bf, since T(h4_p+1, N) can be partitioned into P−p parts without exceeding B*. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions with cost ≤ Bf within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance (W, N, P) and a feasible bottleneck value Bf, let Π1 = ⟨h1_0, h1_1, …, h1_P⟩, Π2 = ⟨ℓ2_0, ℓ2_1, …, ℓ2_P⟩, Π3 = ⟨ℓ3_0, ℓ3_1, …, ℓ3_P⟩, and Π4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(Bf), RL-PROBE(Bf), LR-PROBE(B*), and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B in the range [B*, Bf], there exists a partition Π = ⟨s0, s1, …, sP⟩ of cost C(Π) ≤ B with SLp ≤ sp ≤ SHp for 1 ≤ p < P, where SLp = max{ℓ2_p, ℓ3_p} and SHp = min{h1_p, h4_p}.

Corollary 4.6. The separator range weights become ΔWp = 2 min{p, P−p} wmax in the worst case, with a maximum value P wmax at p = P/2.

Lemma 4.1 and Corollary 4.5 infer the following theorem, since B* ≤ Bopt ≤ B* + wmax.

Theorem 4.7. For a given CCP problem instance (W, N, P), and SLp and SHp index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition Πopt = ⟨s0, s1, …, sP⟩ with SLp ≤ sp ≤ SHp for 1 ≤ p < P.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the separator range weights produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however, and the practical scheme normally finds much better bounds, since order in the chain usually prevents the worst-case behavior and BRB < B* + wmax. Experimental results in Section 6 justify this expectation.

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size ΔSp = SHp − SLp + 1 denotes the number of tasks within the pth range [SLp, SHp]. Miguet and Pierson [24] propose the model wi = Θ(wavg) for i = 1, 2, …, N to prove that their H1 and H2 heuristics allocate Θ(N/P) tasks to each processor, where wavg = Wtot/N is the average task weight. This assumption means that the weight of each task is not too far from the average task weight. Using Corollaries 4.2 and 4.5, this model induces ΔSp = O(P wmax/wavg). Moreover, this model can be exploited to induce the optimistic bound ΔSp = O(P). However, we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from wavg. Here, we establish a looser and more realistic model on task weights, such that for any subchain T(i, j) with weight W(i, j) sufficiently larger than wmax, the average task weight within the subchain is O(wavg); that is, Δ(i, j) = j − i + 1 = O(W(i, j)/wavg). This model, referred to here as model M, directly induces ΔSp = O(P wmax/wavg), since ΔWp ≤ ΔW(P/2) = P wmax/2 for p = 1, 2, …, P−1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits the bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index bound arrays, each of size P, computed according to Corollary 4.5 with Bf = BRB. Note that SL_P = SH_P = N, since only B[P][N] needs to be computed in the last row. As seen in Fig. 6, only the B[p][j] values for j = SLp, SLp+1, …, SHp are computed at each row p, by exploiting Corollary 4.5, which ensures the existence of an optimal partition Πopt = ⟨s0, s1, …, sP⟩ with SLp ≤ sp ≤ SHp. Thus, these B[p][j] values suffice for the correct computation of the B[p+1][i] values for i = SL(p+1), SL(p+1)+1, …, SH(p+1) at the next row p+1.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SHp to SHp+1 within the repeat–until loop while computing B[p+1][i] with SL(p+1) ≤ i ≤ SH(p+1) in two cases. In both cases, the functions W(j+1, i) and B[p][j] intersect in the open interval (SHp, SHp+1), so that B[p][SHp] < W(SHp+1, i) and B[p][SHp+1] ≥ W(SHp+2, i). In the first case, i = SHp+1, so that W(j+1, i) and B[p][j] intersect in (i−1, i), which implies that B[p+1][i] = W(i−1, i) with defining index j[p+1][i] = SLp, since W(i−1, i) < B[p][i], as mentioned in Section 3.2. In the second case, i > SHp+1, for which Corollary 4.5 guarantees that B[p+1][i] = W(SLp+1, i) ≤ B[p][SHp+1], and thus we can safely select j[p+1][i] = SLp. Note that W(SLp+1, i) = B[p][SHp+1] may correspond to a case leading to another optimal partition with j[p+1][i] = s(p+1) = SHp+1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ to B[p][SHp+1] as a sentinel. Hence, in such cases, the condition W(SHp+1, i) < B[p][SHp+1] = ∞ in the if–then statement following the repeat–until loop is always true, so that the j-index automatically moves back to SHp. The scheme of computing B[p][SHp+1] for each row p, which seems a natural solution, does not work, since the correct computation of B[p+1][SH(p+1)+1] may necessitate more than one B[p][j] value beyond the SHp index bound.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a P×N matrix to store the minimum j[p][i] index values defining the B[p][i] values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space for at least one optimal solution. The index bounds can be computed according to Lemma 4.4 for this purpose, since the search space restricted by Lemma 4.4 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔSp). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔSp = O(P wmax/wavg), and hence the complexity is O(N + P log N + P² wmax/wavg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition wmax = O(2Wtot/P²).
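For reference, the unbounded DP recurrence that DP+ accelerates, B[p][j] = min over i of max(B[p−1][i], W(i+1, j)), can be sketched naively as below. This is a correctness baseline only (O(N²P) work, no separator-index bounding, no sentinel trick); identifiers and the 0-based prefix-sum convention are ours.

```c
#include <float.h>

/* Naive CCP dynamic program on a prefix-summed array W (W[0] = 0):
 * B[p][j] = optimal bottleneck of partitioning tasks 1..j into p parts.
 * Returns B[P][N], the optimal bottleneck value. */
double ccp_dp(const double *W, int N, int P)
{
    double B[P + 1][N + 1];                  /* C99 variable-length array  */
    for (int j = 0; j <= N; j++)
        B[1][j] = W[j];                      /* one part: whole prefix     */
    for (int p = 2; p <= P; p++)
        for (int j = 0; j <= N; j++) {
            B[p][j] = DBL_MAX;
            for (int i = 0; i <= j; i++) {   /* last part is tasks i+1..j  */
                double cost = B[p - 1][i];
                if (W[j] - W[i] > cost) cost = W[j] - W[i];
                if (cost < B[p][j]) B[p][j] = cost;
            }
        }
    return B[P][N];
}
```

On the chain ⟨2, 3, 4, 1⟩, the optimal 2-way and 3-way bottlenecks are both 5, which DP+ reproduces while touching only the bounded index ranges.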

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for small-to-medium numbers of processors. The main difference between the MS and the bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value found becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as the initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

Fig. 7. Bidding algorithm.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Πt = ⟨s0, s1, …, sP⟩ constructed by PROBE(Bt) for an infeasible Bt. After detecting the infeasibility of this Bt value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > Bt will never be to the left of the respective separator indices of Πt. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current Bt value (i.e., LP > Bt). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P}{Lp + w(sp+1)}, LP}. Here, we call the value Lp + w(sp+1) the bid of processor Pp, which refers to the load of Pp if the first task t(sp+1) of the next processor is augmented to Pp. The bid of the last processor PP is equal to the load of the remaining tasks. If the smallest bid B comes from processor Pb, probing with the new B is performed only for the remaining processors ⟨Pb, Pb+1, …, PP⟩ in the suffix W(s(b−1)+1, N) of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while-loop implements a linear probing scheme, such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index sp is set for processor Pp during linear probing, the repeat–until loop terminates if it is not possible to partition the remaining subchain T(sp+1, N) into P−p processors without exceeding the current B value, i.e., rbid = Lr/(P−p) > B, where Lr denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value of the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P wmax + P²(wmax/wavg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial settings of the separators through binary search. The B value is increased at most Bopt − B* < wmax times, and each time the next B value can be computed in O(P) time, which induces the O(P wmax) cost. The total area scanned by the separators is at most O(P²(wmax/wavg)). For noninteger task weights, the complexity can reach O(P log N + P³(wmax/wavg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move, due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
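The bid formula itself is a single pass over the separators. Below is a hypothetical C sketch (our names, not the paper's Fig. 7): W is the prefix-summed array and s[0..P] holds the separators of the failed probe, with sp < N for p < P, as is the case after an infeasible greedy probe.

```c
#include <stddef.h>

/* Next candidate bottleneck value after an infeasible probe:
 *   min( min_{p<P} { L_p + w_{s_p+1} },  load of all remaining tasks ). */
double next_bid(const double *W, const size_t *s, size_t N, int P)
{
    double best = W[N] - W[s[P - 1]];            /* bid of last processor   */
    for (int p = 1; p < P; p++) {
        double bid = (W[s[p]] - W[s[p - 1]])     /* L_p                     */
                   + (W[s[p] + 1] - W[s[p]]);    /* + w_{s_p + 1}           */
        if (bid < best) best = bid;
    }
    return best;
}
```

For weights ⟨2, 3, 4, 1⟩, P = 2, and the partition ⟨0, 1, 2⟩ left by an infeasible probe at B = 4, the bids are 2+3 = 5 and 10−2 = 8, so the next value investigated is 5, which is feasible and optimal.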


4.4. Parametric search algorithms

In this work, we apply the theoretical findings of Section 4.1 for an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with Bf = BRB) to restrict the search space for the sp separator values during the binary searches on the W array. That is, BINSRCH(W, SLp, SHp, Bsum) in RPROBE searches W in the index range [SLp, SHp] to find the index SLp ≤ sp ≤ SHp such that W[sp] ≤ Bsum and W[sp+1] > Bsum, via binary search. This scheme and Corollaries 4.2 and 4.6 reduce the complexity of an individual probe to Σ_{p=1}^{P} Θ(log ΔSp) = O(P log P + P log(wmax/wavg)). Note that this complexity reduces to O(P log P) for sufficiently large P, i.e., when wmax/wavg = O(P). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, BRB], as opposed to [B*, Wtot]. In this algorithm, if PROBE(Bt) = TRUE, then the search space is restricted to values B ≤ Bt, and if PROBE(Bt) = FALSE, then the search space is restricted to values B > Bt. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator-index bounds depending on the success or failure of the probes. Let Πt = ⟨t0, t1, …, tP⟩ be the partition constructed by PROBE(Bt). Any future PROBE(B) call with B ≤ Bt will set the sp indices with sp ≤ tp; thus, the search space for sp can be restricted to the indices ≤ tp. Similarly, any future PROBE(B) call with B ≥ Bt will set the sp indices ≥ tp; thus, the search space for sp can be restricted to the indices ≥ tp.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for-loop over the SL or SH arrays, depending on the failure or success, respectively, of RPROBE(Bt). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π or SH ← Π, depending on the failure or success of RPROBE(Bt).

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.


Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, BRB < B* + wmax. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log wmax, and thus the overall complexity is O(N + P log N + log(wmax)(P log P + P log(wmax/wavg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and computing the separator-index bounds SL and SH according to Corollary 4.5.
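The bisection loop itself is tiny; here is a hedged C sketch with the feasibility test abstracted behind a callback (the oracle ge5 below is a toy stand-in for illustration, not a CCP probe):

```c
/* Bisection over bottleneck values in [lb, ub]: each probe halves the
 * interval; for integer weights, eps = 1 makes the result exact. */
double bisect(double lb, double ub, double eps,
              int (*probe)(double B, void *ctx), void *ctx)
{
    while (ub - lb > eps) {
        double mid = lb + (ub - lb) / 2.0;
        if (probe(mid, ctx))
            ub = mid;    /* feasible: the optimum is <= mid */
        else
            lb = mid;    /* infeasible: the optimum is > mid */
    }
    return ub;           /* smallest feasible value found, within eps */
}

/* toy oracle: pretend every bottleneck value >= 5 is feasible */
static int ge5(double B, void *ctx) { (void)ctx; return B >= 5.0; }
```

With a real probe plugged in, starting from lb = B* and ub = BRB gives the ε-BISECT+ behavior described above.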

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value (the total weight of a subchain of W). This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps to find the optimal bottleneck value.

After a probe RPROBE(Bt), the current upper bound value UB is modified if RPROBE(Bt) succeeds. Note that RPROBE(Bt) not only determines the feasibility of Bt, but also constructs a partition Πt with cost(Πt) ≤ Bt. Instead of reducing the upper bound UB to Bt, we can further reduce UB to the bottleneck value B = cost(Πt) ≤ Bt of the partition Πt constructed by RPROBE(Bt). Similarly, the current lower bound LB is modified when RPROBE(Bt) fails. In this case, instead of increasing the bound LB to Bt, we can exploit the partition Πt constructed by RPROBE(Bt) to increase LB further, to the smallest realizable bottleneck value B greater than Bt. Our bidding algorithm already describes how to compute

B = min{min_{1≤p<P}{Lp + w(sp+1)}, LP},

where Lp denotes the load of processor Pp in Πt. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of


the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(wmax/wavg)),

which has the solution O(P log P log N + P log N log(wmax/wavg)). Here, O(P log P + P log(wmax/wavg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(wmax/wavg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
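Under the stated assumptions (prefix-summed W, a greedy linear probe), the whole scheme can be sketched as follows; pass() and exact_bisect() are our names, and the sketch omits the RPROBE index restrictions of the paper's Fig. 9.

```c
#include <stddef.h>

/* One greedy pass: returns 1 and the realized cost if B is feasible,
 * else 0 and the smallest realizable value > B (the best bid). */
static int pass(const double *W, size_t N, int P, double B,
                double *cost, double *bid)
{
    size_t pos = 0;
    double worst = 0.0, best_bid = -1.0;
    for (int p = 1; p < P && pos < N; p++) {
        size_t j = pos;
        while (j < N && W[j + 1] - W[pos] <= B) j++;   /* longest prefix */
        double load = W[j] - W[pos];
        if (load > worst) worst = load;
        if (j < N) {                                   /* bid: one more task */
            double b = load + (W[j + 1] - W[j]);
            if (best_bid < 0 || b < best_bid) best_bid = b;
        }
        pos = j;
    }
    double rest = W[N] - W[pos];                       /* last processor */
    if (rest <= B) { *cost = rest > worst ? rest : worst; return 1; }
    if (best_bid < 0 || rest < best_bid) best_bid = rest;
    *bid = best_bid;
    return 0;
}

/* Bisection restricted to realizable bottleneck values: on success the
 * upper bound drops to the constructed partition's cost, on failure the
 * lower bound rises to the smallest realizable value above B. */
double exact_bisect(const double *W, size_t N, int P)
{
    double lb = W[N] / P, ub = W[N], cost, bid;
    while (lb < ub) {
        double B = lb + (ub - lb) / 2.0;
        if (pass(W, N, P, B, &cost, &bid))
            ub = cost;
        else
            lb = bid;
    }
    return ub;
}
```

Both bound updates land on subchain weights, so the loop terminates with lb = ub at the optimum rather than approximating it.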

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W(1, j)) performed by Nicol's algorithm for processor P1 to find the smallest index j = i1 such that PROBE(W(1, j)) = TRUE. Starting from a naive bottleneck-value range (LB0 = 0, UB0 = Wtot), the success and failure of these probes can narrow this range to (LB1, UB1). That is, each PROBE(W(1, j)) = TRUE decreases the upper bound to W(1, j), and each PROBE(W(1, j)) = FALSE increases the lower bound to W(1, j). Clearly, we will have (LB1 = W(1, i1−1), UB1 = W(1, i1)) at the end of this search for processor P1. Now consider the sequence of probes of the form PROBE(W(i1, j)) performed for processor P2 to find the smallest index j = i2 such that PROBE(W(i1, j)) = TRUE. Our key observation is that the partition Πt = ⟨0, t1, t2, …, tP−1, N⟩ to be constructed by any PROBE(Bt) with LB1 < Bt = W(i1, j) < UB1 will satisfy t1 = i1 − 1, since W(1, i1−1) < Bt < W(1, i1). Hence, probes with LB1 < Bt < UB1 for processor P2 can be restricted to be performed on W^(i1,N), where W^(i1,N) denotes the (N−i1+1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T^(Pp)_i denote the CCP subproblem of (P−p)-way partitioning of the (N−i+1)th suffix T(i, N) = ⟨ti, ti+1, …, tN⟩ of the task chain T onto the (P−p)th suffix P(p, P) = ⟨Pp+1, Pp+2, …, PP⟩ of the processor chain P. Once the index i1 for processor P1 is computed, the optimal bottleneck value Bopt can be defined by either W(1, i1) or the bottleneck value of an optimal (P−1)-way partitioning of the suffix subchain T(i1, N). That is, Bopt = min{B1 = W(1, i1), C(P^(P1)_{i1})}. Proceeding in this way, once the indices ⟨i1, i2, …, ip⟩ for the first p processors ⟨P1, P2, …, Pp⟩ are determined, Bopt = min{min_{1≤b≤p}{Bb = W(i(b−1), ib)}, C(P^(Pp)_{ip})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i(b−1), ib is found in the inner while-loop by conducting probes on W^(i(b−1),N) to compute Bb = W(i(b−1), ib). The dynamic bounds on the separator indices are exploited in two distinct ways, according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i(b−1), the binary search for ib over all subchain weights of the form W(i(b−1)+1, j) for i(b−1) < j ≤ N is restricted to the W(i(b−1)+1, j) values with SLb ≤ j ≤ SHb.

Under model M, the complexity of this algorithm is O(N + P log N + wmax(P log P + P log(wmax/wavg))) for integer task weights, because the number of probes cannot exceed wmax, since there are at most wmax distinct bound values in the range [B*, BRB]. For noninteger task weights, the complexity can be given as O(N + P log N + wmax(P log P)² + wmax P² log P log(wmax/wavg)), since the algorithm makes O(wmax P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and computing the separator-index bounds SL and SH according to Corollary 4.5.
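A minimal recursive C sketch of the underlying recursion Bopt = min{W(1, i1), C(suffix)} with plain, unrestricted probes (our names; the candidate values W(from+1, j) are monotone in j, so binary search applies):

```c
#include <stddef.h>

/* Greedy feasibility probe on the subchain of tasks from+1..N. */
static int can_do(const double *W, size_t from, size_t N, int parts, double B)
{
    size_t pos = from;
    for (int p = 0; p < parts && pos < N; p++) {
        size_t j = pos;
        while (j < N && W[j + 1] - W[pos] <= B) j++;
        if (j == pos) return 0;               /* one task alone exceeds B */
        pos = j;
    }
    return pos == N;
}

/* Optimal bottleneck of partitioning tasks from+1..N into `parts` parts:
 * the first part ends at the smallest feasible index, or earlier (recurse). */
double nicol(const double *W, size_t from, size_t N, int parts)
{
    if (parts == 1) return W[N] - W[from];
    size_t lo = from + 1, hi = N;
    while (lo < hi) {                         /* smallest feasible W(from+1, j) */
        size_t mid = (lo + hi) / 2;
        if (can_do(W, from, N, parts, W[mid] - W[from])) hi = mid;
        else lo = mid + 1;
    }
    double b1 = W[lo] - W[from];              /* first part ends exactly at lo */
    double rest = nicol(W, lo - 1, N, parts - 1);  /* or at lo-1: recurse */
    return b1 < rest ? b1 : rest;
}
```

The divide-and-conquer improvement of Fig. 10 keeps this structure but performs each binary search on the restricted suffix and with RPROBE.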

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N×N sparse matrix and y and x are N-vectors. In RS, processor Pp owns the pth row stripe A_p^r of A and is responsible for computing yp = A_p^r x, where yp is the pth stripe of vector y. In CS, processor Pp owns the pth column stripe A_p^c of A and is responsible for computing y^p = A_p^c x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme, to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task ti ∈ T corresponds to the


atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task ti ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + xi a*i, where a*i is the ith column of A. Each nonzero entry in a row and column of A incurs a multiply-and-add operation; thus, the computational load wi of task ti is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements, due to the irregular mapping table, and to an extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = Wtot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N+1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight Wij can be efficiently computed as Wij = ROW[j+1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS, with rows and columns interchanged; thus, Wij is computed as Wij = COL[j+1] − COL[i].
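For instance, with a 0-based CRS row-pointer array (our indexing convention, not the paper's 1-based formula), the subchain weight of rows i..j is a difference of two pointer entries:

```c
/* Number of nonzeros in rows i..j of a CRS matrix: ROW has N + 1 entries,
 * ROW[0] = 0 and ROW[N] = NZ, so no prefix-sum pass is ever needed. */
long subchain_nnz(const long *ROW, long i, long j)
{
    return ROW[j + 1] - ROW[i];   /* O(1) subchain weight W(i, j) */
}
```

For a 3-row matrix whose rows hold 2, 1, and 3 nonzeros, ROW = {0, 2, 3, 6}, and rows 1..2 weigh 1 + 3 = 4.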

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight wi of the atomic task ti representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight Wij as Wij = COL[j+1] − COL[i] + 3(j−i+1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for the visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (the screen) is a 2D domain containing pixels, from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve, mapping the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array, because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, which represent the results of computational fluid dynamics simulations commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedrons. Triangular faces of tetrahedrons constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 x 256, 512 x 512, and 1024 x 1024 superimposed on a screen of resolution 1024 x 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because the zero-weight tasks can be compressed to retain only nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N-P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in Figs. 5(a) and (b), respectively.

Table 2
Properties of the test set

Sparse-matrix dataset:

Name      No. of    Workload                                  SpMxV ex.
          tasks N   Total Wtot   wavg     wmin   wmax         time (ms)
NL          7 039     105 089    14.93    1       361          22.55
cre-d       8 926     372 266    41.71    1       845          72.20
CQ9         9 278     221 590    23.88    1       702          45.90
ken-11     14 694      82 454     5.61    2       243          19.65
mod2       34 774     604 910    17.40    1       941         124.05
world      34 506     582 064    16.87    1       972         119.45
(wavg, wmin, wmax are per row/col, i.e., per task)

Direct volume rendering (DVR) dataset:

Name        No. of    Workload
            tasks N   Total Wtot   wavg     wmin     wmax
blunt256     17 303      303 K     17.54    0.020    1590.97
blunt512     93 231      314 K      3.36    0.004     661.50
blunt1024   372 824      352 K      0.94    0.001     411.04
post256      19 653      495 K     25.19    0.077    3245.50
post512     134 950      569 K      4.22    0.015    1092.00
post1024    539 994      802 K      1.49    0.004    1546.78
(wavg, wmin, wmax are per task)

Abbreviations ending with "+" are used to represent our improved versions of these algorithms; that is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms, with ε = 1, for the integer-valued workload arrays in the SpM dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms there due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of the heuristics and the exact algorithms. In this table, percent load-imbalance values are computed as 100 x (B - B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between the exact algorithms and the heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e., (1/N) sum_{p=1}^{P-1} ΔS_p, where ΔS_p = SH_p - SL_p + 1. For each CCP instance, (N - P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1 despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N - P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range size below N for each CCP instance with P <= 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index-bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. Column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns for the SpM dataset shows that using B_RB instead of Wtot for the upper bound on the bottleneck values considerably reduces the number of probes.

Table 3
Percent load-imbalance values

Sparse-matrix dataset:

Name     P      H1      H2      RB     OPT
NL       16      2.60    2.44    1.20   0.35
         32      5.02    5.75    3.44   0.95
         64      8.95    9.01    5.60   2.37
         128    33.13   27.16   22.78   4.99
         256    69.55   69.55   60.78  14.25
cre-d    16      2.27    0.98    0.53   0.45
         32      4.19    4.42    3.74   1.03
         64      7.12    4.92    4.34   1.73
         128    25.57   18.73   16.70   2.88
         256    37.54   26.81   35.20  10.95
CQ9      16      1.85    1.85    0.58   0.58
         32      5.65    2.88    2.24   0.90
         64     13.25   11.49    7.64   1.43
         128    33.96   32.22   22.34   3.51
         256    58.62   58.62   58.62  14.72
ken-11   16      3.74    2.01    0.98   0.21
         32      3.74    4.67    3.74   1.18
         64     13.17   13.17   13.17   1.29
         128    13.17   16.89   13.17   6.80
         256    50.99   50.99   50.99   7.11
mod2     16      0.06    0.06    0.06   0.03
         32      0.19    0.19    0.19   0.07
         64      7.42    2.72    2.18   0.18
         128    16.15    6.29    2.46   0.41
         256    19.47   19.47   18.92   1.23
world    16      0.27    0.18    0.09   0.04
         32      1.03    0.37    0.27   0.08
         64      4.73    4.73    4.73   0.28
         128     6.37    6.37    6.37   0.76
         256    27.99   27.41   27.41   1.11

DVR dataset:

Name       P      H1      H2      RB     OPT
blunt256   16      4.60    1.20    0.49   0.34
           32      6.93    2.61    1.94   1.12
           64     14.52    9.44    9.44   2.31
           128    38.25   24.39   16.67   4.82
           256    96.72   37.03   34.21  34.21
blunt512   16      0.95    0.98    0.98   0.16
           32      1.38    1.38    1.18   0.33
           64      2.87    2.69    1.66   0.53
           128     5.62    8.45    4.62   0.97
           256    14.18   14.33    9.34   2.28
blunt1024  16      0.94    0.57    0.95   0.10
           32      1.89    1.21    0.97   0.14
           64      4.99    2.16    1.44   0.26
           128    10.06    4.25    2.47   0.57
           256    19.65    9.68    9.68   0.98
post256    16      1.10    1.43    0.76   0.56
           32      3.23    3.98    3.23   1.11
           64     17.28   11.04   10.90   3.10
           128    45.35   29.09   29.09   8.29
           256    67.86   67.86   67.86  67.86
post512    16      0.49    1.25    0.33   0.33
           32      0.94    1.61    0.90   0.58
           64      4.85    5.33    4.85   0.94
           128    18.03   14.55   10.15   1.72
           256    30.03   25.29   25.29   3.73
post1024   16      0.70    0.70    0.53   0.20
           32      1.49    1.49    1.41   0.54
           64      2.85    2.85    1.49   0.91
           128    13.15   11.79    9.10   1.11
           256    40.50   13.37   14.19   2.54

Averages over P (SpM dataset):

P       H1      H2      RB     OPT
16       1.76    1.15    0.74   0.36
32       3.99    3.39    2.88   0.76
64       9.01    6.78    5.69   1.43
128     19.97   17.66   14.09   3.36
256     43.89   38.91   36.04   9.18

Averages over P (DVR dataset):

P       H1      H2      RB     OPT
16       1.44    1.01    1.04   0.28
32       2.64    2.05    1.61   0.64
64       7.89    5.58    4.96   1.31
128     21.74   15.42   12.02   2.91
256     44.82   27.93   26.76  18.60

Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in msecs) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification so that improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as an exact algorithm for general workload arrays, is listed separately in Table 7.

Table 4
Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset:

Name     P     (N-P)P     L1       L3      C5
NL       16     15.97      0.32     0.13    0.036
         32     31.86      1.23     0.58    0.198
         64     63.43      4.97     2.24    0.964
         128   125.69     19.81    19.60    4.305
         256   246.73     77.96    82.83   22.942
cre-d    16     15.97      0.16     0.04    0.019
         32     31.89      0.76     0.90    0.137
         64     63.55      2.89     1.85    0.610
         128   126.18     11.59    15.12    2.833
         256   248.69     46.66    57.54   16.510
CQ9      16     15.97      0.30     0.03    0.023
         32     31.89      1.21     0.34    0.187
         64     63.57      4.84     2.96    0.737
         128   126.25     19.20    18.66    3.121
         256   248.96     74.84    86.80   23.432
ken-11   16     15.98      0.28     0.11    0.024
         32     31.93      1.10     1.13    0.262
         64     63.73      4.42     7.34    0.655
         128   126.89     17.60    14.56    4.865
         256   251.56     69.18    85.44   13.076
mod2     16     15.99      0.16     0.00    0.002
         32     31.97      0.55     0.04    0.013
         64     63.88      2.14     1.22    0.109
         128   127.53      8.62     2.59    0.411
         256   254.12     34.49    39.02    2.938
world    16     15.99      0.15     0.01    0.004
         32     31.97      0.59     0.06    0.020
         64     63.88      2.30     2.81    0.158
         128   127.53      9.23     7.19    0.850
         256   254.11     36.97    53.15    2.635

DVR dataset:

Name       P     (N-P)P     L1       L3      C5
blunt256   16     15.99      0.38     0.07    0.003
           32     31.94      1.22     0.30    0.144
           64     63.77      5.10     3.81    0.791
           128   127.06     21.66    12.88    2.784
           256   252.23    101.68    50.48   10.098
blunt512   16     16.00      0.09     0.17    0.006
           32     31.99      0.35     0.26    0.043
           64     63.96      1.64     0.69    0.144
           128   127.83      6.65     4.59    0.571
           256   255.30     28.05    16.71    2.262
blunt1024  16     16.00      0.05     0.09    0.010
           32     32.00      0.18     0.18    0.018
           64     63.99      0.77     0.74    0.068
           128   127.96      3.43     2.53    0.300
           256   255.82     13.92    20.36    1.648
post256    16     15.99      0.35     0.04    0.021
           32     31.95      1.50     0.52    0.162
           64     63.79      6.12     4.26    0.932
           128   127.17     26.48    23.68    4.279
           256   252.68    124.76    90.48   16.485
post512    16     16.00      0.13     0.02    0.003
           32     31.99      0.56     0.14    0.063
           64     63.97      2.33     2.41    0.413
           128   127.88      9.33    10.15    1.714
           256   255.52     37.08    46.28    6.515
post1024   16     16.00      0.17     0.07    0.020
           32     32.00      0.54     0.27    0.087
           64     63.99      2.19     0.61    0.210
           128   127.97      8.77     9.48    0.918
           256   255.88     34.97    27.28    4.479

Averages over P (SpM dataset):

P     (N-P)P     L1       L3      C5
16     15.97      0.21     0.06    0.018
32     31.88      0.86     0.68    0.136
64     63.52      3.43     2.63    0.516
128   126.07     13.68    12.74    2.731
256   248.26     54.28    57.55   14.089

Averages over P (DVR dataset):

P     (N-P)P     L1       L3      C5
16     16.00      0.20     0.08    0.011
32     31.98      0.73     0.28    0.096
64     63.91      3.03     2.09    0.426
128   127.65     12.72    10.55    1.428
256   254.57     56.74    41.93    6.915

L1, L3, and C5 denote the separator-index ranges obtained according to Lemmas 4.1 and 4.3 and Corollary 4.5, respectively.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that DP+ is competitive with the parametric-search algorithms.

Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset:

Name     P      NC-    NC    NC+   εBS   εBS+   EBS
NL       16      177    21     7    17     6      7
         32      352    19     7    17     7      7
         64      720    27     5    16     7      6
         128    1450    40    14    17     8      8
         256    2824    51    14    16     8      8
cre-d    16      190    20     6    19     7      5
         32      393    25    11    19     9      8
         64      790    24    10    19     8      6
         128    1579    27    10    19     9      9
         256    3183    37    13    18     9      9
CQ9      16      182    19     3    18     6      5
         32      373    22     8    18     8      7
         64      740    29    12    18     8      8
         128    1492    40    12    17     8      8
         256    2971    50    14    18     9      9
ken-11   16      185    21     6    17     6      6
         32      364    20     6    17     6      6
         64      721    48     9    17     7      7
         128    1402    61     8    16     7      7
         256    2783    96     7    17     8      8
mod2     16      210    20     6    20     5      4
         32      432    26     8    19     5      5
         64      867    24     6    19     8      8
         128    1727    38     7    20     7      7
         256    3444    47     9    20     9      9
world    16      211    22     6    19     5      4
         32      424    26     5    19     6      6
         64      865    30    12    19     9      9
         128    1730    44    10    19     8      8
         256    3441    44    11    19    10     10

DVR dataset:

Name       P      NC-    NC    NC+   εBS+&BID   EBS
blunt256   16      202    18     6    16 + 1      9
           32      407    19     7    19 + 1     10
           64      835    18    10    21 + 1     12
           128    1686    25    15    22 + 1     13
           256    3556   203    62    23 + 1     11
blunt512   16      245    20     8    17 + 1      9
           32      502    24    13    18 + 1     10
           64     1003    23    11    18 + 1     12
           128    2023    26    12    20 + 1     13
           256    4051    23    12    21 + 1     13
blunt1024  16      271    21    10    16 + 1     13
           32      557    24    12    18 + 1     12
           64     1127    21    11    18 + 1     12
           128    2276    28    13    19 + 1     13
           256    4564    27    16    21 + 1     15
post256    16      194    22     9    18 + 1      7
           32      409    19     8    20 + 1     12
           64      823    17    11    22 + 1     12
           128    1653    19    13    22 + 1     14
           256    3345   191   102    23 + 1     13
post512    16      235    24     9    16 + 1     10
           32      485    23    12    18 + 1     11
           64      975    22    13    20 + 1     14
           128    1975    25    17    21 + 1     14
           256    3947    29    20    22 + 1     15
post1024   16      261    27    10    17 + 1     13
           32      538    23    13    19 + 1     14
           64     1090    22    12    17 + 1     14
           128    2201    25    15    21 + 1     16
           256    4428    28    16    22 + 1     16

Averages over P (SpM dataset):

P      NC-    NC     NC+    εBS    εBS+   EBS
16      193   20.5    5.7   18.5    5.8    5.2
32      390   23.0    7.5   18.3    6.8    6.5
64      784   30.3    9.0   18.0    7.8    7.3
128    1564   41.7   10.2   18.0    7.8    7.8
256    3108   54.2   11.3   18.0    8.8    8.8

Averages over P (DVR dataset):

P      NC-    NC     NC+    εBS+&BID   EBS
16      235   22.0    8.7   16.7 + 1   10.2
32      483   22.0   10.8   18.7 + 1   11.5
64      976   20.5   11.3   19.3 + 1   12.7
128    1969   24.7   14.2   20.8 + 1   13.8
256    3982   83.5   38.0   22.0 + 1   13.8

For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing number of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively.

Table 6
Partitioning times for the sparse-matrix dataset, as percents of SpMxV times

              Heuristics           Existing exact              Proposed exact
Name    P     H1    H2    RB      DP    MS    εBS     NC      DP+     MS+     εBS+   NC+    BID
NL      16    0.09  0.09  0.09     93   119   0.93    1.20    0.40    0.44    0.40   0.40   0.22
        32    0.18  0.18  0.13    177   194   1.77    2.17    1.73    1.73    0.80   0.80   0.44
        64    0.35  0.35  0.31    367   307   3.15    5.85    5.81    4.12    1.86   1.77   1.82
        128   0.89  0.84  0.71    748   485   6.30   17.12   19.87   26.92    4.26   7.32   4.21
        256   1.51  1.55  1.37   1461   757  11.09   40.31   91.80   96.41    9.80  15.70  21.06
cre-d   16    0.03  0.01  0.01     33    36   0.33    0.37    0.12    0.06    0.10   0.11   0.03
        32    0.04  0.06  0.06     71    61   0.65    0.93    0.47    1.20    0.29   0.33   0.10
        64    0.12  0.12  0.08    147    98   1.22    1.69    1.66    1.83    0.61   0.71   0.21
        128   0.28  0.29  0.14    283   150   2.41    3.59    5.47   10.94    1.68   1.68   0.61
        256   0.53  0.54  0.25    571   221   4.20   10.22   27.05   26.02    3.30   4.46   1.20
CQ9     16    0.04  0.04  0.04     59    73   0.48    0.59    0.20    0.13    0.15   0.13   0.17
        32    0.09  0.09  0.07    127   120   0.94    1.31    0.94    0.72    0.41   0.41   0.28
        64    0.20  0.17  0.15    248   195   1.85    3.18    3.01    4.47    1.00   1.44   0.68
        128   0.44  0.44  0.31    485   303   3.18    8.52    9.65   18.82    2.44   3.05   2.51
        256   0.92  0.92  0.63    982   469   6.51   20.33   69.56    7.22    5.32   7.45  14.92
ken-11  16    0.10  0.10  0.10    238   344   1.27    1.88    0.56    0.61    0.46   0.41   0.25
        32    0.20  0.20  0.15    501   522   2.29    3.10    3.82    4.78    1.22   1.17   1.63
        64    0.46  0.46  0.36    993   778   4.48   13.08    9.41   24.78    3.10   3.46   2.29
        128   1.17  1.17  0.71   1876  1139   9.21   39.59   50.13   50.03    6.41   6.62   1.74
        256   2.14  2.14  1.42   3827  1706  17.76  119.13  129.01  241.48   14.91  14.50  29.11
mod2    16    0.02  0.02  0.01     92   119   0.27    0.34    0.07    0.05    0.06   0.06   0.03
        32    0.04  0.04  0.02    188   192   0.51    0.78    0.19    0.18    0.16   0.15   0.07
        64    0.11  0.10  0.06    378   307   1.04    1.91    0.86    2.18    0.51   0.55   0.23
        128   0.24  0.23  0.11    751   482   2.28    5.92    2.61    3.72    1.06   1.23   0.53
        256   0.48  0.48  0.23   1486   756   4.26   11.14   12.11   50.04    2.91   3.43   3.66
world   16    0.02  0.02  0.02     99   122   0.29    0.33    0.08    0.05    0.07   0.08   0.03
        32    0.04  0.04  0.04    198   196   0.59    0.88    0.26    0.21    0.18   0.19   0.08
        64    0.12  0.11  0.09    385   306   1.16    1.70    1.09    4.84    0.64   0.54   0.35
        128   0.24  0.23  0.19    769   496   2.19    5.32    4.48   10.15    1.31   1.23   1.77
        256   0.47  0.47  0.36   1513   764   4.06   11.83   11.54   69.99    3.46   3.50   2.48

Averages over P:

P     H1    H2    RB      DP    MS    εBS     NC      DP+     MS+     εBS+   NC+    BID
16    0.05  0.05  0.05    102   136   0.60    0.78    0.24    0.22    0.21   0.20   0.12
32    0.10  0.10  0.08    210   214   1.12    1.53    1.23    1.47    0.51   0.51   0.43
64    0.23  0.22  0.18    420   332   2.15    4.57    3.64    7.04    1.29   1.41   0.93
128   0.54  0.53  0.36    819   509   4.26   13.34   15.37    2.01    2.86   3.52   4.51
256   1.01  1.01  0.71   1640   779   7.98   35.49   56.84   92.69    6.62   8.17  12.07

For the SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial settings of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small-to-moderate P values (i.e., P <= 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P >= 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P <= 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time for P <= 64 on average.

Table 7
Partitioning times (in msecs) for the DVR dataset

                   Prefix  Heuristics          Existing exact      Proposed exact
Name       P       sum     H1    H2    RB      DP     MS     NC      DP+     MS+       NC+    EBS    BID
blunt256   16      1.95    0.02  0.02  0.03      68     49   0.36    0.21     0.53     0.10   0.14    0.03
           32              0.05  0.05  0.05     141     77   0.77    0.76     0.76     0.21   0.27    0.24
           64              0.11  0.12  0.09     286    134   1.39    3.86     8.72     0.64   0.71    0.78
           128             0.24  0.25  0.20     581    206   3.66   12.37    29.95     1.78   1.84    2.33
           256             0.47  0.50  0.37    1139    296  52.53   42.89     0.78    13.96   3.11   56.69
blunt512   16     13.45    0.03  0.03  0.03     356    200   0.55    0.25     0.91     0.14   0.16    0.09
           32              0.07  0.07  0.07     792    353   1.31    1.23     2.13     0.44   0.43    0.22
           64              0.18  0.19  0.14    1688    593   2.65    3.68     9.28     0.94   1.10    0.45
           128             0.39  0.40  0.28    3469    979   5.84   15.01    46.72     2.29   2.63    1.65
           256             0.74  0.75  0.54    7040   1637   9.70   57.50   164.49     5.59   5.95   10.72
blunt1024  16     59.12    0.03  0.03  0.03    1455    780   0.75    0.93     4.77     0.25   0.28    0.19
           32              0.12  0.11  0.08    3251   1432   1.81    2.02     8.92     0.60   0.62    0.31
           64              0.27  0.27  0.19    6976   2353   3.50    7.35    22.28     1.27   1.37    1.39
           128             0.50  0.51  0.37   14337   3911   9.14   33.01    69.38     3.36   3.56   16.21
           256             0.97  1.00  0.68   29417   6567  16.59  180.09   538.10     8.48   8.98   16.29
post256    16      2.16    0.02  0.03  0.03      79     61   0.46    0.18     0.13     0.10   0.11    0.24
           32              0.05  0.05  0.05     157    100   0.77    1.05     1.74     0.26   0.31    0.76
           64              0.10  0.11  0.10     336    167   1.37    5.26     9.72     0.61   0.70    4.67
           128             0.25  0.26  0.18     672    278   2.90   22.94    55.10     1.65   1.77   23.96
           256             0.47  0.48  0.37    1371    338  45.16   84.78     0.77    13.04   3.62  299.32
post512    16     20.55    0.02  0.02  0.03     612    454   0.63    0.20     0.44     0.14   0.16    1.21
           32              0.07  0.07  0.07    1254    699   1.30    2.49     1.95     0.38   0.42    3.36
           64              0.18  0.18  0.13    2559   1139   2.58   16.93    38.73     0.97   1.07    8.53
           128             0.38  0.39  0.28    5221   1936   5.84   68.25   156.15     2.27   2.44   31.54
           256             0.70  0.73  0.53   10683   3354  12.67  256.48   726.54     5.44   5.39  140.88
post1024   16     69.09    0.03  0.04  0.03    2446   1863   0.91    2.88     5.73     0.17   0.21    2.63
           32              0.09  0.08  0.08    5056   2917   1.61   13.57    17.37     0.51   0.58   12.73
           64              0.25  0.26  0.17   10340   4626   3.56   35.05    33.25     1.35   1.47   32.53
           128             0.51  0.53  0.36   21102   7838   7.90  156.34   546.92     2.95   3.28   76.96
           256             0.95  0.98  0.67   43652  13770  16.30  764.59  1451.87     7.55   7.02  300.70

Averages normalized with respect to RB times (prefix-sum time not included):

P     H1    H2    RB       DP      MS      NC      DP+      MS+     NC+     EBS     BID
16    0.83  0.94  1.00    27863   18919   20.33    25.83    69.50    5.00    5.89   24.39
32    1.10  1.06  1.00    23169   12156   18.47    47.37    72.82    5.83    6.46   39.02
64    1.30  1.35  1.00    22636    9291   17.88    82.81   145.19    7.00    7.81   53.81
128   1.35  1.39  1.00    22506    7553   20.46   168.36   481.19    8.60    9.31   86.81
256   1.35  1.39  1.00    24731    6880   59.10   390.25   772.99   19.55   10.51  286.77

Averages normalized with respect to RB times (prefix-sum time included):

P     H1    H2    RB      DP     MS     NC     DP+     MS+    NC+    EBS    BID
16    1.00  1.00  1.00     32     23    1.08    1.04    1.09   1.01   1.02   1.03
32    1.00  1.00  1.00     66     36    1.15    1.21    1.29   1.04   1.05   1.13
64    1.00  1.00  1.00    135     58    1.27    1.97    2.90   1.11   1.12   1.55
128   1.01  1.01  1.00    275     94    1.62    4.75   10.53   1.28   1.30   3.25
256   1.02  1.02  1.00    528    142    7.98   14.64   13.71   2.95   1.55  26.73

For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130-149.
[2] C. Aykanat, F. Özgüner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641-645.
[3] C. Aykanat, F. Özgüner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554-1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48-57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673-693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625-I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237-267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141-148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769-771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048-2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123-127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341-361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170-175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040-1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51-93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235-249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550-564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23-32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75-84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119-134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295-306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322-1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288-299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865-881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068-1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256-266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78-86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high-performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural-network algorithms, high-performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed a member of IFIP Working Group 10.3 (Concurrent Systems).


Table 1
Summary of important abbreviations and symbols

Notation    Explanation
N           number of tasks
P           number of processors
P           processor chain
P_i         ith processor in the processor chain
T           task chain, i.e., T = ⟨t_1, t_2, …, t_N⟩
t_i         ith task in the task chain
T_{i,j}     subchain of tasks starting from t_i up to t_j, i.e., T_{i,j} = ⟨t_i, t_{i+1}, …, t_j⟩
T^p_i       subproblem of p-way partitioning of the first i tasks in the task chain T
w_i         computational load of task t_i
w_max       maximum computational load among all tasks
w_avg       average computational load of all tasks
W_{i,j}     total computational load of task subchain T_{i,j}
W_tot       total computational load
W[i]        total weight of the first i tasks
Π^p_i       partition of the first i tasks in the task chain onto the first p processors in the processor chain
L_p         load of the pth processor in a partition
UB          upper bound on the value of an optimal solution
LB          lower bound on the value of an optimal solution
B*          ideal bottleneck value, achieved when all processors have equal load
B^p_i       optimal solution value for p-way partitioning of the first i tasks
s_p         index of the last task assigned to the pth processor
SL_p        lowest position for the pth separator index in an optimal solution
SH_p        highest position for the pth separator index in an optimal solution


subchains to contiguous processors. Hence, a P-way chain-partition Π^P_N of a task chain T with N tasks onto a processor chain P with P processors is described by a sequence Π^P_N = ⟨s_0, s_1, s_2, …, s_P⟩ of P+1 separator indices, where s_0 = 0 ≤ s_1 ≤ … ≤ s_P = N. Here, s_p denotes the index of the last task of the pth part, so that processor P_p gets the subchain T_{s_{p−1}+1, s_p} with load L_p = W_{s_{p−1}+1, s_p}.

The cost C(Π) of a partition Π is determined by the maximum processor execution time among all processors, i.e., C(Π) = B = max_{1≤p≤P} L_p. This B value of a partition is called its bottleneck value, and the processor/part defining it is called the bottleneck processor/part. The CCP problem can be defined as finding a mapping Π_opt that minimizes the bottleneck value B_opt = C(Π_opt).

3. Previous work

Each CCP algorithm discussed in this section and Section 4 involves an initial prefix-sum operation on the task-weight array W to enhance the efficiency of subsequent subchain-weight computations. The prefix-sum operation replaces the ith entry W[i] with the sum of the first i entries (Σ_{h=1}^{i} w_h), so that the computational load W_{i,j} of a subchain T_{i,j} can be determined efficiently as W[j] − W[i−1] in O(1) time. In our discussions, W is used to refer to the prefix-summed W array, and the Θ(N) cost of this initial prefix-sum operation is included in the complexity analysis. The presentations focus only on finding the bottleneck value B_opt, because a corresponding optimal mapping can be constructed easily by making a PROBE(B_opt) call, as discussed in Section 3.4.
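As a concrete illustration of this preprocessing step, the following sketch (ours, not the paper's pseudocode) builds the prefix-summed W array and answers subchain-load queries in O(1) time:

```python
# Sketch: prefix-sum preprocessing so that any subchain load
# W_{i,j} = w_i + ... + w_j becomes an O(1) lookup (tasks are 1-indexed).

def prefix_sums(w):
    """Return W with W[0] = 0 and W[i] = w_1 + ... + w_i."""
    W = [0] * (len(w) + 1)
    for i, wi in enumerate(w, start=1):
        W[i] = W[i - 1] + wi
    return W

def subchain_load(W, i, j):
    """Load of subchain T_{i,j} in O(1): W[j] - W[i-1]."""
    return W[j] - W[i - 1]

w = [4, 1, 3, 2, 5]            # example task weights w_1..w_5
W = prefix_sums(w)             # [0, 4, 5, 8, 10, 15]
print(subchain_load(W, 2, 4))  # w_2 + w_3 + w_4 = 6
```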

3.1. Heuristics

The most commonly used partitioning heuristic is based on recursive bisection (RB). RB achieves P-way partitioning through log P bisection levels, where P is a power of 2. At each bisection step in a level, the current chain is divided evenly into two subchains. Although an optimal division can easily be achieved at every bisection step, the sequence of optimal bisections may lead to poor load balancing. RB can be efficiently implemented in O(N + P log N) time by first performing a prefix-sum operation on the workload array W, with complexity O(N), and then making P−1 binary searches in the prefix-summed W array, each with complexity O(log N).

Miguet and Pierson [24] proposed two other heuristics. The first heuristic (H1) computes the separator values such that s_p is the largest index with W_{1,s_p} ≤ p B*, where B* = W_tot/P is the ideal bottleneck value and W_tot = Σ_{i=1}^{N} w_i denotes the sum of all task weights. The second heuristic (H2) further refines the separator indices by incrementing each s_p value found by H1 if (W_{1,s_p+1} − p B*) < (p B* − W_{1,s_p}). These two heuristics can also be implemented in O(N + P log N) time by performing P−1 binary searches in the prefix-summed W array. Miguet and Pierson [24] have already proved upper bounds on the bottleneck values of the partitions found by H1 and H2, namely B_H1, B_H2 < B* + w_max, where w_max = max_{1≤i≤N} w_i denotes the maximum task weight. The following lemma establishes a similar bound for the RB heuristic.

Lemma 3.1. Let Π_RB = ⟨s_0, s_1, …, s_P⟩ be an RB solution to a CCP problem (W, N, P). Then B_RB = C(Π_RB) satisfies B_RB ≤ B* + w_max (P−1)/P.

Proof. Consider the first bisection step. There exists a pivot index 1 ≤ i_1 ≤ N such that both sides weigh less than W_tot/2 without the i_1th task and more than W_tot/2 with it. That is,

W_{1,i_1−1}, W_{i_1+1,N} ≤ W_tot/2 ≤ W_{1,i_1}, W_{i_1,N}.

The worst case for RB occurs when w_{i_1} = w_max and W_{1,i_1−1} = W_{i_1+1,N} = (W_tot − w_max)/2. Without loss of generality, assume that t_{i_1} is assigned to the left part, so that s_{P/2} = i_1 and W_{1,s_{P/2}} = W_tot/2 + w_max/2. In a similar worst-case bisection of T_{1,s_{P/2}}, there exists an index i_2 such that w_{i_2} = w_max and W_{1,i_2−1} = W_{i_2+1,s_{P/2}} = (W_tot − w_max)/4, and t_{i_2} is assigned to the left part, so that s_{P/4} = i_2 and W_{1,s_{P/4}} = (W_tot − w_max)/4 + w_max = W_tot/4 + (3/4) w_max. For a sequence of log P such worst-case bisection steps on the left parts, processor P_1 will be the bottleneck processor with load B_RB = W_{1,s_1} = W_tot/P + w_max (P−1)/P. □

Fig. 1. O((N−P)P)-time dynamic-programming algorithm proposed by Choi and Narahari [6] and Olstad and Manne [30].

3.2. Dynamic programming

The overlapping subproblem space can be defined as T^p_i, for p = 1, 2, …, P and i = p, p+1, …, N−P+p, where T^p_i denotes a p-way CCP of the prefix task subchain T_{1,i} = ⟨t_1, t_2, …, t_i⟩ onto the prefix processor subchain P_{1,p} = ⟨P_1, P_2, …, P_p⟩. Notice that index i is restricted to the range p ≤ i ≤ N−P+p because there is no merit in leaving a processor empty. From this subproblem-space definition, the optimal substructure property of the CCP problem can be shown by considering an optimal mapping Π^p_i = ⟨s_0, s_1, …, s_p = i⟩ with bottleneck value B^p_i for the CCP subproblem T^p_i. If the last processor is not the bottleneck processor in Π^p_i, then Π^{p−1}_{s_{p−1}} = ⟨s_0, s_1, …, s_{p−1}⟩ should be an optimal mapping for the subproblem T^{p−1}_{s_{p−1}}. Hence, the recursive definition for the bottleneck value of an optimal mapping is

B^p_i = min_{p−1 ≤ j < i} { max{ B^{p−1}_j, W_{j+1,i} } }.    (1)

In (1), searching for index j corresponds to searching for separator s_{p−1}, so that the remaining subchain T_{j+1,i} is assigned to the last processor P_p in an optimal mapping Π^p_i of T^p_i. The bottleneck value B^P_N of an optimal mapping can be computed using (1) in a bottom-up fashion, starting from B^1_i = W_{1,i} for i = 1, 2, …, N. An initial prefix-sum on the workload array W enables constant-time computation of subchain weights of the form W_{j+1,i} through W_{j+1,i} = W[i] − W[j]. Computing B^p_i using (1) takes O(N−p) time for each i and p, and thus the algorithm takes O((N−P)²P) time, since the number of distinct subproblems is (N−P+1)P.

Choi and Narahari [6] and Olstad and Manne [30] reduced the complexity of this scheme to O(NP) and O((N−P)P), respectively, by exploiting the following observations, which hold for positive task weights. For a fixed p in (1), the minimum index value j^p_i defining B^p_i cannot occur at a value less than the minimum index value j^p_{i−1} defining B^p_{i−1}, i.e., j^p_{i−1} ≤ j^p_i ≤ (i−1). Hence, the search for the optimal j^p_i can start from j^p_{i−1}. In (1), B^{p−1}_j for a fixed p is a nondecreasing function of j, and W_{j+1,i} for a fixed i is a decreasing function of j, reducing to 0 at j = i. Thus, two cases occur in the semi-closed interval [j^p_{i−1}, i) for j. If W_{j+1,i} > B^{p−1}_j initially, then these two functions intersect in [j^p_{i−1}, i). In this case, the search for j^p_i continues until W_{j*+1,i} ≤ B^{p−1}_{j*}, and then only j* and j*−1 are considered for setting j^p_i, with j^p_i = j* if B^{p−1}_{j*} ≤ W_{j*,i} and j^p_i = j*−1 otherwise. Note that this scheme automatically detects j^p_i = i−1 if W_{j+1,i} and B^{p−1}_j intersect in the open interval (i−1, i). However, if W_{j+1,i} ≤ B^{p−1}_j initially, then B^{p−1}_j lies above W_{j+1,i} in the closed interval [j^p_{i−1}, i]. In this case, the minimum value occurs at the first value of j, i.e., j^p_i = j^p_{i−1}. These improvements lead to an O((N−P)P)-time algorithm, since the computation of all B^p_i values for a fixed p makes O(N−P) references to already computed B^{p−1}_j values.

Fig. 1 displays a run-time efficient implementation of this O((N−P)P)-time DP algorithm, which avoids the explicit min-max operation required in (1). In Fig. 1, the B^p_i values are stored in a table whose entries are computed in row-major order.
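Since Fig. 1 is not reproduced here, the following is a hedged sketch (our code, not the figure itself) of a DP solver in the spirit of Eq. (1) with the monotone search-start optimization just described: for a fixed p, the search index j never moves left as i grows.

```python
# Our sketch of the DP recurrence (1): B^p_i = min_j max(B^{p-1}_j, W_{j+1,i}),
# advancing j monotonically while the bottleneck does not worsen.
def dp_partition(w, P):
    """Optimal bottleneck value for partitioning chain w into P parts."""
    N = len(w)
    W = [0] * (N + 1)                  # prefix sums: W[i] = w_1 + ... + w_i
    for i in range(1, N + 1):
        W[i] = W[i - 1] + w[i - 1]
    B = W[:]                           # row p = 1: B^1_i = W_{1,i}
    for p in range(2, P + 1):
        Bnext = [0] * (N + 1)
        j = p - 1                      # search start; never moves left
        for i in range(p, N - P + p + 1):
            # advance the separator while it does not worsen the bottleneck
            while j + 1 < i and \
                  max(B[j + 1], W[i] - W[j + 1]) <= max(B[j], W[i] - W[j]):
                j += 1
            Bnext[i] = max(B[j], W[i] - W[j])
        B = Bnext
    return B[N]

print(dp_partition([4, 1, 3, 2, 5], 2))   # 8  (parts {4,1,3} and {2,5})
```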

3.3. Iterative refinement

The algorithm proposed by Manne and Sørevik [23], referred to here as the MS algorithm, finds a sequence of nonoptimal partitions such that there is only one way each partition can be improved. For this purpose, they introduce the leftist partition (LP). Consider a partition Π such that P_p is the leftmost processor containing at least two tasks. Π is defined as an LP if increasing the load of any processor P_c that lies to the right of P_p, by augmenting the last task of P_{c−1} to P_c, makes P_c a bottleneck processor with a load ≥ C(Π). Let Π be an LP with bottleneck processor P_b and bottleneck value B. If P_b contains only one task, then Π is optimal. On the other hand, assume that P_b contains at least two tasks. The refinement step, shown by the inner while-loop in Fig. 2, tries to find a new LP of lower cost by successively removing the first task of P_p and augmenting it to P_{p−1}, for p = b, b−1, …, until L_p < B. Refinement fails when the while-loop proceeds until p = 1 with L_p ≥ B. Manne and Sørevik proved that a successful refinement of an LP gives a new LP, and that the LP is optimal if the refinement fails. They proposed using an initial LP in which the P−1 leftmost processors each have only one task and the last processor contains the remaining tasks. The MS algorithm moves each separator index at most N−P times, so that the total number of separator-index moves is O(P(N−P)). A max-heap is maintained on the processor loads to find a bottleneck processor at the beginning of each repeat-loop iteration. The cost of each separator-index move is O(log P), since it necessitates one decrease-key and one increase-key operation. Thus, the complexity of the MS algorithm is O(P(N−P) log P).

3.4. Parametric search

The parametric-search approach relies on repeated probing for a partition Π with a bottleneck value no greater than a given B value. Probe algorithms exploit the greedy-choice property for the existence and construction of Π. The greedy choice here is to minimize the remaining work after loading processor P_p subject to L_p ≤ B, for p = 1, …, P−1 in order. The PROBE(B) functions given in Fig. 3 exploit this greedy property as follows. PROBE finds the largest index s_1 such that W_{1,s_1} ≤ B and assigns subchain T_{1,s_1} to processor P_1 with load L_1 = W_{1,s_1}. Hence, the first task of the second processor is t_{s_1+1}. PROBE then similarly finds the largest index s_2 such that W_{s_1+1,s_2} ≤ B and assigns the subchain T_{s_1+1,s_2} to processor P_2. This process continues until either all tasks are assigned or all processors are exhausted; the former case indicates that B is feasible, the latter that it is infeasible.

Fig. 2. Iterative refinement algorithm proposed by Manne and Sørevik [23].

Fig. 3. (a) Standard probe algorithm with O(P log N) complexity; (b) O(P log(N/P))-time probe algorithm proposed by Han et al. [12].

Fig. 3(a) illustrates the standard probe algorithm. The indices s_1, s_2, …, s_{P−1} are efficiently found through binary search (BINSRCH) on the prefix-summed W array. In this figure, BINSRCH(W, i, N, Bsum) searches W in the index range [i, N] to compute the index i ≤ j ≤ N such that W[j] ≤ Bsum and W[j+1] > Bsum. The complexity of the standard probe algorithm is O(P log N). Han et al. [12] proposed an O(P log(N/P))-time probe algorithm (see Fig. 3(b)) exploiting the fact that P repeated binary searches are performed on the same W array with increasing search values. Their algorithm divides the chain into P subchains of equal length. At each probe, a linear search is performed on the weights of the last tasks of these P subchains to find the subchain that contains the search value, and then binary search is performed on the respective subchain of length N/P. Note that since the probe search values always increase, the linear search can be performed incrementally, that is, the search continues rightward from the last subchain searched, with O(P) total cost. This gives a total cost of O(P log(N/P)) for the P binary searches, and thus for the probe function.
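A minimal sketch of the standard left-to-right probe of Fig. 3(a), using binary search on the prefix-summed W array (our code, assuming positive task weights, so the prefix sums are strictly increasing):

```python
# Sketch of PROBE(B): greedily assign each processor the longest
# prefix of remaining tasks with load <= B.
import bisect

def probe(W, P, B):
    """True iff the chain can be split into P parts each of load <= B.
    W is the prefix-summed weight array with W[0] = 0."""
    s = 0
    for _ in range(P):
        # largest index j >= s with W[j] - W[s] <= B
        s = bisect.bisect_right(W, W[s] + B) - 1
    return s == len(W) - 1

W = [0, 4, 5, 8, 10, 15]               # prefix sums of w = [4, 1, 3, 2, 5]
print(probe(W, 2, 8), probe(W, 2, 7))  # True False
```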

3.4.1. Bisection as an approximation algorithm

Let f(B) be the binary-valued function with f(B) = 1 if PROBE(B) is true and f(B) = 0 if PROBE(B) is false. Clearly, f(B) is nondecreasing in B, and B_opt lies between LB = B* = W_tot/P and UB = W_tot. These observations are exploited in the bisection algorithm, leading to an efficient ε-approximate algorithm, where ε is the desired precision. The interval [W_tot/P, W_tot] is conceptually discretized into (W_tot − W_tot/P)/ε bottleneck values, and binary search is used in this range to find the minimum feasible bottleneck value B_opt. The bisection algorithm, illustrated in Fig. 4, performs O(log(W_tot/ε)) PROBE calls, and each PROBE call costs O(P log(N/P)). Hence, the bisection algorithm runs in O(N + P log(N/P) log(W_tot/ε)) time, where the O(N) cost comes from the initial prefix-sum operation on W. The performance of this algorithm deteriorates when log(W_tot/ε) is comparable with N.
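The bisection scheme can be sketched as follows (our illustration of the idea in Fig. 4, with a simple probe repeated so the snippet is self-contained):

```python
# Epsilon-approximate bisection over the bottleneck range [Wtot/P, Wtot].
import bisect

def probe(W, P, B):
    s = 0
    for _ in range(P):
        s = bisect.bisect_right(W, W[s] + B) - 1
    return s == len(W) - 1

def bisect_bottleneck(w, P, eps=1e-7):
    W = [0]
    for wi in w:
        W.append(W[-1] + wi)
    lo, hi = W[-1] / P, float(W[-1])   # LB = B*, UB = Wtot
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(W, P, mid):
            hi = mid                   # mid is feasible: search lower half
        else:
            lo = mid                   # mid is infeasible: search upper half
    return hi                          # within eps of B_opt

print(round(bisect_bottleneck([4, 1, 3, 2, 5], 2), 3))   # 8.0
```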

3.4.2. Nicol's algorithm

Nicol's algorithm [27] exploits the fact that any candidate B value is equal to the weight W_{i,j} of some subchain. A naive solution is to generate all subchain weights of the form W_{i,j}, sort them, and then use binary search to find the minimum W_{a,b} value for which PROBE(W_{a,b}) = TRUE. Nicol's algorithm efficiently searches for the earliest subchain weight W_{a,b} for which B_opt = W_{a,b}, by considering each processor in order as a candidate bottleneck processor in an optimal mapping. Let Π_opt be the optimal mapping constructed by the greedy PROBE(B_opt), and let processor P_b be the first bottleneck processor, with load B_opt, in Π_opt = ⟨s_0, s_1, …, s_b, …, s_P⟩. Under these assumptions, this greedy construction of Π_opt ensures that each processor P_p preceding P_b is loaded as much as possible, with L_p < B_opt, for p = 1, 2, …, b−1. Here, PROBE(L_p) = FALSE since L_p < B_opt, and PROBE(L_p + w_{s_p+1}) = TRUE since adding one more task to processor P_p increases its load to L_p + w_{s_p+1} > B_opt. Hence, if b = 1, then s_1 is equal to the smallest index i_1 such that PROBE(W_{1,i_1}) = TRUE, and B_opt = B_1 = W_{1,s_1}. However, if b > 1, then because of the greedy-choice property, P_1 should be loaded as much as possible without exceeding B_opt = B_b < B_1, which implies that s_1 = i_1 − 1 and hence L_1 = W_{1,i_1−1}. If b = 2, then s_2 is equal to the smallest index i_2 such that PROBE(W_{i_1,i_2}) = TRUE, and B_opt = B_2 = W_{i_1,i_2}. If b > 2, then s_2 = i_2 − 1. We iterate for b = 1, 2, …, P−1, computing i_b as the smallest index for which PROBE(W_{i_{b−1},i_b}) = TRUE and saving B_b = W_{i_{b−1},i_b}, with i_P = N. Finally, the optimal bottleneck value is selected as B_opt = min_{1≤b≤P} B_b.

Fig. 4. Bisection as an ε-approximation algorithm.

Fig. 5 illustrates Nicol's algorithm. As seen in this figure, given i_{b−1}, i_b is found by performing a binary search over all subchain weights of the form W_{i_{b−1},j}, for i_{b−1} ≤ j ≤ N, in the bth iteration of the for-loop. Hence, Nicol's algorithm performs O(log N) PROBE calls to find i_b at iteration b, and each probe call costs O(P log(N/P)). Thus, the cost of computing an individual B_b value is O(P log N log(N/P)). Since P−1 such B_b values are computed, the overall complexity of Nicol's algorithm is O(N + P² log N log(N/P)), where the O(N) cost comes from the initial prefix-sum operation on W.

Two possible implementations of Nicol's algorithm are presented in Fig. 5. Fig. 5(a) illustrates a straightforward implementation, whereas Fig. 5(b) illustrates a careful implementation that maintains and uses information from previous probes to answer some probes without calling the PROBE function. As seen in Fig. 5(b), this information is efficiently maintained as an undetermined bottleneck-value range (LB, UB), which is dynamically refined after each probe. Any bottleneck value encountered outside the current range is immediately accepted or rejected. Although this simple scheme does not improve the asymptotic complexity of the algorithm, it drastically reduces the number of probes, as discussed in Section 6.
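A sketch of the straightforward implementation of Fig. 5(a) follows (our code; the dynamic bottleneck-value bounding of Fig. 5(b) is omitted for brevity):

```python
# Our sketch of Nicol's parametric search: for each candidate bottleneck
# processor b, binary-search the smallest i_b with PROBE(W_{i_{b-1}+1, i_b}) true.
import bisect

def probe(W, P, B):
    s = 0
    for _ in range(P):
        s = bisect.bisect_right(W, W[s] + B) - 1
    return s == len(W) - 1

def nicol(w, P):
    W = [0]
    for wi in w:
        W.append(W[-1] + wi)
    N = len(w)
    i_prev = 0                  # processor b starts at task i_prev + 1
    best = float("inf")
    for b in range(1, P):
        lo, hi = i_prev, N      # smallest i with a feasible candidate weight
        while lo < hi:
            mid = (lo + hi) // 2
            if probe(W, P, W[mid] - W[i_prev]):
                hi = mid
            else:
                lo = mid + 1
        best = min(best, W[lo] - W[i_prev])
        i_prev = max(lo - 1, i_prev)    # load P_b maximally below B_opt
    best = min(best, W[N] - W[i_prev])  # last processor as bottleneck
    return best

print(nicol([4, 1, 3, 2, 5], 2))   # 8
```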

4. Proposed CCP algorithms

In this section we present the proposed methods. First, we describe how to bound the separator indices of an optimal solution to reduce the search space. Then we show how this technique can be used to improve the performance of the dynamic-programming algorithm. We continue with our discussion on improving the MS algorithm and propose a novel refinement algorithm, which we call the bidding algorithm. Finally, we discuss parametric-search methods, proposing improvements to both the bisection algorithm and Nicol's algorithm.

4.1. Restricting the search space

Our proposed CCP algorithms exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator values as a preprocessing step. Natural lower and upper bounds for the optimal bottleneck value B_opt of a given CCP problem instance (W, N, P) are LB = max{B*, w_max} and UB = B* + w_max, respectively, where B* = W_tot/P. Since w_max < B* in coarse-grain parallelization of most real-world applications, our presentation assumes w_max < B* = LB, even though all results remain valid when B* is replaced with max{B*, w_max}. The following lemma describes how to use these natural bounds on B_opt to restrict the search space for the separator values.

Fig. 5. Nicol's algorithm [27]: (a) straightforward implementation; (b) careful implementation with dynamic bottleneck-value bounding.

Lemma 4.1. For a given CCP problem instance (W, N, P), if B_f is a feasible bottleneck value in the range [B*, B* + w_max], then there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B_f with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p and SH_p are, respectively, the smallest and largest indices such that

W_{1,SL_p} ≥ p (B* − w_max (P−p)/P)   and   W_{1,SH_p} ≤ p (B* + w_max (P−p)/P).

Proof. Let B_f = B* + w, where 0 ≤ w < w_max. Partition Π can be constructed by PROBE(B_f), which loads the first p processors as much as possible subject to L_q ≤ B_f for q = 1, 2, …, p. In the worst case, w_{s_p+1} = w_max for each of the first p processors. Thus we have W_{1,s_p} ≥ f(w) = p (B* + w − w_max) for p = 1, 2, …, P−1. However, it should also be possible to divide the remaining subchain T_{s_p+1,N} into P−p parts without exceeding B_f, i.e., W_{s_p+1,N} ≤ (P−p)(B* + w). Thus we also have W_{1,s_p} ≥ g(w) = W_tot − (P−p)(B* + w). Note that f(w) is an increasing function of w, whereas g(w) is a decreasing function of w. The minimum of max{f(w), g(w)} is at the intersection of f(w) and g(w), so that W_{1,s_p} ≥ p (B* − w_max (P−p)/P).

To prove the upper bounds, we can start with W_{1,s_p} ≤ f(w) = p (B* + w), which holds when L_q = B* + w for q = 1, …, p. The condition W_{s_p+1,N} ≥ (P−p)(B* + w − w_max), however, ensures the feasibility of B_f = B* + w, since PROBE(B_f) can always load each of the remaining (P−p) processors with B* + w − w_max. Thus we also have W_{1,s_p} ≤ g(w) = W_tot − (P−p)(B* + w − w_max). Here, f(w) is an increasing function of w whereas g(w) is a decreasing function of w, which yields W_{1,s_p} ≤ p (B* + w_max (P−p)/P). □

Corollary 4.2. The separator range weights are ΔW_p = Σ_{i=SL_p}^{SH_p} w_i = W_{1,SH_p} − W_{1,SL_p} = 2 w_max p (P−p)/P, with a maximum value of P w_max/2 at p = P/2.

Applying this corollary requires finding w_max, which entails an overhead equivalent to that of the prefix-sum operation and hence should be avoided. In this work, we adopt a practical scheme to construct the bounds on separator indices. We run the RB heuristic to find a hopefully good bottleneck value B_RB and use B_RB as an upper bound on bottleneck values, i.e., UB = B_RB. Then we run LR-PROBE(B_RB) and RL-PROBE(B_RB) to construct two mappings Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨ℓ2_0, ℓ2_1, …, ℓ2_P⟩ with C(Π1), C(Π2) ≤ B_RB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left; that is, RL-PROBE assigns subchains from the right end of the task chain towards the left end, to processors in the order P_P, P_{P−1}, …, P_1. From these two mappings, lower and upper bound values for the s_p separator indices are constructed as SL_p = ℓ2_p and SH_p = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings Π3 = ⟨ℓ3_0, ℓ3_1, …, ℓ3_P⟩ and Π4 = ⟨h4_0, h4_1, …, h4_P⟩, and then defining SL_p = max{SL_p, ℓ3_p} and SH_p = min{SH_p, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨ℓ2_0, ℓ2_1, …, ℓ2_P⟩ be the partitions constructed by LR-PROBE(B_f) and RL-PROBE(B_f), respectively. Then any partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) = B ≤ B_f satisfies ℓ2_p ≤ s_p ≤ h1_p.

Proof. By the property of LR-PROBE(B_f), h1_p is the largest index such that T_{1,h1_p} can be partitioned into p parts without exceeding B_f. If s_p > h1_p, then the bottleneck value would exceed B_f, and thus B. By the property of RL-PROBE(B_f), ℓ2_p is the smallest index such that T_{ℓ2_p+1,N} can be partitioned into P−p parts without exceeding B_f. If s_p < ℓ2_p, then the bottleneck value would exceed B_f, and thus B. □

Lemma 4.4. For a given CCP problem instance (W, N, P), let Π3 = ⟨ℓ3_0, ℓ3_1, …, ℓ3_P⟩ and Π4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B*) and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B_f, there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B_f that satisfies ℓ3_p ≤ s_p ≤ h4_p.

Proof. Consider the partition Π = ⟨s_0, s_1, …, s_P⟩ constructed by LR-PROBE(B_f). It is clear that this partition already satisfies the lower bounds, i.e., s_p ≥ ℓ3_p. Assume s_p > h4_p; then the partition Π′ obtained by moving s_p back to h4_p also yields a partition of cost C(Π′) ≤ B_f, since T_{h4_p+1,N} can be partitioned into P−p parts without exceeding B*. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions with cost ≤ B_f within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π1 = ⟨h1_0, h1_1, …, h1_P⟩, Π2 = ⟨ℓ2_0, ℓ2_1, …, ℓ2_P⟩, Π3 = ⟨ℓ3_0, ℓ3_1, …, ℓ3_P⟩, and Π4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B_f), RL-PROBE(B_f), LR-PROBE(B*), and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B in the range [B*, B_f], there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p = max{ℓ2_p, ℓ3_p} and SH_p = min{h1_p, h4_p}.

Corollary 4.6. The separator range weights become ΔW_p = 2 min{p, P−p} w_max in the worst case, with a maximum value of P w_max at p = P/2.

Lemma 4.1 and Corollary 4.5 imply the following theorem, since B* ≤ B_opt ≤ B* + w_max.

Theorem 4.7. For a given CCP problem instance (W, N, P), and SL_p and SH_p index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition Π_opt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the separator range weights produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however; the practical scheme normally finds much better bounds, since order in the chain usually prevents the worst-case behavior, and B_RB < B* + w_max. Experimental results in Section 6 justify this expectation.
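The practical bounding scheme of Corollary 4.5 can be sketched as follows; `lr_probe_separators` and `rl_probe_separators` are our illustrative implementations of the left-to-right and right-to-left greedy probes, here run with an assumed feasible B_RB = 8:

```python
# Our sketch (not the paper's code): separator-index bounds from
# LR-PROBE and RL-PROBE on a prefix-summed weight array W.
import bisect

def lr_probe_separators(W, P, B):
    """Left-to-right greedy fill with capacity B; returns [h_0, ..., h_P]."""
    h = [0]
    for _ in range(P - 1):
        h.append(bisect.bisect_right(W, W[h[-1]] + B) - 1)
    h.append(len(W) - 1)
    return h

def rl_probe_separators(W, P, B):
    """Right-to-left greedy fill with capacity B; returns [l_0, ..., l_P]."""
    l = [len(W) - 1]
    for _ in range(P - 1):
        # smallest j such that the part (j, l[-1]] has load <= B
        l.append(bisect.bisect_left(W, W[l[-1]] - B))
    l.append(0)
    return l[::-1]

W = [0, 4, 5, 8, 10, 15]            # prefix sums of w = [4, 1, 3, 2, 5]
SH = lr_probe_separators(W, 2, 8)   # upper bounds from LR-PROBE(B_RB)
SL = rl_probe_separators(W, 2, 8)   # lower bounds from RL-PROBE(B_RB)
print(SL[1], SH[1])                 # 3 3: separator s_1 is pinned to 3
```

On this example the two probes pin s_1 exactly, illustrating how tight the ranges can be in practice.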

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size ΔS_p = SH_p − SL_p + 1 denotes the number of tasks within the pth range [SL_p, SH_p]. Miguet and Pierson [24] propose the model w_i = Θ(w_avg) for i = 1, 2, …, N to prove that their H1 and H2 heuristics allocate Θ(N/P) tasks to each processor, where w_avg = W_tot/N is the average task weight. This assumption means that the weight of each task is not too far from the average task weight. Using Corollaries 4.2 and 4.5, this model induces ΔS_p = O(P w_max/w_avg); moreover, this model can be exploited to induce the optimistic bound ΔS_p = O(P). However, we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from w_avg. Here, we establish a looser and more realistic model on task weights: for any subchain T_{i,j} with weight W_{i,j} sufficiently larger than w_max, the average task weight within T_{i,j} is O(w_avg); that is, Δ_{i,j} = j − i + 1 = O(W_{i,j}/w_avg). This model, referred to here as model M, directly induces ΔS_p = O(P w_max/w_avg), since ΔW_p ≤ ΔW_{P/2} = P w_max/2 for p = 1, 2, …, P−1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index-bound arrays, each of size P, computed according to Corollary 4.5 with B_f = B_RB. Note that SL_P = SH_P = N, since only B[P, N] needs to be computed in the last row. As seen in Fig. 6, only the B^p_j values for j = SL_p, SL_p+1, …, SH_p are computed at each row p, by exploiting Corollary 4.5, which ensures the existence of an optimal partition Π_opt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p. These B^p_j values suffice for the correct computation of the B^{p+1}_i values for i = SL_{p+1}, SL_{p+1}+1, …, SH_{p+1} at the next row p+1.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SH_p to SH_p+1 within the repeat-until loop while computing B^{p+1}_i with SL_{p+1} ≤ i ≤ SH_{p+1}, in two cases. In both cases, the functions W_{j+1,i} and B^p_j intersect in the open interval (SH_p, SH_p+1), so that B^p_{SH_p} < W_{SH_p+1,i} and B^p_{SH_p+1} ≥ W_{SH_p+2,i}. In the first case, i = SH_p+1, so that W_{j+1,i} and B^p_j intersect in (i−1, i), which implies that B^{p+1}_i = W_{i,i} with j^{p+1}_i = i−1, as mentioned in Section 3.2. In the second case, i > SH_p+1, for which Corollary 4.5 guarantees that B^{p+1}_i = W_{SL_p+1,i} ≤ B^p_{SH_p+1}, and thus we can safely select j^{p+1}_i = SL_p. Note that W_{SL_p+1,i} = B^p_{SH_p+1} may correspond to a case leading to another optimal partition with j^{p+1}_i = s_p = SH_p+1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ in B^p_{SH_p+1} as a sentinel: in such cases, the condition W_{SH_p+1,i} < B^p_{SH_p+1} = ∞ in the if-then statement following the repeat-until loop is always true, so that the j-index automatically moves back to SH_p. The seemingly natural scheme of actually computing B^p_{SH_p+1} for each row p does not work, since the correct computation of B^{p+1}_{SH_{p+1}+1} may necessitate more than one B^p_j value beyond the SH_p index bound.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a P × N matrix to store the minimum j^p_i index values defining the B^p_i values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space to one containing at least one optimal solution. For generating all optimal partitions, the index bounds should instead be computed according to Lemma 4.3, since the search space restricted by Lemma 4.3 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔS_p). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔS_p = O(P w_max/w_avg), and hence the complexity is O(N + P log N + P² w_max/w_avg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition w_max = O(2 W_tot/P²).

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for small-to-medium numbers of processors. The main difference between the MS and bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value it encounters becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist

ARTICLE IN PRESS

Fig. 7. Bidding algorithm.

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as the initial partition. This partition is also a leftist partition, since moving any separator to the left cannot decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Π_t = ⟨s_0, s_1, …, s_P⟩ constructed by PROBE(B_t) for an infeasible B_t. After detecting the infeasibility of this B_t value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > B_t will never be to the left of the respective separator indices of Π_t. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current B_t value (i.e., L_P > B_t). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P} {L_p + w_{s_p+1}}, L_P}. Here, we call the value L_p + w_{s_p+1} the bid of processor P_p, which refers to the load of P_p if the first task t_{s_p+1} of the next processor is augmented to P_p. The bid of the last processor P_P is equal to the load of the remaining tasks. If the smallest bid B comes from processor P_b, probing with the new B is performed only for the remaining processors ⟨P_b, P_{b+1}, …, P_P⟩ on the suffix W[s_{b−1}+1 … N] of the W array. The bidding algorithm is presented in Fig. 7.

The innermost while-loop implements a linear probing scheme such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index s_p is set for processor P_p during linear probing, the repeat–until loop terminates if it is not possible to partition the remaining subchain T[s_p+1 … N] onto the P − p remaining processors without exceeding the current B value, i.e., rbid = L_r/(P − p) > B, where L_r denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value of the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P w_max + P²(w_max/w_avg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial setting of the separators through binary search. The B value is increased at most B_opt − B* < w_max times, and each time the next B value can be computed in O(P) time, which induces the O(P w_max) cost. The total area scanned by the separators is at most O(P²(w_max/w_avg)). For noninteger task weights, the complexity can reach O(P log N + P³(w_max/w_avg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
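The bidding idea can be sketched as follows (a simplified Python illustration of our own: it re-probes from scratch at each new B value, whereas the algorithm of Fig. 7 updates the separators incrementally and maintains the prefix-minimum BIDS array):

```python
def bidding(w, P):
    """Start from the ideal bottleneck B* = W_tot/P and, after every
    failed greedy probe, raise B to the smallest processor bid, i.e.
    the smallest realizable bottleneck value above the current one.
    Returns (optimal bottleneck, separator indices)."""
    n = len(w)
    B = sum(w) / P                       # ideal bottleneck value B*
    while True:
        loads, seps, i = [], [], 0
        for _ in range(P):               # greedy probe with bottleneck B
            load = 0.0
            while i < n and load + w[i] <= B:
                load += w[i]
                i += 1
            loads.append(load)
            seps.append(i)
        if i == n:
            return B, seps               # first feasible B is optimal
        loads[-1] += sum(w[i:])          # last processor owns the leftover
        # bid of p < P: its load plus the first task of the next processor;
        # bid of the last processor: its total (leftover-inclusive) load
        bids = [loads[p] + w[seps[p]] for p in range(P - 1) if seps[p] < n]
        bids.append(loads[-1])
        B = min(b for b in bids if b > B)
```

For the toy chain [3, 1, 4, 1, 5, 9, 2, 6] with P = 3, one failed probe at B* ≈ 10.33 raises B directly to the optimal value 14.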


4.4. Parametric search algorithms

In this work, we apply the theoretical findings of Section 4.1 for an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with B_f = B_RB) to restrict the search space for the s_p separator values during the binary searches on the W array. That is, BINSRCH(W, SL_p, SH_p, B_sum) in RPROBE searches W in the index range [SL_p, SH_p] to find the index SL_p ≤ s_p ≤ SH_p such that W[s_p] ≤ B_sum and W[s_p + 1] > B_sum, via binary search. This scheme and Corollaries 4.2 and 4.6 reduce the complexity of an individual probe to Σ_{p=1}^{P} Θ(log ΔS_p) = O(P log P + P log(w_max/w_avg)). Note that this complexity reduces to O(P log P) for sufficiently large P, where P = Ω(w_max/w_avg). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.
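A restricted probe can be sketched in a few lines (Python; the SL/SH arrays are assumed to be valid separator-index bounds in the sense of Corollary 4.5, and their computation is omitted):

```python
import bisect

def rprobe(W, SL, SH, B):
    """Restricted probe sketch.  W is the prefix-summed workload array
    (W[0] = 0, W[j] = w_1 + ... + w_j); SL[p], SH[p] bound the (p+1)th
    separator.  Returns True iff a partition with bottleneck <= B exists
    whose separators lie inside the given bounds."""
    s = 0                                   # s_0 = 0
    for p in range(len(SL)):                # separators s_1 .. s_{P-1}
        # largest j in [SL[p], SH[p]] with W[j] <= W[s] + B
        j = bisect.bisect_right(W, W[s] + B, SL[p], SH[p] + 1) - 1
        if j < SL[p]:
            return False                    # no admissible separator in range
        s = j
    return W[-1] - W[s] <= B                # last processor takes the rest
```

With trivial bounds SL[p] = 0, SH[p] = N this degenerates to the ordinary probe; tighter bounds shrink each binary search from log N to log ΔS_p.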

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, B_RB],

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

as opposed to [B*, W_tot]. In this algorithm, if PROBE(B_t) = TRUE, then the search space is restricted to values B ≤ B_t, and if PROBE(B_t) = FALSE, then the search space is restricted to values B > B_t. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator-index bounds depending on the success or failure of the probes. Let Π_t = ⟨t_0, t_1, …, t_P⟩ be the partition constructed by PROBE(B_t). Any future PROBE(B) call with B ≤ B_t will set the s_p indices with s_p ≤ t_p; thus, the search space for s_p can be restricted to the indices ≤ t_p. Similarly, any future PROBE(B) call with B ≥ B_t will set the s_p indices ≥ t_p; thus, the search space for s_p can be restricted to the indices ≥ t_p.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for-loop over the SL or SH array, depending on the failure or success, respectively, of RPROBE(B_t). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π or SH ← Π, depending on the failure or success of RPROBE(B_t).
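The bisection loop itself is compact. The following Python sketch is our own simplification: an unrestricted greedy probe stands in for RPROBE, and the naive bounds [B*, W_tot] stand in for the tighter range [B*, B_RB]:

```python
def eps_bisect(w, P, eps):
    """ε-approximate bisection: probe the midpoint of [LB, UB] and halve
    the interval until it is narrower than eps.  The returned UB is a
    feasible bottleneck value within eps of the optimum."""
    def probe(B):
        i, n = 0, len(w)
        for _ in range(P):              # fill processors left to right
            load = 0.0
            while i < n and load + w[i] <= B:
                load += w[i]
                i += 1
        return i == n                   # feasible iff every task was assigned

    LB = max(sum(w) / P, max(w))        # ideal bottleneck B*, refined by w_max
    UB = float(sum(w))                  # naive upper bound W_tot
    while UB - LB > eps:
        B = (LB + UB) / 2
        if probe(B):
            UB = B                      # success: optimum is at most B
        else:
            LB = B                      # failure: optimum exceeds B
    return UB
```

The dynamic separator-index bounding described above would additionally shrink each probe's binary searches as the interval narrows.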



Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, B_RB < B* + w_max. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log w_max, and thus the overall complexity is O(N + P log N + log(w_max)(P log P + P log(w_max/w_avg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value, i.e., the total weight of some subchain of W. This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps to find the optimal bottleneck value.

After a probe RPROBE(B_t), the current upper-bound value UB is modified if RPROBE(B_t) succeeds. Note that RPROBE(B_t) not only determines the feasibility of B_t but also constructs a partition Π_t with cost(Π_t) ≤ B_t. Instead of reducing the upper bound UB to B_t, we can further reduce UB to the bottleneck value B = cost(Π_t) ≤ B_t of the partition Π_t constructed by RPROBE(B_t). Similarly, the current lower bound LB is modified when RPROBE(B_t) fails. In this case, instead of increasing the bound LB to B_t, we can exploit the partition Π_t constructed by RPROBE(B_t) to increase LB further, to the smallest realizable bottleneck value B greater than B_t. Our bidding algorithm already describes how to compute

B = min{ min_{1≤p<P} {L_p + w_{s_p+1}}, L_P },

where L_p denotes the load of processor P_p in Π_t. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(w_max/w_avg)),

which has the solution O(P log P log N + P log N log(w_max/w_avg)). Here, O(P log P + P log(w_max/w_avg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(w_max/w_avg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
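The two bound-snapping rules can be sketched together (a Python illustration of the EBS idea under our own simplifications: a plain greedy probe rather than RPROBE, and the naive initial bounds):

```python
def exact_bisect(w, P):
    """Exact bisection sketch: after each probe, snap UB down to the
    realizable cost of the constructed partition, and LB up to the
    smallest bid, so the search runs over the finite set of realizable
    bottleneck values and terminates with the exact optimum."""
    n = len(w)

    def probe(B):
        loads, seps, i = [], [], 0
        for _ in range(P):               # greedy maximal assignment
            load = 0.0
            while i < n and load + w[i] <= B:
                load += w[i]
                i += 1
            loads.append(load)
            seps.append(i)
        feasible = (i == n)
        if not feasible:
            loads[-1] += sum(w[i:])      # last processor inherits the leftover
        return feasible, loads, seps

    LB, UB = max(sum(w) / P, max(w)), float(sum(w))
    while LB < UB:
        feasible, loads, seps = probe((LB + UB) / 2)
        if feasible:
            UB = max(loads)              # realizable cost of this partition
        else:                            # smallest realizable value above B_t
            bids = [loads[p] + w[seps[p]] for p in range(P - 1) if seps[p] < n]
            bids.append(loads[-1])
            LB = min(bids)
    return UB
```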

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = W_tot), the successes and failures of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π_t = ⟨0, t_1, t_2, …, t_{P−1}, N⟩ to be constructed by any PROBE(B_t) with LB_1 < B_t = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < B_t < W_{1,i_1}. Hence, probes with LB_1 < B_t < UB_1 for processor P_2 can be restricted to be performed on W[i_1 … N], where W[i_1 … N] denotes the (N − i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T_{p,i} denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T[i … N] = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P[p … P] = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P, and let C(T_{p,i}) denote its optimal bottleneck value. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T[i_1 … N]. That is, B_opt = min{B_1 = W_{1,i_1}, C(T_{1,i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, we have B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(T_{p,i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i_{b−1}, i_b is found in the inner while-loop by conducting probes on W[i_{b−1} … N] to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1}+1,j} values with SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + w_max(P log P + P log(w_max/w_avg))) for integer task weights, because the number of probes cannot exceed w_max, since there are at most w_max distinct bound values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + w_max(P log P)² + w_max P² log P log(w_max/w_avg)), since the algorithm makes O(w_max P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and computing the separator-index bounds SL and SH according to Corollary 4.5.
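The divide-and-conquer structure above can be sketched without the bounding machinery (Python; this is our own stripped-down rendering of Nicol-style search, not the NC+ algorithm of Fig. 10, which additionally applies RPROBE and dynamic separator-index and bottleneck-value bounding):

```python
import bisect

def nicol(w, P):
    """For each processor in turn, binary-search the smallest leading
    subchain whose weight is a feasible bottleneck for the remaining
    chain and processors; the optimum is the minimum over the candidate
    bottleneck values so produced."""
    n = len(w)
    W = [0]
    for x in w:
        W.append(W[-1] + x)           # prefix sums: W[j] = w_1 + ... + w_j

    def feasible(j, parts, B):
        """Can tasks after the first j be split into `parts` consecutive
        pieces, each of weight <= B?  Greedy maximal pieces."""
        for _ in range(parts):
            if j == n:
                return True
            j = bisect.bisect_right(W, W[j] + B) - 1
        return j == n

    B_opt = float("inf")
    i = 0                             # number of tasks already assigned
    for p in range(P - 1):
        parts_left = P - p - 1        # processors after the current one
        lo, hi = i + 1, n
        while lo < hi:                # smallest feasible candidate end
            mid = (lo + hi) // 2
            if feasible(mid, parts_left, W[mid] - W[i]):
                hi = mid
            else:
                lo = mid + 1
        B_opt = min(B_opt, W[lo] - W[i])   # candidate B_b = W_{i+1, lo}
        i = lo - 1                    # the suffix subproblem starts here
    return min(B_opt, W[n] - W[i])    # last processor takes the remainder
```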

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix, and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A_p^r of A and is responsible for computing y_p = A_p^r x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A_p^c of A and is responsible for computing y^p = A_p^c x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the


atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row (column) of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load-balancing problem for the rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load-balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to a high memory requirement due to the irregular mapping table and an extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so

that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = W_tot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in the array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{ij} can be computed as W_{ij} = ROW[j+1] − ROW[i] in O(1) time without any preprocessing overhead. CCS is similar to CRS, with rows and columns interchanged; thus, W_{ij} is computed as W_{ij} = COL[j+1] − COL[i].
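The point is that the CRS row pointers are already a prefix sum over the row lengths, so no extra pass is needed. A minimal sketch (Python, with a toy row-pointer array of our own, not one of the paper's test matrices):

```python
def crs_subchain_weight(ROW, i, j):
    """Total nonzeros in rows i..j (0-indexed, inclusive), i.e. the
    subchain weight W_{ij} of the RS scheme, read off the CRS row
    pointers in O(1)."""
    return ROW[j + 1] - ROW[i]

# toy 4-row matrix with 2, 1, 4, 2 nonzeros per row
ROW = [0, 2, 3, 7, 9]
w_rows_1_2 = crs_subchain_weight(ROW, 1, 2)   # 1 + 4 = 5 nonzeros
```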

5.1.1. A better load balancing model for iterative solvers

The load-balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{ij} as W_{ij} = COL[j+1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for the visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing the pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed

into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage of processors generating complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times. Each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting such primitives multiple times while adding up mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the w_avg, w_min, and w_max columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. The triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 × 256, 512 × 512, and 1024 × 1024 superimposed on a screen of resolution 1024 × 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because of the zero-weight tasks, which are compressed away to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC− and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in

Table 2
Properties of the test set

Sparse-matrix dataset:

  Name     No. of tasks N   Total workload W_tot   w_avg   w_min   w_max   SpMxV time (ms)
  NL            7 039            105 089            14.93     1      361        2.255
  cre-d         8 926            372 266            41.71     1      845        7.220
  CQ9           9 278            221 590            23.88     1      702        4.590
  ken-11       14 694             82 454             5.61     2      243        1.965
  mod2         34 774            604 910            17.40     1      941       12.405
  world        34 506            582 064            16.87     1      972       11.945

Direct volume rendering (DVR) dataset:

  Name        No. of tasks N   Total workload W_tot   w_avg   w_min     w_max
  blunt256        17 303             303 K             17.54   0.020   159.097
  blunt512        93 231             314 K              3.36   0.004    66.150
  blunt1024      372 824             352 K              0.94   0.001    41.104
  post256         19 653             495 K             25.19   0.077   324.550
  post512        134 950             569 K              4.22   0.015   109.200
  post1024       539 994             802 K              1.49   0.004   154.678

Figs. 5(a) and (b), respectively. Abbreviations ending with "+" represent our improved versions of these algorithms. That is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7, and EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms on the SpM dataset, with ε = 1 for its integer-valued workload arrays. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of the heuristics

and the exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = W_tot/P denotes the ideal bottleneck value. The OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between the exact algorithms and the heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e., Σ_{p=1}^{P−1} ΔS_p / N, where ΔS_p = SH_p − SL_p + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P values and the C5 column reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for the separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of the decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index-bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms

depends on two factors: the number of probes and the cost of each probe. The dynamic index-bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = w_min and then improving the resulting partition using the BID algorithm. The column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns for the SpM dataset shows that using B_RB instead of W_tot

ARTICLE IN PRESS

Table 3
Percent load imbalance values

Sparse-matrix dataset                              DVR dataset
Name     P     H1      H2      RB      OPT        Name        P     H1      H2      RB      OPT
NL       16    2.60    2.44    1.20    0.35       blunt256    16    4.60    1.20    0.49    0.34
         32    5.02    5.75    3.44    0.95                   32    6.93    2.61    1.94    1.12
         64    8.95    9.01    5.60    2.37                   64   14.52    9.44    9.44    2.31
        128   33.13   27.16   22.78    4.99                  128   38.25   24.39   16.67    4.82
        256   69.55   69.55   60.78   14.25                  256   96.72   37.03   34.21   34.21
cre-d    16    2.27    0.98    0.53    0.45       blunt512    16    0.95    0.98    0.98    0.16
         32    4.19    4.42    3.74    1.03                   32    1.38    1.38    1.18    0.33
         64    7.12    4.92    4.34    1.73                   64    2.87    2.69    1.66    0.53
        128   25.57   18.73   16.70    2.88                  128    5.62    8.45    4.62    0.97
        256   37.54   26.81   35.20   10.95                  256   14.18   14.33    9.34    2.28
CQ9      16    1.85    1.85    0.58    0.58       blunt1024   16    0.94    0.57    0.95    0.10
         32    5.65    2.88    2.24    0.90                   32    1.89    1.21    0.97    0.14
         64   13.25   11.49    7.64    1.43                   64    4.99    2.16    1.44    0.26
        128   33.96   32.22   22.34    3.51                  128   10.06    4.25    2.47    0.57
        256   58.62   58.62   58.62   14.72                  256   19.65    9.68    9.68    0.98
ken-11   16    3.74    2.01    0.98    0.21       post256     16    1.10    1.43    0.76    0.56
         32    3.74    4.67    3.74    1.18                   32    3.23    3.98    3.23    1.11
         64   13.17   13.17   13.17    1.29                   64   17.28   11.04   10.90    3.10
        128   13.17   16.89   13.17    6.80                  128   45.35   29.09   29.09    8.29
        256   50.99   50.99   50.99    7.11                  256   67.86   67.86   67.86   67.86
mod2     16    0.06    0.06    0.06    0.03       post512     16    0.49    1.25    0.33    0.33
         32    0.19    0.19    0.19    0.07                   32    0.94    1.61    0.90    0.58
         64    7.42    2.72    2.18    0.18                   64    4.85    5.33    4.85    0.94
        128   16.15    6.29    2.46    0.41                  128   18.03   14.55   10.15    1.72
        256   19.47   19.47   18.92    1.23                  256   30.03   25.29   25.29    3.73
world    16    0.27    0.18    0.09    0.04       post1024    16    0.70    0.70    0.53    0.20
         32    1.03    0.37    0.27    0.08                   32    1.49    1.49    1.41    0.54
         64    4.73    4.73    4.73    0.28                   64    2.85    2.85    1.49    0.91
        128    6.37    6.37    6.37    0.76                  128   13.15   11.79    9.10    1.11
        256   27.99   27.41   27.41    1.11                  256   40.50   13.37   14.19    2.54
Averages over P
         16    1.76    1.15    0.74    0.36                   16    1.44    1.01    1.04    0.28
         32    3.99    3.39    2.88    0.76                   32    2.64    2.05    1.61    0.64
         64    9.01    6.78    5.69    1.43                   64    7.89    5.58    4.96    1.31
        128   19.97   17.66   14.09    3.36                  128   21.74   15.42   12.02    2.91
        256   43.89   38.91   36.04    9.18                  256   44.82   27.93   26.76   18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in msecs) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of existing algorithms and their improved versions are listed in the same order under each respective classification so that improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as


Table 4
Sizes of separator-index ranges, normalized with respect to N

Sparse-matrix dataset                                    DVR dataset
Name     P    (N−P)P    L1      L3      C5             Name        P    (N−P)P    L1      L3      C5
NL       16    15.97    0.32    0.13    0.036          blunt256    16    15.99    0.38    0.07    0.003
         32    31.86    1.23    0.58    0.198                      32    31.94    1.22    0.30    0.144
         64    63.43    4.97    2.24    0.964                      64    63.77    5.10    3.81    0.791
        128   125.69   19.81   19.60    4.305                     128   127.06   21.66   12.88    2.784
        256   246.73   77.96   82.83   22.942                     256   252.23  101.68   50.48   10.098
cre-d    16    15.97    0.16    0.04    0.019          blunt512    16    16.00    0.09    0.17    0.006
         32    31.89    0.76    0.90    0.137                      32    31.99    0.35    0.26    0.043
         64    63.55    2.89    1.85    0.610                      64    63.96    1.64    0.69    0.144
        128   126.18   11.59   15.12    2.833                     128   127.83    6.65    4.59    0.571
        256   248.69   46.66   57.54   16.510                     256   255.30   28.05   16.71    2.262
CQ9      16    15.97    0.30    0.03    0.023          blunt1024   16    16.00    0.05    0.09    0.010
         32    31.89    1.21    0.34    0.187                      32    32.00    0.18    0.18    0.018
         64    63.57    4.84    2.96    0.737                      64    63.99    0.77    0.74    0.068
        128   126.25   19.20   18.66    3.121                     128   127.96    3.43    2.53    0.300
        256   248.96   74.84   86.80   23.432                     256   255.82   13.92   20.36    1.648
ken-11   16    15.98    0.28    0.11    0.024          post256     16    15.99    0.35    0.04    0.021
         32    31.93    1.10    1.13    0.262                      32    31.95    1.50    0.52    0.162
         64    63.73    4.42    7.34    0.655                      64    63.79    6.12    4.26    0.932
        128   126.89   17.60   14.56    4.865                     128   127.17   26.48   23.68    4.279
        256   251.56   69.18   85.44   13.076                     256   252.68  124.76   90.48   16.485
mod2     16    15.99    0.16    0.00    0.002          post512     16    16.00    0.13    0.02    0.003
         32    31.97    0.55    0.04    0.013                      32    31.99    0.56    0.14    0.063
         64    63.88    2.14    1.22    0.109                      64    63.97    2.33    2.41    0.413
        128   127.53    8.62    2.59    0.411                     128   127.88    9.33   10.15    1.714
        256   254.12   34.49   39.02    2.938                     256   255.52   37.08   46.28    6.515
world    16    15.99    0.15    0.01    0.004          post1024    16    16.00    0.17    0.07    0.020
         32    31.97    0.59    0.06    0.020                      32    32.00    0.54    0.27    0.087
         64    63.88    2.30    2.81    0.158                      64    63.99    2.19    0.61    0.210
        128   127.53    9.23    7.19    0.850                     128   127.97    8.77    9.48    0.918
        256   254.11   36.97   53.15    2.635                     256   255.88   34.97   27.28    4.479
Averages over P
         16    15.97    0.21    0.06    0.018                      16    16.00    0.20    0.08    0.011
         32    31.88    0.86    0.68    0.136                      32    31.98    0.73    0.28    0.096
         64    63.52    3.43    2.63    0.516                      64    63.91    3.03    2.09    0.426
        128   126.07   13.68   12.74    2.731                     128   127.65   12.72   10.55    1.428
        256   248.26   54.28   57.55   14.089                     256   254.57   56.74   41.93    6.915

L1, L3, and C5 denote the separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5.


an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset                                    DVR dataset
Name     P     NC−    NC    NC+   εBS   εBS+   EBS      Name        P     NC−    NC    NC+   εBS+&BID   EBS
NL       16    177    21     7    17     6      7       blunt256    16    202    18     6     16+1       9
         32    352    19     7    17     7      7                   32    407    19     7     19+1      10
         64    720    27     5    16     7      6                   64    835    18    10     21+1      12
        128   1450    40    14    17     8      8                  128   1686    25    15     22+1      13
        256   2824    51    14    16     8      8                  256   3556   203    62     23+1      11
cre-d    16    190    20     6    19     7      5       blunt512    16    245    20     8     17+1       9
         32    393    25    11    19     9      8                   32    502    24    13     18+1      10
         64    790    24    10    19     8      6                   64   1003    23    11     18+1      12
        128   1579    27    10    19     9      9                  128   2023    26    12     20+1      13
        256   3183    37    13    18     9      9                  256   4051    23    12     21+1      13
CQ9      16    182    19     3    18     6      5       blunt1024   16    271    21    10     16+1      13
         32    373    22     8    18     8      7                   32    557    24    12     18+1      12
         64    740    29    12    18     8      8                   64   1127    21    11     18+1      12
        128   1492    40    12    17     8      8                  128   2276    28    13     19+1      13
        256   2971    50    14    18     9      9                  256   4564    27    16     21+1      15
ken-11   16    185    21     6    17     6      6       post256     16    194    22     9     18+1       7
         32    364    20     6    17     6      6                   32    409    19     8     20+1      12
         64    721    48     9    17     7      7                   64    823    17    11     22+1      12
        128   1402    61     8    16     7      7                  128   1653    19    13     22+1      14
        256   2783    96     7    17     8      8                  256   3345   191   102     23+1      13
mod2     16    210    20     6    20     5      4       post512     16    235    24     9     16+1      10
         32    432    26     8    19     5      5                   32    485    23    12     18+1      11
         64    867    24     6    19     8      8                   64    975    22    13     20+1      14
        128   1727    38     7    20     7      7                  128   1975    25    17     21+1      14
        256   3444    47     9    20     9      9                  256   3947    29    20     22+1      15
world    16    211    22     6    19     5      4       post1024    16    261    27    10     17+1      13
         32    424    26     5    19     6      6                   32    538    23    13     19+1      14
         64    865    30    12    19     9      9                   64   1090    22    12     17+1      14
        128   1730    44    10    19     8      8                  128   2201    25    15     21+1      16
        256   3441    44    11    19    10     10                  256   4428    28    16     22+1      16
Averages over P
         16    193   20.5   5.7  18.5   5.8    5.2                  16    235   22.0   8.7   16.7+1    10.2
         32    390   23.0   7.5  18.3   6.8    6.5                  32    483   22.0  10.8   18.7+1    11.5
         64    784   30.3   9.0  18.0   7.8    7.3                  64    976   20.5  11.3   19.3+1    12.7
        128   1564   41.7  10.2  18.0   7.8    7.8                 128   1969   24.7  14.2   20.8+1    13.8
        256   3108   54.2  11.3  18.0   8.8    8.8                 256   3982   83.5  38.0   22.0+1    13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 12.77, 5.78, 3.32, 1.59, and 0.71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6
Partitioning times for the sparse-matrix dataset, as percents of SpMxV times

              Heuristics            Existing exact                Proposed exact
Name     P    H1     H2     RB     DP     MS     εBS     NC       DP+      MS+     εBS+    NC+     BID
NL       16   0.09   0.09   0.09    93    119    0.93    1.20     0.40     0.44    0.40    0.40    0.22
         32   0.18   0.18   0.13   177    194    1.77    2.17     1.73     1.73    0.80    0.80    0.44
         64   0.35   0.35   0.31   367    307    3.15    5.85     5.81     4.12    1.86    1.77    1.82
        128   0.89   0.84   0.71   748    485    6.30   17.12    19.87    26.92    4.26    7.32    4.21
        256   1.51   1.55   1.37  1461    757   11.09   40.31    91.80    96.41    9.80   15.70   21.06
cre-d    16   0.03   0.01   0.01    33     36    0.33    0.37     0.12     0.06    0.10    0.11    0.03
         32   0.04   0.06   0.06    71     61    0.65    0.93     0.47     1.20    0.29    0.33    0.10
         64   0.12   0.12   0.08   147     98    1.22    1.69     1.66     1.83    0.61    0.71    0.21
        128   0.28   0.29   0.14   283    150    2.41    3.59     5.47    10.94    1.68    1.68    0.61
        256   0.53   0.54   0.25   571    221    4.20   10.22    27.05    26.02    3.30    4.46    1.20
CQ9      16   0.04   0.04   0.04    59     73    0.48    0.59     0.20     0.13    0.15    0.13    0.17
         32   0.09   0.09   0.07   127    120    0.94    1.31     0.94     0.72    0.41    0.41    0.28
         64   0.20   0.17   0.15   248    195    1.85    3.18     3.01     4.47    1.00    1.44    0.68
        128   0.44   0.44   0.31   485    303    3.18    8.52     9.65    18.82    2.44    3.05    2.51
        256   0.92   0.92   0.63   982    469    6.51   20.33    69.56    72.20    5.32    7.45   14.92
ken-11   16   0.10   0.10   0.10   238    344    1.27    1.88     0.56     0.61    0.46    0.41    0.25
         32   0.20   0.20   0.15   501    522    2.29    3.10     3.82     4.78    1.22    1.17    1.63
         64   0.46   0.46   0.36   993    778    4.48   13.08     9.41    24.78    3.10    3.46    2.29
        128   1.17   1.17   0.71  1876   1139    9.21   39.59    50.13    50.03    6.41    6.62   17.40
        256   2.14   2.14   1.42  3827   1706   17.76  119.13   129.01   241.48   14.91   14.50   29.11
mod2     16   0.02   0.02   0.01    92    119    0.27    0.34     0.07     0.05    0.06    0.06    0.03
         32   0.04   0.04   0.02   188    192    0.51    0.78     0.19     0.18    0.16    0.15    0.07
         64   0.11   0.10   0.06   378    307    1.04    1.91     0.86     2.18    0.51    0.55    0.23
        128   0.24   0.23   0.11   751    482    2.28    5.92     2.61     3.72    1.06    1.23    0.53
        256   0.48   0.48   0.23  1486    756    4.26   11.14    12.11    50.04    2.91    3.43    3.66
world    16   0.02   0.02   0.02    99    122    0.29    0.33     0.08     0.05    0.07    0.08    0.03
         32   0.04   0.04   0.04   198    196    0.59    0.88     0.26     0.21    0.18    0.19    0.08
         64   0.12   0.11   0.09   385    306    1.16    1.70     1.09     4.84    0.64    0.54    0.35
        128   0.24   0.23   0.19   769    496    2.19    5.32     4.48    10.15    1.31    1.23    1.77
        256   0.47   0.47   0.36  1513    764    4.06   11.83    11.54    69.99    3.46    3.50    2.48
Averages over P
         16   0.05   0.05   0.05   102    136    0.60    0.78     0.24     0.22    0.21    0.20    0.12
         32   0.10   0.10   0.08   210    214    1.12    1.53     1.23     1.47    0.51    0.51    0.43
         64   0.23   0.22   0.18   420    332    2.15    4.57     3.64     7.04    1.29    1.41    0.93
        128   0.54   0.53   0.36   819    509    4.26   13.34    15.37    20.10    2.86    3.52    4.51
        256   1.01   1.01   0.71  1640    779    7.98   35.49    56.84    92.69    6.62    8.17   12.07


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial settings of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small to moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in msecs) for the DVR dataset

                 Prefix   Heuristics          Existing exact             Proposed exact
Name       P     sum      H1     H2     RB    DP       MS      NC        DP+      MS+      NC+     EBS    BID
blunt256   16             0.02   0.02   0.03    0.68    0.49    0.36      0.21     0.53    0.10    0.14    0.03
           32             0.05   0.05   0.05    1.41    0.77    0.77      0.76     0.76    0.21    0.27    0.24
           64    1.95     0.11   0.12   0.09    2.86    1.34    1.39      3.86     8.72    0.64    0.71    0.78
          128             0.24   0.25   0.20    5.81    2.06    3.66     12.37    29.95    1.78    1.84    2.33
          256             0.47   0.50   0.37   11.39    2.96   52.53     42.89     0.78   13.96    3.11   56.69
blunt512   16             0.03   0.03   0.03    3.56    2.00    0.55      0.25     0.91    0.14    0.16    0.09
           32             0.07   0.07   0.07    7.92    3.53    1.31      1.23     2.13    0.44    0.43    0.22
           64   13.45     0.18   0.19   0.14   16.88    5.93    2.65      3.68     9.28    0.94    1.10    0.45
          128             0.39   0.40   0.28   34.69    9.79    5.84     15.01    46.72    2.29    2.63    1.65
          256             0.74   0.75   0.54   70.40   16.37    9.70     57.50   164.49    5.59    5.95   10.72
blunt1024  16             0.03   0.03   0.03   14.55    7.80    0.75      0.93     4.77    0.25    0.28    0.19
           32             0.12   0.11   0.08   32.51   14.32    1.81      2.02     8.92    0.60    0.62    0.31
           64   59.12     0.27   0.27   0.19   69.76   23.53    3.50      7.35    22.28    1.27    1.37    1.39
          128             0.50   0.51   0.37  143.37   39.11    9.14     33.01    69.38    3.36    3.56   16.21
          256             0.97   1.00   0.68  294.17   65.67   16.59    180.09   538.10    8.48    8.98   16.29
post256    16             0.02   0.03   0.03    0.79    0.61    0.46      0.18     0.13    0.10    0.11    0.24
           32             0.05   0.05   0.05    1.57    1.00    0.77      1.05     1.74    0.26    0.31    0.76
           64    2.16     0.10   0.11   0.10    3.36    1.67    1.37      5.26     9.72    0.61    0.70    4.67
          128             0.25   0.26   0.18    6.72    2.78    2.90     22.94    55.10    1.65    1.77   23.96
          256             0.47   0.48   0.37   13.71    3.38   45.16     84.78     0.77   13.04    3.62  299.32
post512    16             0.02   0.02   0.03    6.12    4.54    0.63      0.20     0.44    0.14    0.16    1.21
           32             0.07   0.07   0.07   12.54    6.99    1.30      2.49     1.95    0.38    0.42    3.36
           64   20.55     0.18   0.18   0.13   25.59   11.39    2.58     16.93    38.73    0.97    1.07    8.53
          128             0.38   0.39   0.28   52.21   19.36    5.84     68.25   156.15    2.27    2.44   31.54
          256             0.70   0.73   0.53  106.83   33.54   12.67    256.48   726.54    5.44    5.39  140.88
post1024   16             0.03   0.04   0.03   24.46   18.63    0.91      2.88     5.73    0.17    0.21    2.63
           32             0.09   0.08   0.08   50.56   29.17    1.61     13.57    17.37    0.51    0.58   12.73
           64   69.09     0.25   0.26   0.17  103.40   46.26    3.56     35.05    33.25    1.35    1.47   32.53
          128             0.51   0.53   0.36  211.02   78.38    7.90    156.34   546.92    2.95    3.28   76.96
          256             0.95   0.98   0.67  436.52  137.70   16.30    764.59  1451.87    7.55    7.02  300.70
Averages normalized with respect to RB times (prefix-sum time not included)
           16             0.83   0.94   1.00  278.63  189.19   20.33     25.83    69.50    5.00    5.89   24.39
           32             1.10   1.06   1.00  231.69  121.56   18.47     47.37    72.82    5.83    6.46   39.02
           64             1.30   1.35   1.00  226.36   92.91   17.88     82.81   145.19    7.00    7.81   53.81
          128             1.35   1.39   1.00  225.06   75.53   20.46    168.36   481.19    8.60    9.31   86.81
          256             1.35   1.39   1.00  247.31   68.80   59.10    390.25   772.99   19.55   10.51  286.77
Averages normalized with respect to RB times (prefix-sum time included)
           16             1.00   1.00   1.00    0.32    0.23    1.08      1.04     1.09    1.01    1.02    1.03
           32             1.00   1.00   1.00    0.66    0.36    1.15      1.21     1.29    1.04    1.05    1.13
           64             1.00   1.00   1.00    1.35    0.58    1.27      1.97     2.90    1.11    1.12    1.55
          128             1.01   1.01   1.00    2.75    0.94    1.62      4.75    10.53    1.28    1.30    3.25
          256             1.02   1.02   1.00    5.28    1.42    7.98     14.64    13.71    2.95    1.55   26.73


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for parametric-search algorithms to narrow separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating lower and upper bounds into realizable values after each probe. We proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1d rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high-performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high-performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


Fig. 1. O((N−P)P)-time dynamic-programming algorithm proposed by Choi and Narahari [6] and Olstad and Manne [30].


index i2 such that w_{i2} = w_max and W_{1,i2−1} = W_{i2+1,s_{P/2}} = (W_tot − w_max)/4, and t_{i2} is assigned to the left part, so that s_{P/4} = i2 and W_{1,s_{P/4}} = (W_tot − w_max)/4 + w_max = W_tot/4 + (3/4)w_max. For a sequence of log P such worst-case bisection steps on the left parts, processor P_1 will be the bottleneck processor, with load B_RB = W_{1,s_1} = W_tot/P + w_max(P−1)/P. □

3.2. Dynamic programming

The overlapping subproblem space can be defined as T^p_i, for p = 1, 2, …, P and i = p, p+1, …, N−P+p, where T^p_i denotes a p-way CCP of the prefix task-subchain T_{1,i} = ⟨t_1, t_2, …, t_i⟩ onto the prefix processor-subchain P_{1,p} = ⟨P_1, P_2, …, P_p⟩. Notice that index i is restricted to the range p ≤ i ≤ N−P+p because there is no merit in leaving a processor empty. From this subproblem-space definition, the optimal substructure property of the CCP problem can be shown by considering an optimal mapping Π^p_i = ⟨s_0, s_1, …, s_p = i⟩ with a bottleneck value B^p_i for the CCP subproblem T^p_i. If the last processor is not the bottleneck processor in Π^p_i, then Π^{p−1}_{s_{p−1}} = ⟨s_0, s_1, …, s_{p−1}⟩ should be an optimal mapping for the subproblem T^{p−1}_{s_{p−1}}. Hence, the recursive definition for the bottleneck value of an optimal mapping is

    B^p_i = min_{p−1 ≤ j < i} { max { B^{p−1}_j , W_{j+1,i} } }.        (1)

In (1), searching for index j corresponds to searching for separator s_{p−1}, so that the remaining subchain T_{j+1,i} is assigned to the last processor P_p in an optimal mapping Π^p_i of T^p_i. The bottleneck value B^P_N of an optimal mapping can be computed using (1) in a bottom-up fashion, starting from B^1_i = W_{1,i} for i = 1, 2, …, N. An initial prefix-sum on the workload array W enables constant-time computation of subchain weights of the form W_{j+1,i} through W_{j+1,i} = W[i] − W[j]. Computing B^p_i using (1) takes O(N−p) time for each i and p, and thus the algorithm takes O((N−P)²P) time, since the number of distinct subproblems is equal to (N−P+1)P.

Choi and Narahari [6], and Olstad and Manne [30],

reduced the complexity of this scheme to O(NP) and O((N−P)P), respectively, by exploiting the following observations, which hold for positive task weights. For a fixed p in (1), the minimum index value j^p_i defining B^p_i cannot occur at a value less than the minimum index value j^p_{i−1} defining B^p_{i−1}; i.e., j^p_{i−1} ≤ j^p_i ≤ (i−1). Hence, the search for the optimal j^p_i can start from j^p_{i−1}. In (1), B^{p−1}_j for a fixed p is a nondecreasing function of j, and W_{j+1,i} for a fixed i is a decreasing function of j, reducing to 0 at j = i. Thus, two cases occur in the semi-closed interval [j^p_{i−1}, i) for j. If W_{j+1,i} > B^{p−1}_j initially, then these two functions intersect in [j^p_{i−1}, i). In this case, the search for j^p_i continues until W_{j*+1,i} ≤ B^{p−1}_{j*}, and then only j* and j*−1 are considered for setting j^p_i, with j^p_i = j* if B^{p−1}_{j*} ≤ W_{j*,i}, and j^p_i = j*−1 otherwise. Note that this scheme automatically detects j^p_i = i−1 if W_{j+1,i} and B^{p−1}_j intersect in the open interval (i−1, i). However, if W_{j+1,i} ≤ B^{p−1}_j initially, then B^{p−1}_j lies above W_{j+1,i} in the closed interval [j^p_{i−1}, i]. In this case, the minimum value occurs at the first value of j; i.e., j^p_i = j^p_{i−1}. These improvements lead to an O((N−P)P)-time algorithm, since the computation of all B^p_i values for a fixed p makes O(N−P) references to already computed B^{p−1}_j values.

Fig. 1 displays a run-time efficient implementation of this O((N−P)P)-time DP algorithm, which avoids the explicit min-max operation required in (1). In Fig. 1, B^p_i values are stored in a table whose entries are computed in row-major order.
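The scheme above can be sketched in C as follows. This is an illustration of the prose description under our own naming, not the code of Fig. 1: the search for j^p_i resumes from j^p_{i−1}, and only the crossing index j* and j*−1 are compared.

```c
#include <stdlib.h>

/* W[0..N] holds prefix sums (W[0] = 0), so W_{j+1,i} is W[i] - W[j].
   Returns the optimal bottleneck value B^P_N. */
double dp_bottleneck_fast(const double *W, int N, int P) {
    double *B    = malloc((N + 1) * sizeof *B);    /* row p-1: B^{p-1}_j */
    double *Bnew = malloc((N + 1) * sizeof *Bnew); /* row p:   B^p_i     */
    for (int i = 0; i <= N; i++) B[i] = W[i];      /* B^1_i = W_{1,i} */
    for (int p = 2; p <= P; p++) {
        int j = p - 1;                             /* j^p_i is nondecreasing in i */
        for (int i = p; i <= N - P + p; i++) {
            /* advance to j*: the first j with W_{j+1,i} <= B^{p-1}_j */
            while (j < i - 1 && W[i] - W[j] > B[j]) j++;
            double cand = B[j] > W[i] - W[j] ? B[j] : W[i] - W[j];
            if (j > p - 1) {                       /* j*-1 may be better */
                double last = W[i] - W[j-1];
                double alt  = B[j-1] > last ? B[j-1] : last;
                if (alt <= cand) { cand = alt; j--; }
            }
            Bnew[i] = cand;
        }
        for (int i = p; i <= N - P + p; i++) B[i] = Bnew[i];
    }
    double opt = B[N];
    free(B);
    free(Bnew);
    return opt;
}
```

Because j only moves forward (apart from the single step back to j*−1), each row costs O(N−P) amortized time.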

3.3. Iterative refinement

The algorithm proposed by Manne and Sørevik [23], referred to here as the MS algorithm, finds a sequence of nonoptimal partitions such that there is only one way each partition can be improved. For this purpose, they introduce the leftist partition (LP). Consider a partition Π such that P_p is the leftmost processor containing at least two tasks. Π is defined as an LP if increasing the load of any processor P_c that lies to the right of P_p, by augmenting the last task of P_{c−1} to P_c, makes P_c a bottleneck processor with a load ≥ C(Π). Let Π be an LP with bottleneck processor P_b and bottleneck value B. If P_b contains only one task, then Π is optimal. On the other hand, assume that P_b contains at least two tasks. The refinement step, which is shown by the inner while-loop in Fig. 2, tries to find a new LP of lower cost by successively removing the first task of P_p and augmenting it to P_{p−1}, for p = b, b−1, …, until L_p < B. Refinement fails when the while-loop proceeds until p = 1 with L_p ≥ B. Manne and Sørevik proved that a successful refinement of an LP gives a new LP, and that the LP is optimal if the refinement fails. They proposed using an initial LP in which the P−1 leftmost processors each has only one task and the last processor contains the remaining tasks. The MS algorithm moves each separator index at most N−P times, so that the total number of separator-index moves is O(P(N−P)). A max-heap is maintained for the processor loads to find a bottleneck processor at the beginning of each repeat-loop iteration. The cost of each separator-index move is O(log P), since it necessitates one decrease-key and one increase-key operation. Thus, the complexity of the MS algorithm is O(P(N−P) log P).
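The refinement loop can be sketched on separator indices as follows. A linear scan over processor loads stands in for the max-heap of Fig. 2 (so this sketch spends O(P) rather than O(log P) per iteration), and the function name is ours.

```c
#include <stdlib.h>

/* W[0..N] holds prefix sums (W[0] = 0); s[p] is the index of the last
   task of processor p, with s[0] = 0 and s[P] = N.  Returns the optimal
   bottleneck value. */
double ms_bottleneck(const double *W, int N, int P) {
    int *s = malloc((P + 1) * sizeof *s);
    for (int p = 0; p < P; p++) s[p] = p;  /* initial leftist partition */
    s[P] = N;
    for (;;) {
        int b = 1;                         /* leftmost bottleneck processor */
        for (int p = 2; p <= P; p++)
            if (W[s[p]] - W[s[p-1]] > W[s[b]] - W[s[b-1]]) b = p;
        double B = W[s[b]] - W[s[b-1]];
        if (s[b] - s[b-1] == 1) {          /* single task: partition optimal */
            free(s);
            return B;
        }
        /* refinement: move the first task of P_p to P_{p-1}, cascading
           left from p = b until some load drops below B */
        int p = b;
        while (p > 1) {
            s[p-1]++;                      /* task t_{s[p-1]+1} moves left */
            if (W[s[p-1]] - W[s[p-2]] < B) break;  /* L_{p-1} < B: success */
            p--;
        }
        if (p == 1) {                      /* refinement failed: B is optimal */
            free(s);
            return B;
        }
    }
}
```

Separator indices only move rightward, which bounds the total number of moves by O(P(N−P)) as stated above.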

3.4. Parametric search

The parametric-search approach relies on repeated probing for a partition Π with a bottleneck value no

Fig. 2. Iterative refinement algorithm proposed by Manne and Sørevik [23].

Fig. 3(a). Standard probe algorithm with O(P log N) complexity.

greater than a given B value. Probe algorithms exploit the greedy-choice property for the existence and construction of Π. The greedy choice here is to minimize the remaining work after loading processor P_p, subject to L_p ≤ B, for p = 1, …, P−1, in order. The PROBE(B) functions given in Fig. 3 exploit this greedy property as follows. PROBE finds the largest index s_1 so that W_{1,s1} ≤ B and assigns subchain T_{1,s1} to processor P_1 with load L_1 = W_{1,s1}. Hence, the first task in the second processor is t_{s1+1}. PROBE then similarly finds the largest index s_2 so that W_{s1+1,s2} ≤ B and assigns the subchain T_{s1+1,s2} to processor P_2. This process continues until either all tasks are assigned or all processors are exhausted. The former and latter cases indicate the feasibility and infeasibility of B, respectively.

Fig. 3(a) illustrates the standard probe algorithm. The indices s_1, s_2, …, s_{P−1} are efficiently found through binary search (BINSRCH) on the prefix-summed W-array. In this figure, BINSRCH(W, i, N, Bsum) searches W in the index range [i, N] to compute the index i ≤ j ≤ N such that W[j] ≤ Bsum and W[j+1] > Bsum. The complexity of the standard probe algorithm is O(P log N). Han et al. [12] proposed an O(P log(N/P))-time probe algorithm (see Fig. 3(b)) exploiting the P repeated binary searches on the same W array with increasing search values. Their algorithm divides the chain into P subchains of equal length. At each probe, a linear search is performed on the weights of the last tasks of these P subchains to find out in which subchain the search value could be, and then binary search is performed on the respective subchain of length N/P. Note that, since the probe search values always increase, the linear search can be performed incrementally; that is, the search continues from the last subchain that was searched to the right, with O(P) total cost. This gives a total cost of O(P log(N/P)) for the P binary searches, and thus for the probe function.

ethP logethN=PTHORNTHORN-time probe algorithm proposed by Han et al [12]
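The standard PROBE of Fig. 3(a) can be sketched as follows, using Python's `bisect` for the binary search on the prefix-summed weights. This is an illustrative sketch, not the paper's pseudocode; the function and variable names are ours.

```python
import bisect

def probe(prefix, P, B):
    """Greedy feasibility test of Fig. 3(a): assign to each processor in
    turn the longest remaining prefix whose load fits in B, via binary
    search on the prefix-summed weights.  `prefix[j]` holds w_1 + ... + w_j,
    with prefix[0] == 0.  Runs in O(P log N)."""
    N = len(prefix) - 1
    s = 0                            # index of the last task assigned so far
    for _ in range(P):
        if prefix[N] - prefix[s] <= B:
            return True              # the rest fits on one processor
        # largest j with W_{s+1,j} = prefix[j] - prefix[s] <= B
        nxt = bisect.bisect_right(prefix, prefix[s] + B) - 1
        if nxt == s:
            return False             # a single task already exceeds B
        s = nxt
    return False                     # tasks remain but processors are spent
```

For example, with weights (2, 3, 1, 3) and P = 2, the bottleneck value 5 is feasible while 4 is not.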

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

3.4.1. Bisection as an approximation algorithm

Let f(B) be the binary-valued function where f(B) = 1 if PROBE(B) is true and f(B) = 0 if PROBE(B) is false. Clearly, f(B) is nondecreasing in B, and B_opt lies between LB = B* = W_tot/P and UB = W_tot. These observations are exploited in the bisection algorithm, leading to an efficient ε-approximate algorithm, where ε is the desired precision. The interval [W_tot/P, W_tot] is conceptually discretized into (W_tot − W_tot/P)/ε bottleneck values, and binary search is used in this range to find the minimum feasible bottleneck value B_opt. The bisection algorithm, as illustrated in Fig. 4, performs O(log(W_tot/ε)) PROBE calls, and each PROBE call costs O(P log(N/P)). Hence, the bisection algorithm runs in O(N + P log(N/P) log(W_tot/ε)) time, where the O(N) cost comes from the initial prefix-sum operation on W. The performance of this algorithm deteriorates when log(W_tot/ε) is comparable with N.
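The ε-approximate bisection can be sketched as below. The greedy probe is repeated here to keep the listing self-contained; the names are ours, and this is an illustration of the idea rather than the pseudocode of Fig. 4.

```python
import bisect

def feasible(prefix, P, B):
    """Greedy probe: True iff the chain splits into P parts of load <= B."""
    N, s = len(prefix) - 1, 0
    for _ in range(P):
        if prefix[N] - prefix[s] <= B:
            return True
        nxt = bisect.bisect_right(prefix, prefix[s] + B) - 1
        if nxt == s:
            return False
        s = nxt
    return False

def bisect_approx(w, P, eps=1.0):
    """Binary-search the bottleneck value over [W_tot/P, W_tot], probing
    the midpoint, until the interval is narrower than eps.  Returns a
    feasible bottleneck value within eps of the optimum."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    lb, ub = prefix[-1] / P, float(prefix[-1])
    while ub - lb > eps:
        mid = (lb + ub) / 2
        if feasible(prefix, P, mid):
            ub = mid              # feasible: the optimum is at most mid
        else:
            lb = mid              # infeasible: the optimum exceeds mid
    return ub
```

For integer workloads, setting eps = 1 makes the returned bound exact after rounding, as discussed for the ε-BISECT algorithms later in the paper.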

3.4.2. Nicol's algorithm

Nicol's algorithm [27] exploits the fact that any candidate B value is equal to the weight W_{i,j} of some subchain. A naive solution is to generate all subchain weights of the form W_{i,j}, sort them, and then use binary search to find the minimum W_{a,b} value for which PROBE(W_{a,b}) = TRUE. Nicol's algorithm efficiently searches for the earliest range W_{a,b} for which B_opt = W_{a,b}, by considering each processor in order as a candidate bottleneck processor in an optimal mapping. Let Π_opt be the optimal mapping constructed by greedy PROBE(B_opt), and let processor P_b be the first bottleneck processor, with load B_opt, in Π_opt = ⟨s_0, s_1, …, s_b, …, s_P⟩. Under these assumptions, this greedy construction of Π_opt ensures that each processor P_p preceding P_b is loaded as much as possible, with L_p < B_opt, for p = 1, 2, …, b − 1 in Π_opt. Here, PROBE(L_p) = FALSE since L_p < B_opt, and PROBE(L_p + w_{s_p+1}) = TRUE since adding one more task to processor P_p increases its load to L_p + w_{s_p+1} ≥ B_opt. Hence, if b = 1, then s_1 is equal to the smallest index i_1 such that PROBE(W_{1,i_1}) = TRUE, and B_opt = B_1 = W_{1,s_1}. However, if b > 1, then because of the greedy-choice property, P_1 should be loaded as much as possible without exceeding B_opt = B_b < B_1, which implies that s_1 = i_1 − 1 and hence L_1 = W_{1,i_1−1}. If b = 2, then s_2 is equal to the smallest index i_2 such that PROBE(W_{i_1,i_2}) = TRUE, and B_opt = B_2 = W_{i_1,i_2}. If b > 2, then s_2 = i_2 − 1. We iterate for b = 1, 2, …, P − 1, computing i_b as the smallest index for which PROBE(W_{i_{b−1},i_b}) = TRUE, and save B_b = W_{i_{b−1},i_b}, with i_P = N. Finally, the optimal bottleneck value is selected as B_opt = min_{1≤b≤P} B_b.

Fig. 4. Bisection as an ε-approximation algorithm.

Fig. 5 illustrates Nicol's algorithm. As seen in this figure, given i_{b−1}, i_b is found by performing a binary search over all subchain weights of the form W_{i_{b−1},j}, for i_{b−1} ≤ j ≤ N, in the bth iteration of the for loop. Hence, Nicol's algorithm performs O(log N) PROBE calls to find i_b at iteration b, and each probe call costs O(P log(N/P)). Thus, the cost of computing an individual B_b value is O(P log N log(N/P)). Since P − 1 such B_b values are computed, the overall complexity of Nicol's algorithm is O(N + P² log N log(N/P)), where the O(N) cost comes from the initial prefix-sum operation on W.

Two possible implementations of Nicol's algorithm

are presented in Fig. 5. Fig. 5(a) illustrates a straightforward implementation, whereas Fig. 5(b) illustrates a careful implementation, which maintains and uses the information gathered from previous probes to answer some probes without calling the PROBE function. As seen in Fig. 5(b), this information is efficiently maintained as an undetermined bottleneck-value range (LB, UB), which is dynamically refined after each probe. Any bottleneck value encountered outside the current range is immediately accepted or rejected. Although this simple scheme does not improve the asymptotic complexity of the algorithm, it drastically reduces the number of probes, as discussed in Section 6.
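The careful implementation of Fig. 5(b) can be sketched as follows: candidate bottleneck values outside the undetermined range (LB, UB) are accepted or rejected without a probe call, and real probes tighten the range. This is a sketch in our own notation, not the paper's pseudocode.

```python
import bisect

def nicol(w, P):
    """Nicol's parametric search with dynamic bottleneck-value bounding.
    At stage b, binary-search the smallest prefix index j whose subchain
    weight is a feasible bottleneck; record it as the candidate B_b."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    N = len(w)

    def probe(B):
        # greedy feasibility test: P parts, each of load <= B
        s = 0
        for _ in range(P):
            if prefix[N] - prefix[s] <= B:
                return True
            nxt = bisect.bisect_right(prefix, prefix[s] + B) - 1
            if nxt == s:
                return False         # a single task already exceeds B
            s = nxt
        return False

    LB, UB = 0.0, float("inf")       # undetermined bottleneck-value range
    best = float("inf")
    start = 0                        # tasks consumed by processors 1..b-1
    for b in range(1, P):
        lo, hi = start + 1, N        # smallest j whose weight is feasible
        while lo < hi:
            mid = (lo + hi) // 2
            B = prefix[mid] - prefix[start]
            if B >= UB:
                ok = True            # known feasible: probe skipped
            elif B <= LB:
                ok = False           # known infeasible: probe skipped
            else:
                ok = probe(B)
                if ok:
                    UB = B           # refine the undetermined range
                else:
                    LB = B
            if ok:
                hi = mid
            else:
                lo = mid + 1
        best = min(best, prefix[lo] - prefix[start])
        start = lo - 1               # load P_b just below the candidate
    return min(best, prefix[N] - prefix[start])
```

Because the global probe is monotone in B, every value at or above a known-feasible bound (or at or below a known-infeasible one) can be resolved in O(1) time.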

4. Proposed CCP algorithms

In this section, we present the proposed methods. First, we describe how to bound the separator indices of an optimal solution to reduce the search space. Then, we show how this technique can be used to improve the performance of the dynamic-programming algorithm. We continue with our discussion on improving the MS algorithm, and propose a novel refinement algorithm, which we call the bidding algorithm. Finally, we discuss parametric-search methods, proposing improvements for both the bisection and Nicol's algorithms.

4.1. Restricting the search space

Fig. 5. Nicol's algorithm [27]: (a) straightforward implementation; (b) careful implementation with dynamic bottleneck-value bounding.

Our proposed CCP algorithms exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for the separator values as a preprocessing step. Natural lower and upper bounds for the optimal bottleneck value B_opt of a given CCP problem instance (W, N, P) are LB = max{B*, w_max} and UB = B* + w_max, respectively, where B* = W_tot/P. Since w_max < B* in coarse-grain parallelization of most real-world applications, our presentation assumes w_max < B* = LB, even though all results remain valid when B* is replaced with max{B*, w_max}. The following lemma describes how to use these natural bounds on B_opt to restrict the search space for the separator values.

Lemma 4.1. For a given CCP problem instance (W, N, P), if B_f is a feasible bottleneck value in the range [B*, B* + w_max], then there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B_f with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p and SH_p are, respectively, the smallest and largest indices such that

W_{1,SL_p} ≥ p(B* − w_max(P − p)/P) and
W_{1,SH_p} ≤ p(B* + w_max(P − p)/P).

Proof. Let B_f = B* + w, where 0 ≤ w < w_max. Partition Π can be constructed by PROBE(B_f), which loads the first p processors as much as possible, subject to L_q ≤ B_f for q = 1, 2, …, p. In the worst case, w_{s_p+1} = w_max for each of the first p processors. Thus, we have W_{1,s_p} ≥ f(w) = p(B* + w − w_max), for p = 1, 2, …, P − 1. However, it should be possible to divide the remaining subchain T_{s_p+1,N} into P − p parts without exceeding B_f, i.e., W_{s_p+1,N} ≤ (P − p)(B* + w). Thus, we also have W_{1,s_p} ≥ g(w) = W_tot − (P − p)(B* + w). Note that f(w) is an increasing function of w, whereas g(w) is a decreasing function of w. The minimum of max{f(w), g(w)} is attained at the intersection of f(w) and g(w), so that W_{1,s_p} ≥ p(B* − w_max(P − p)/P).

To prove the upper bounds, we can start with W_{1,s_p} ≤ f(w) = p(B* + w), which holds when L_q = B* + w for q = 1, …, p. The condition W_{s_p+1,N} ≥ (P − p)(B* + w − w_max), however, ensures the feasibility of B_f = B* + w, since PROBE(B_f) can always load each of the remaining (P − p) processors with at least B* + w − w_max. Thus, we also have W_{1,s_p} ≤ g(w) = W_tot − (P − p)(B* + w − w_max). Here, f(w) is an increasing function of w, whereas g(w) is a decreasing function of w, which yields W_{1,s_p} ≤ p(B* + w_max(P − p)/P). □

Corollary 4.2. The separator range weights are ΔW_p = Σ_{i=SL_p}^{SH_p} w_i = W_{1,SH_p} − W_{1,SL_p} = 2 w_max p(P − p)/P, with a maximum value of P w_max/2 at p = P/2.

Applying this corollary requires finding w_max, which entails an overhead equivalent to that of the prefix-sum operation and hence should be avoided. In this work, we adopt a practical scheme to construct the bounds on the separator indices. We run the RB heuristic to find a hopefully good bottleneck value B_RB, and use B_RB as an upper bound on bottleneck values, i.e., UB = B_RB. Then, we run LR-PROBE(B_RB) and RL-PROBE(B_RB) to construct two mappings Π_1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π_2 = ⟨l2_0, l2_1, …, l2_P⟩ with C(Π_1), C(Π_2) ≤ B_RB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered as the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left; that is, RL-PROBE assigns subchains from the right end towards the left end of the task chain to the processors in the order ⟨P_P, P_{P−1}, …, P_1⟩. From these two mappings, lower and upper bound values for the s_p separator indices are constructed as SL_p = l2_p and SH_p = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings Π_3 = ⟨l3_0, l3_1, …, l3_P⟩ and Π_4 = ⟨h4_0, h4_1, …, h4_P⟩, and then defining SL_p = max{SL_p, l3_p} and SH_p = min{SH_p, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.
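The LR/RL probe pair used by this practical scheme can be sketched as follows: SH comes from the left-to-right probe and SL from the right-to-left probe, both run with a feasible value such as B_RB (the further refinement with B* probes is omitted here). Function names and the prefix-sum representation are our own.

```python
import bisect

def lr_probe(prefix, P, B):
    """Left-to-right greedy probe: return the separators <h_0,...,h_P>
    it builds, loading each processor maximally subject to load <= B."""
    N = len(prefix) - 1
    seps, s = [0], 0
    for _ in range(P - 1):
        s = max(s, bisect.bisect_right(prefix, prefix[s] + B) - 1)
        s = min(s, N)
        seps.append(s)
    seps.append(N)
    return seps

def rl_probe(prefix, P, B):
    """Right-to-left dual: assign maximal subchains starting from the
    right end of the chain, processors P_P down to P_1."""
    N = len(prefix) - 1
    seps, s = [N], N
    for _ in range(P - 1):
        # smallest j with prefix[s] - prefix[j] <= B
        j = bisect.bisect_left(prefix, prefix[s] - B)
        s = min(j, s)
        seps.append(s)
    seps.append(0)
    return seps[::-1]

def separator_bounds(prefix, P, B_ub):
    """SL from the right-to-left probe, SH from the left-to-right probe,
    both run with the feasible bottleneck value B_ub (e.g., B_RB)."""
    return rl_probe(prefix, P, B_ub), lr_probe(prefix, P, B_ub)
```

On the chain (2, 3, 1, 3) with P = 2 and B_ub = 5, both probes place the single separator at index 2, so the separator range collapses to one candidate.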

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π_1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π_2 = ⟨l2_0, l2_1, …, l2_P⟩ be the partitions constructed by LR-PROBE(B_f) and RL-PROBE(B_f), respectively. Then, any partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) = B ≤ B_f satisfies l2_p ≤ s_p ≤ h1_p.

Proof. By the property of LR-PROBE(B_f), h1_p is the largest index such that T_{1,h1_p} can be partitioned into p parts without exceeding B_f. If s_p > h1_p, then the bottleneck value will exceed B_f, and thus B. By the property of RL-PROBE(B_f), l2_p is the smallest index such that T_{l2_p+1,N} can be partitioned into P − p parts without exceeding B_f. If s_p < l2_p, then the bottleneck value will exceed B_f, and thus B. □

Lemma 4.4. For a given CCP problem instance (W, N, P), let Π_3 = ⟨l3_0, l3_1, …, l3_P⟩ and Π_4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B*) and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B_f, there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B_f that satisfies l3_p ≤ s_p ≤ h4_p.

Proof. Consider the partition Π = ⟨s_0, s_1, …, s_P⟩ constructed by LR-PROBE(B_f). It is clear that this partition already satisfies the lower bounds, i.e., s_p ≥ l3_p. Assume s_p > h4_p; then the partition Π′ obtained by moving s_p back to h4_p also yields a partition of cost C(Π′) ≤ B_f, since T_{h4_p+1,N} can be partitioned into P − p parts without exceeding B*. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions of cost ≤ B_f within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π_1 = ⟨h1_0, h1_1, …, h1_P⟩, Π_2 = ⟨l2_0, l2_1, …, l2_P⟩, Π_3 = ⟨l3_0, l3_1, …, l3_P⟩, and Π_4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B_f), RL-PROBE(B_f), LR-PROBE(B*), and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B in the range [B*, B_f], there exists a partition Π = ⟨s_0, s_1, …, s_P⟩ of cost C(Π) ≤ B with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p = max{l2_p, l3_p} and SH_p = min{h1_p, h4_p}.

Corollary 4.6. The separator range weights become ΔW_p = 2 min{p, P − p} w_max in the worst case, with a maximum value of P w_max at p = P/2.

Lemma 4.1 and Corollary 4.5 lead to the following theorem, since B* ≤ B_opt ≤ B* + w_max.

Theorem 4.7. For a given CCP problem instance (W, N, P), with SL_p and SH_p index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition Π_opt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the separator range weights produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however; the practical scheme normally finds much better bounds, since the order in the chain usually prevents the worst-case behavior and B_RB < B* + w_max. Experimental results in Section 6 justify this expectation.

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size ΔS_p = SH_p − SL_p + 1 denotes the number of tasks within the pth range [SL_p, SH_p]. Miguet and Pierson [24] propose the model w_i = Θ(w_avg), for i = 1, 2, …, N, to prove that their H1 and H2 heuristics allocate Θ(N/P) tasks to each processor, where w_avg = W_tot/N is the average task weight. This assumption means that the weight of each task is not too far from the average task weight. Using Corollaries 4.2 and 4.5, this model induces ΔS_p = O(P w_max/w_avg). Moreover, this model can be exploited to induce the optimistic bound ΔS_p = O(P). However, we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from w_avg. Here, we establish a looser and more realistic model on task weights, such that for any subchain T_{i,j} with weight W_{i,j} sufficiently larger than w_max, the average task weight within the subchain T_{i,j} is O(w_avg); that is, Δ_{i,j} = j − i + 1 = O(W_{i,j}/w_avg). This model, referred to here as model M, directly induces ΔS_p = O(P w_max/w_avg), since ΔW_p ≤ ΔW_{P/2} = P w_max/2 for p = 1, 2, …, P − 1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits the bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index-bound arrays, each of size P, computed according to Corollary 4.5 with B_f = B_RB. Note that SL_P = SH_P = N, since only B[P, N] needs to be computed in the last row. As seen in Fig. 6, only the B^p_j values for j = SL_p, SL_p + 1, …, SH_p are computed at each row p, by exploiting Corollary 4.5, which ensures the existence of an optimal partition Π_opt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p. Thus, these B^p_j values suffice for the correct computation of the B^{p+1}_i values for i = SL_{p+1}, SL_{p+1} + 1, …, SH_{p+1} at the next row p + 1.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SH_p to SH_p + 1 within the repeat-until loop while computing B^{p+1}_i with SL_{p+1} ≤ i ≤ SH_{p+1}, in two cases. In both cases, the functions W_{j+1,i} and B^p_j intersect in the open interval (SH_p, SH_p + 1), so that B^p_{SH_p} < W_{SH_p+1,i} and B^p_{SH_p+1} ≥ W_{SH_p+2,i}. In the first case, i = SH_p + 1, so that W_{j+1,i} and B^p_j intersect in (i − 1, i), which implies that B^{p+1}_i = W_{i−1,i} with j^{p+1}_i = SL_p, since W_{i−1,i} < B^p_i, as mentioned in Section 3.2. In the second case, i > SH_p + 1, for which Corollary 4.5 guarantees that B^{p+1}_i = W_{SL_p+1,i} ≤ B^p_{SH_p+1}, and thus we can safely select j^{p+1}_i = SL_p. Note that W_{SL_p+1,i} = B^p_{SH_p+1} may correspond to a case leading to another optimal partition with j^{p+1}_i = s_{p+1} = SH_p + 1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ into B^p_{SH_p+1} as a sentinel. Hence, in such cases, the condition W_{SH_p+1,i} < B^p_{SH_p+1} = ∞ in the if-then statement following the repeat-until loop is always true, so that the j-index automatically moves back to SH_p. The scheme of actually computing B^p_{SH_p+1} for each row p, which seems to be a natural solution, does not work, since the correct computation of B^{p+1}_{SH_{p+1}+1} may necessitate more than one B^p_j value beyond the SH_p index bound.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a P × N matrix to store the minimum j^p_i index values defining the B^p_i values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space to one containing at least one optimal solution. For this purpose, the index bounds should instead be computed according to Lemma 4.3, since the search space restricted by Lemma 4.3 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔS_p). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔS_p = O(P w_max/w_avg), and hence the complexity is O(N + P log N + P² w_max/w_avg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition w_max = O(2 W_tot/P²).
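For reference, the unrestricted dynamic program that DP+ accelerates can be sketched as below; the paper's DP+ additionally restricts the inner indices of each row p to [SL_p, SH_p] and avoids explicit range checks. This naive sketch uses our own names and makes no attempt at those optimizations.

```python
def dp_partition(w, P):
    """Textbook dynamic program for chains-on-chains partitioning.
    B[p][i] is the optimal bottleneck value for partitioning the first
    i tasks over p processors:
        B[p][i] = min_j max(B[p-1][j], W_{j+1,i}).
    Runs in O(N^2 P); shown only as the baseline that DP+ improves."""
    N = len(w)
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    INF = float("inf")
    B = [[INF] * (N + 1) for _ in range(P + 1)]
    B[0][0] = 0
    for p in range(1, P + 1):
        for i in range(N + 1):
            for j in range(i + 1):     # last processor takes tasks j+1..i
                cand = max(B[p - 1][j], prefix[i] - prefix[j])
                if cand < B[p][i]:
                    B[p][i] = cand
    return B[P][N]
```

Storing, for each (p, i), the j that attains the minimum recovers the separators by backtracking, matching the j^p_i bookkeeping described above.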

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for small-to-medium numbers of processors. The main difference between the MS and bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value it encounters becomes the optimal value.

4.3.1. Improving the MS algorithm

Fig. 7. Bidding algorithm.

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist-partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as the initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Π_t = ⟨s_0, s_1, …, s_P⟩ constructed by PROBE(B_t) for an infeasible B_t. After detecting the infeasibility of this B_t value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > B_t will never be to the left of the respective separator indices of Π_t. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current B_t value (i.e., L_P > B_t). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P}. Here, we call the L_p + w_{s_p+1} value the bid of processor P_p, which refers to the load of P_p if the first task t_{s_p+1} of the next processor is augmented to P_p. The bid of the last processor P_P is equal to the load of the remaining tasks. If the smallest bid B comes from processor P_b, probing with the new B is performed only for the remaining processors ⟨P_b, P_{b+1}, …, P_P⟩, on the suffix W_{s_{b−1}+1,N} of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while loop implements a linear probing scheme, such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index s_p is set for processor P_p during linear probing, the repeat-until loop terminates if it is not possible to partition the remaining subchain T_{s_p+1,N} over the P − p remaining processors without exceeding the current B value, i.e., rbid = L_r/(P − p) > B, where L_r denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value among the first p processors and the index of the defining processor, respectively. BIDS[0] ensures the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P w_max + P²(w_max/w_avg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial setting of the separators through binary search. The B value is increased at most B_opt − B* < w_max times, and each time the next B value can be computed in O(P) time, which induces the O(P w_max) cost. The total area scanned by the separators is at most O(P²(w_max/w_avg)). For noninteger task weights, the complexity can reach O(P log N + P³(w_max/w_avg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
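The bidding idea can be sketched as follows. This simplified version re-probes from scratch at each new B, whereas the paper's Fig. 7 resumes probing from the defining processor with linear probing; the selection rule for the next B value is the same. Names are ours.

```python
import bisect

def bidding(w, P):
    """Start from the ideal value B* = W_tot/P; on each infeasible probe,
    raise B to the smallest bid min{min_p {L_p + w_{s_p+1}}, L_P}.
    The first feasible B encountered is the optimal bottleneck value."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    N = len(w)
    B = prefix[N] / P                    # ideal bottleneck value B*
    while True:
        s, loads, bids = 0, [], []
        for _ in range(P - 1):
            nxt = max(s, bisect.bisect_right(prefix, prefix[s] + B) - 1)
            loads.append(prefix[nxt] - prefix[s])
            if nxt < N:                  # bid: load after taking one more task
                bids.append(prefix[nxt + 1] - prefix[s])
            s = nxt
        last = prefix[N] - prefix[s]     # load of the last processor
        if last <= B:
            return max(loads + [last])   # feasible: this value is optimal
        B = min(bids + [last])           # next candidate bottleneck value
```

Because every new B is the smallest realizable value above the current one, no feasible bottleneck value can be skipped.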


4.4. Parametric search algorithms

In this work, we apply the theoretical findings of Section 4.1 to obtain an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with B_f = B_RB) to restrict the search space for the s_p separator values during the binary searches on the W array. That is, BINSRCH(W, SL_p, SH_p, Bsum) in RPROBE searches W in the index range [SL_p, SH_p] to find the index SL_p ≤ s_p ≤ SH_p such that W[s_p] ≤ Bsum and W[s_p + 1] > Bsum. This scheme, together with Corollaries 4.2 and 4.6, reduces the complexity of an individual probe to Σ_{p=1}^{P} Θ(log ΔS_p) = O(P log P + P log(w_max/w_avg)). Note that this complexity reduces to O(P log P) for sufficiently large P, namely P = Ω(w_max/w_avg). Figs. 8–10 illustrate the RPROBE algorithms tailored to the respective parametric-search algorithms.

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, B_RB], as opposed to [B*, W_tot]. In this algorithm, if PROBE(B_t) = TRUE, then the search space is restricted to the values B ≤ B_t, and if PROBE(B_t) = FALSE, then the search space is restricted to the values B > B_t. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator-index bounds depending on the success or failure of the probes. Let Π_t = ⟨t_0, t_1, …, t_P⟩ be the partition constructed by PROBE(B_t). Any future PROBE(B) call with B ≤ B_t will set the s_p indices with s_p ≤ t_p; thus, the search space for s_p can be restricted to the indices ≤ t_p. Similarly, any future PROBE(B) call with B ≥ B_t will set the s_p indices ≥ t_p; thus, the search space for s_p can be restricted to the indices ≥ t_p.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for loop over the SL or SH arrays, depending on the failure or success, respectively, of RPROBE(B_t). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π_t or SH ← Π_t, depending on the failure or success of RPROBE(B_t).


Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, B_RB < B* + w_max. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log w_max, and thus the overall complexity is O(N + P log N + log(w_max)(P log P + P log(w_max/w_avg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to an exact algorithm for general workload arrays by cleverly updating the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value, i.e., the total weight of some subchain of W. This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps, finding the optimal bottleneck value.

After a probe RPROBE(B_t), the current upper bound value UB is modified if RPROBE(B_t) succeeds. Note that RPROBE(B_t) not only determines the feasibility of B_t, but also constructs a partition Π_t with cost(Π_t) ≤ B_t. Instead of reducing the upper bound UB to B_t, we can further reduce UB to the bottleneck value B = cost(Π_t) ≤ B_t of the partition Π_t constructed by RPROBE(B_t). Similarly, the current lower bound LB is modified when RPROBE(B_t) fails. In this case, instead of increasing the bound LB to B_t, we can exploit the partition Π_t constructed by RPROBE(B_t) to increase LB further, to the smallest realizable bottleneck value B greater than B_t. Our bidding algorithm already describes how to compute

B = min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P},

where L_p denotes the load of processor P_p in Π_t. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of


the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(w_max/w_avg)),

which has the solution O(P log P log N + P log N log(w_max/w_avg)). Here, O(P log P + P log(w_max/w_avg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(w_max/w_avg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
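The two bound-tightening moves described above (UB snapped down to cost(Π_t) on success, LB snapped up to the smallest bid on failure) can be sketched as below. This is a simplified version of Fig. 9, without the separator-index restriction of RPROBE; names are ours.

```python
import bisect

def exact_bisect(w, P):
    """Exact bisection: both bounds always move to realizable bottleneck
    values (subchain weights), so the loop terminates with LB = UB = opt."""
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    N = len(w)

    def probe(B):
        """Greedy probe; returns (feasible, cost, next_realizable_B)."""
        s, cost, bids = 0, 0.0, []
        for _ in range(P - 1):
            nxt = max(s, bisect.bisect_right(prefix, prefix[s] + B) - 1)
            cost = max(cost, prefix[nxt] - prefix[s])
            if nxt < N:
                bids.append(prefix[nxt + 1] - prefix[s])
            s = nxt
        last = prefix[N] - prefix[s]
        cost = max(cost, last)
        return cost <= B, cost, min(bids + [last])

    lb, ub = prefix[N] / P, float(prefix[N])
    while lb < ub:
        ok, cost, nxt = probe((lb + ub) / 2)
        if ok:
            ub = cost            # realizable cost of the built partition
        else:
            lb = nxt             # smallest realizable value above the probe
    return ub
```

Each step either lowers UB to a strictly smaller realizable value or raises LB past the midpoint, so the finite candidate set shrinks at every iteration.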

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = W_tot), the successes and failures of these probes narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π_t = ⟨0, t_1, t_2, …, t_{P−1}, N⟩ to be constructed by any PROBE(B_t) with LB_1 < B_t = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < B_t < W_{1,i_1}. Hence, probes with LB_1 < B_t < UB_1 for processor P_2 can be restricted to be performed on W_{i_1,N}, where W_{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T^{P_p}_i denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T_{i,N} = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P_{p,P} = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T_{i_1,N}; that is, B_opt = min{B_1 = W_{1,i_1}, C(Π^{P_1}_{i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, we have B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(Π^{P_p}_{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for loop, given i_{b−1}, i_b is found in the inner while loop by conducting probes on W_{i_{b−1},N} to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways, according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j}, for i_{b−1} < j ≤ N, is restricted to the W_{i_{b−1}+1,j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + w_max(P log P + P log(w_max/w_avg))) for integer task weights, because the number of probes cannot exceed w_max, since there are at most w_max distinct bound values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + w_max(P log P)² + w_max P² log P log(w_max/w_avg)), since the algorithm makes O(w_max P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A_p^r of A and is responsible for computing y_p = A_p^r x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A_p^c of A and is responsible for computing y^p = A_p^c x, where y = sum_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the


atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load-balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load-balancing problem can be described as the number partitioning problem, which is NP-hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements due to the irregular mapping table and an extra level of indirection in locating the distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = W_tot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{i,j} can be efficiently computed as W_{i,j} = ROW[j+1] - ROW[i] in O(1) time without any preprocessing overhead. CCS is similar to CRS with rows and columns interchanged; thus, W_{i,j} is computed as W_{i,j} = COL[j+1] - COL[i].
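Since the CRS row-pointer array is already the prefix sum of the row loads, the subchain-weight query reduces to a single subtraction. A minimal sketch, using 0-based indexing (our convention; the array name follows the text):

```c
#include <assert.h>

/* W_{i,j}: total number of nonzeros in rows i..j, taken directly from the
 * CRS row-pointer array ROW of length N+1, where ROW[k] is the offset of
 * row k's first nonzero and ROW[N] = NZ.  No separate prefix-sum pass over
 * a workload array is needed. */
static long subchain_weight_crs(const long ROW[], long i, long j)
{
    return ROW[j + 1] - ROW[i];
}
```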

5.1.1. A better load balancing model for iterative solvers

The load-balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{i,j} as W_{i,j} = COL[j+1] - COL[i] + 3(j - i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.
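The augmented weight keeps the O(1) query. A minimal sketch, where COLPTR stands for the CCS column-pointer array that the text (reusing the CRS name) calls COL:

```c
#include <assert.h>

/* Augmented subchain weight for columnwise striping in a CG solver:
 * column k costs (nonzeros in column k) + 3, one unit per DAXPY update,
 * so W_{i,j} = COLPTR[j+1] - COLPTR[i] + 3*(j - i + 1), still O(1). */
static long subchain_weight_cs_cg(const long COLPTR[], long i, long j)
{
    return COLPTR[j + 1] - COLPTR[i] + 3 * (j - i + 1);
}
```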

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for the visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first


parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, the screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage of processors generating complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workloads, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. Under this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those incurred by counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is real-valued because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.
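The two ingredients of this scheme can be sketched as follows, under our simplified interfaces (not the paper's implementation): the inverse-area tally spreads a primitive's unit weight over the k cells it intersects, and a standard Hilbert-index function (the classic bit-manipulation formulation, assumed here) linearizes an n × n coarse mesh, n a power of two, into the 1D task chain.

```c
#include <assert.h>

/* Inverse-area heuristic: a primitive overlapping k cells adds 1/k to each,
 * so region weights approximate primitive counts without multiple counting. */
static void tally_primitive(double weight[], const int cells[], int k)
{
    for (int c = 0; c < k; c++)
        weight[cells[c]] += 1.0 / k;
}

/* Hilbert index of cell (x, y) on an n x n mesh (n a power of two).
 * Sorting cells by this index yields the 1D chain of mesh cells. */
static int hilbert_d(int n, int x, int y)
{
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0, ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        if (ry == 0) {               /* rotate the quadrant */
            if (rx == 1) { x = s - 1 - x; y = s - 1 - y; }
            int t = x; x = y; y = t;
        }
    }
    return d;
}
```

Compressing away the zero-weight cells of the resulting chain then gives exactly the real-valued 1D workload arrays used in the experiments below.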

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the w_avg, w_min, and w_max columns display the average, minimum, and maximum task weights. The dataset for sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, which represent the results of computational fluid dynamics simulations and are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. The triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 × 256, 512 × 512, and 1024 × 1024 superimposed on a screen of resolution 1024 × 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because the zero-weight tasks are compressed to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N - P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2

Properties of the test set

Sparse-matrix dataset:

Name     No. of    Workload (no. of nonzeros)      SpMxV
         tasks N   Total Wtot  Per row/col (task)  time (ms)
                               wavg   wmin  wmax

NL         7039    105 089     14.93  1     361     22.55
cre-d      8926    372 266     41.71  1     845     72.20
CQ9        9278    221 590     23.88  1     702     45.90
ken-11    14 694    82 454      5.61  2     243     19.65
mod2      34 774   604 910     17.40  1     941    124.05
world     34 506   582 064     16.87  1     972    119.45

Direct volume rendering (DVR) dataset:

Name       No. of     Workload
           tasks N    Total Wtot  Per task
                                  wavg   wmin   wmax

blunt256    17 303    303 K       17.54  0.020  1590.97
blunt512    93 231    314 K        3.36  0.004   661.50
blunt1024  372 824    352 K        0.94  0.001   411.04
post256     19 653    495 K       25.19  0.077  3245.50
post512    134 950    569 K        4.22  0.015  1092.00
post1024   539 994    802 K        1.49  0.004  1546.78


Figs. 5(a) and (b), respectively. Abbreviations ending with "+" represent our improved versions of these algorithms; that is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7, and EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms on the SpM dataset by setting ε = 1 for the integer-valued workload arrays in that dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms for its real-valued task weights.

Table 3 compares the load-balancing quality of the heuristics and the exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B - B*)/B*, where B denotes the bottleneck value of the respective partition and B* = W_tot/P denotes the ideal bottleneck value. The OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained for both datasets by using exact algorithms instead of heuristics. The quality gap between the exact algorithms and the heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges, normalized with respect to N, i.e., sum_{p=1}^{P-1} ΔS_p / N, where ΔS_p = SH_p - SL_p + 1. For each CCP instance, (N - P)P represents the size of the search space for the separator indices as well as the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N - P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for the separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of the decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index-bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P <= 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index-bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = w_min and then improving the resulting partition using the BID algorithm. The column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns for the SpM dataset shows that using B^RB instead of W_tot


Table 3

Percent load-imbalance values

Sparse-matrix dataset                          DVR dataset

Name    P      H1     H2     RB     OPT        Name       P      H1     H2     RB     OPT

NL      16      2.60   2.44   1.20   0.35      blunt256   16      4.60   1.20   0.49   0.34
        32      5.02   5.75   3.44   0.95                 32      6.93   2.61   1.94   1.12
        64      8.95   9.01   5.60   2.37                 64     14.52   9.44   9.44   2.31
        128    33.13  27.16  22.78   4.99                 128    38.25  24.39  16.67   4.82
        256    69.55  69.55  60.78  14.25                 256    96.72  37.03  34.21  34.21
cre-d   16      2.27   0.98   0.53   0.45      blunt512   16      0.95   0.98   0.98   0.16
        32      4.19   4.42   3.74   1.03                 32      1.38   1.38   1.18   0.33
        64      7.12   4.92   4.34   1.73                 64      2.87   2.69   1.66   0.53
        128    25.57  18.73  16.70   2.88                 128     5.62   8.45   4.62   0.97
        256    37.54  26.81  35.20  10.95                 256    14.18  14.33   9.34   2.28
CQ9     16      1.85   1.85   0.58   0.58      blunt1024  16      0.94   0.57   0.95   0.10
        32      5.65   2.88   2.24   0.90                 32      1.89   1.21   0.97   0.14
        64     13.25  11.49   7.64   1.43                 64      4.99   2.16   1.44   0.26
        128    33.96  32.22  22.34   3.51                 128    10.06   4.25   2.47   0.57
        256    58.62  58.62  58.62  14.72                 256    19.65   9.68   9.68   0.98
ken-11  16      3.74   2.01   0.98   0.21      post256    16      1.10   1.43   0.76   0.56
        32      3.74   4.67   3.74   1.18                 32      3.23   3.98   3.23   1.11
        64     13.17  13.17  13.17   1.29                 64     17.28  11.04  10.90   3.10
        128    13.17  16.89  13.17   6.80                 128    45.35  29.09  29.09   8.29
        256    50.99  50.99  50.99   7.11                 256    67.86  67.86  67.86  67.86
mod2    16      0.06   0.06   0.06   0.03      post512    16      0.49   1.25   0.33   0.33
        32      0.19   0.19   0.19   0.07                 32      0.94   1.61   0.90   0.58
        64      7.42   2.72   2.18   0.18                 64      4.85   5.33   4.85   0.94
        128    16.15   6.29   2.46   0.41                 128    18.03  14.55  10.15   1.72
        256    19.47  19.47  18.92   1.23                 256    30.03  25.29  25.29   3.73
world   16      0.27   0.18   0.09   0.04      post1024   16      0.70   0.70   0.53   0.20
        32      1.03   0.37   0.27   0.08                 32      1.49   1.49   1.41   0.54
        64      4.73   4.73   4.73   0.28                 64      2.85   2.85   1.49   0.91
        128     6.37   6.37   6.37   0.76                 128    13.15  11.79   9.10   1.11
        256    27.99  27.41  27.41   1.11                 256    40.50  13.37  14.19   2.54

Averages over P

        16      1.76   1.15   0.74   0.36                 16      1.44   1.01   1.04   0.28
        32      3.99   3.39   2.88   0.76                 32      2.64   2.05   1.61   0.64
        64      9.01   6.78   5.69   1.43                 64      7.89   5.58   4.96   1.31
        128    19.97  17.66  14.09   3.36                 128    21.74  15.42  12.02   2.91
        256    43.89  38.91  36.04   9.18                 256    44.82  27.93  26.76  18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only a minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, the actual execution times (in milliseconds) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification so that the improvements can be seen easily. In both tables, BID is listed separately, since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as


Table 4

Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset                                DVR dataset

Name    P     (N-P)P    L1      L3      C5           Name       P     (N-P)P    L1      L3      C5

NL      16     15.97     0.32    0.13    0.036       blunt256   16     15.99     0.38    0.07    0.003
        32     31.86     1.23    0.58    0.198                  32     31.94     1.22    0.30    0.144
        64     63.43     4.97    2.24    0.964                  64     63.77     5.10    3.81    0.791
        128   125.69    19.81   19.60    4.305                  128   127.06    21.66   12.88    2.784
        256   246.73    77.96   82.83   22.942                  256   252.23   101.68   50.48   10.098
cre-d   16     15.97     0.16    0.04    0.019       blunt512   16     16.00     0.09    0.17    0.006
        32     31.89     0.76    0.90    0.137                  32     31.99     0.35    0.26    0.043
        64     63.55     2.89    1.85    0.610                  64     63.96     1.64    0.69    0.144
        128   126.18    11.59   15.12    2.833                  128   127.83     6.65    4.59    0.571
        256   248.69    46.66   57.54   16.510                  256   255.30    28.05   16.71    2.262
CQ9     16     15.97     0.30    0.03    0.023       blunt1024  16     16.00     0.05    0.09    0.010
        32     31.89     1.21    0.34    0.187                  32     32.00     0.18    0.18    0.018
        64     63.57     4.84    2.96    0.737                  64     63.99     0.77    0.74    0.068
        128   126.25    19.20   18.66    3.121                  128   127.96     3.43    2.53    0.300
        256   248.96    74.84   86.80   23.432                  256   255.82    13.92   20.36    1.648
ken-11  16     15.98     0.28    0.11    0.024       post256    16     15.99     0.35    0.04    0.021
        32     31.93     1.10    1.13    0.262                  32     31.95     1.50    0.52    0.162
        64     63.73     4.42    7.34    0.655                  64     63.79     6.12    4.26    0.932
        128   126.89    17.60   14.56    4.865                  128   127.17    26.48   23.68    4.279
        256   251.56    69.18   85.44   13.076                  256   252.68   124.76   90.48   16.485
mod2    16     15.99     0.16    0.00    0.002       post512    16     16.00     0.13    0.02    0.003
        32     31.97     0.55    0.04    0.013                  32     31.99     0.56    0.14    0.063
        64     63.88     2.14    1.22    0.109                  64     63.97     2.33    2.41    0.413
        128   127.53     8.62    2.59    0.411                  128   127.88     9.33   10.15    1.714
        256   254.12    34.49   39.02    2.938                  256   255.52    37.08   46.28    6.515
world   16     15.99     0.15    0.01    0.004       post1024   16     16.00     0.17    0.07    0.020
        32     31.97     0.59    0.06    0.020                  32     32.00     0.54    0.27    0.087
        64     63.88     2.30    2.81    0.158                  64     63.99     2.19    0.61    0.210
        128   127.53     9.23    7.19    0.850                  128   127.97     8.77    9.48    0.918
        256   254.11    36.97   53.15    2.635                  256   255.88    34.97   27.28    4.479

Averages over P

        16     15.97     0.21    0.06    0.018                  16     16.00     0.20    0.08    0.011
        32     31.88     0.86    0.68    0.136                  32     31.98     0.73    0.28    0.096
        64     63.52     3.43    2.63    0.516                  64     63.91     3.03    2.09    0.426
        128   126.07    13.68   12.74    2.731                  128   127.65    12.72   10.55    1.428
        256   248.26    54.28   57.55   14.089                  256   254.57    56.74   41.93    6.915

L1, L3, and C5 denote the separator-index ranges obtained according to Lemmas 4.1 and 4.3 and Corollary 4.5.


an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and that εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-refinement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5

Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset                               DVR dataset

Name    P     NC-   NC   NC+   εBS   εBS+  EBS      Name       P     NC-   NC    NC+   εBS+&BID  EBS

NL      16     177   21    7    17     6    7       blunt256   16     202   18     6    16 + 1     9
        32     352   19    7    17     7    7                  32     407   19     7    19 + 1    10
        64     720   27    5    16     7    6                  64     835   18    10    21 + 1    12
        128   1450   40   14    17     8    8                  128   1686   25    15    22 + 1    13
        256   2824   51   14    16     8    8                  256   3556  203    62    23 + 1    11
cre-d   16     190   20    6    19     7    5       blunt512   16     245   20     8    17 + 1     9
        32     393   25   11    19     9    8                  32     502   24    13    18 + 1    10
        64     790   24   10    19     8    6                  64    1003   23    11    18 + 1    12
        128   1579   27   10    19     9    9                  128   2023   26    12    20 + 1    13
        256   3183   37   13    18     9    9                  256   4051   23    12    21 + 1    13
CQ9     16     182   19    3    18     6    5       blunt1024  16     271   21    10    16 + 1    13
        32     373   22    8    18     8    7                  32     557   24    12    18 + 1    12
        64     740   29   12    18     8    8                  64    1127   21    11    18 + 1    12
        128   1492   40   12    17     8    8                  128   2276   28    13    19 + 1    13
        256   2971   50   14    18     9    9                  256   4564   27    16    21 + 1    15
ken-11  16     185   21    6    17     6    6       post256    16     194   22     9    18 + 1     7
        32     364   20    6    17     6    6                  32     409   19     8    20 + 1    12
        64     721   48    9    17     7    7                  64     823   17    11    22 + 1    12
        128   1402   61    8    16     7    7                  128   1653   19    13    22 + 1    14
        256   2783   96    7    17     8    8                  256   3345  191   102    23 + 1    13
mod2    16     210   20    6    20     5    4       post512    16     235   24     9    16 + 1    10
        32     432   26    8    19     5    5                  32     485   23    12    18 + 1    11
        64     867   24    6    19     8    8                  64     975   22    13    20 + 1    14
        128   1727   38    7    20     7    7                  128   1975   25    17    21 + 1    14
        256   3444   47    9    20     9    9                  256   3947   29    20    22 + 1    15
world   16     211   22    6    19     5    4       post1024   16     261   27    10    17 + 1    13
        32     424   26    5    19     6    6                  32     538   23    13    19 + 1    14
        64     865   30   12    19     9    9                  64    1090   22    12    17 + 1    14
        128   1730   44   10    19     8    8                  128   2201   25    15    21 + 1    16
        256   3441   44   11    19    10   10                  256   4428   28    16    22 + 1    16

Averages over P

        16     193  20.5  5.7  18.5   5.8  5.2                 16     235  22.0   8.7   16.7 + 1  10.2
        32     390  23.0  7.5  18.3   6.8  6.5                 32     483  22.0  10.8   18.7 + 1  11.5
        64     784  30.3  9.0  18.0   7.8  7.3                 64     976  20.5  11.3   19.3 + 1  12.7
        128   1564  41.7 10.2  18.0   7.8  7.8                 128   1969  24.7  14.2   20.8 + 1  13.8
        256   3108  54.2 11.3  18.0   8.8  8.8                 256   3982  83.5  38.0   22.0 + 1  13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6

Partitioning times for sparse-matrix dataset as percents of SpMxV times

CCP instance    Heuristics          Exact algorithms

Name      P     H1    H2    RB      Existing                      Proposed
                                    DP    MS    εBS    NC         DP+   MS+   εBS+   NC+   BID

NL 16 009 009 009 93 119 093 120 040 044 040 040 022

32 018 018 013 177 194 177 217 173 173 080 080 044

64 035 035 031 367 307 315 585 581 412 186 177 182

128 089 084 071 748 485 630 1712 1987 2692 426 732 421

256 151 155 137 1461 757 1109 4031 9180 9641 980 1570 2106

cre-d 16 003 001 001 33 36 033 037 012 006 010 011 003

32 004 006 006 71 61 065 093 047 120 029 033 010

64 012 012 008 147 98 122 169 166 183 061 071 021

128 028 029 014 283 150 241 359 547 1094 168 168 061

256 053 054 025 571 221 420 1022 2705 2602 330 446 120

CQ9 16 004 004 004 59 73 048 059 020 013 015 013 017

32 009 009 007 127 120 094 131 094 072 041 041 028

64 020 017 015 248 195 185 318 301 447 100 144 068

128 044 044 031 485 303 318 852 965 1882 244 305 251

256 092 092 063 982 469 651 2033 6956 722 532 745 1492

ken-11 16 010 010 010 238 344 127 188 056 061 046 041 025

32 020 020 015 501 522 229 310 382 478 122 117 163

64 046 046 036 993 778 448 1308 941 2478 310 346 229

128 117 117 071 1876 1139 921 3959 5013 5003 641 662 174

256 214 214 142 3827 1706 1776 11913 12901 24148 1491 1450 2911

mod2 16 002 002 001 92 119 027 034 007 005 006 006 003

32 004 004 002 188 192 051 078 019 018 016 015 007

64 011 010 006 378 307 104 191 086 218 051 055 023

128 024 023 011 751 482 228 592 261 372 106 123 053

256 048 048 023 1486 756 426 1114 1211 5004 291 343 366

world 16 002 002 002 99 122 029 033 008 005 007 008 003

32 004 004 004 198 196 059 088 026 021 018 019 008

64 012 011 009 385 306 116 170 109 484 064 054 035

128 024 023 019 769 496 219 532 448 1015 131 123 177

256 047 047 036 1513 764 406 1183 1154 6999 346 350 248

Averages over P

16 005 005 005 102 136 060 078 024 022 021 020 012

32 010 010 008 210 214 112 153 123 147 051 051 043

64 023 022 018 420 332 215 457 364 704 129 141 093

128 054 053 036 819 509 426 1334 1537 201 286 352 451

256 101 101 071 1640 779 798 3549 5684 9269 662 817 1207


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. The overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small-to-moderate P values (i.e., P <= 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P >= 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P <= 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for the 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7

Partitioning times (in msecs) for DVR dataset

CCP instance         Prefix-   Heuristics        Exact algorithms

Name      P          sum       H1    H2    RB    Existing             Proposed
                                                 DP    MS    NC       DP+   MS+   NC+   EBS   BID

blunt256 16 002 002 003 68 49 036 021 053 010 014 003

32 005 005 005 141 77 077 076 076 021 027 024

64 195 011 012 009 286 134 139 386 872 064 071 078

128 024 025 020 581 206 366 1237 2995 178 184 233

256 047 050 037 1139 296 5253 4289 078 1396 311 5669

blunt512 16 003 003 003 356 200 055 025 091 014 016 009

32 007 007 007 792 353 131 123 213 044 043 022

64 1345 018 019 014 1688 593 265 368 928 094 110 045

128 039 040 028 3469 979 584 1501 4672 229 263 165

256 074 075 054 7040 1637 970 5750 16449 559 595 1072

blunt1024 16 003 003 003 1455 780 075 093 477 025 028 019

32 012 011 008 3251 1432 181 202 892 060 062 031

64 5912 027 027 019 6976 2353 350 735 2228 127 137 139

128 050 051 037 14 337 3911 914 3301 6938 336 356 1621

256 097 100 068 29 417 6567 1659 18009 53810 848 898 1629

post256 16 002 003 003 79 61 046 018 013 010 011 024

32 005 005 005 157 100 077 105 174 026 031 076

64 216 010 011 010 336 167 137 526 972 061 070 467

128 025 026 018 672 278 290 2294 5510 165 177 2396

256 047 048 037 1371 338 4516 8478 077 1304 362 29932

post512 16 002 002 003 612 454 063 020 044 014 016 121

32 007 007 007 1254 699 130 249 195 038 042 336

64 2055 018 018 013 2559 1139 258 1693 3873 097 107 853

128 038 039 028 5221 1936 584 6825 15615 227 244 3154

256 070 073 053 10 683 3354 1267 25648 72654 544 539 14088

post1024 16 003 004 003 2446 1863 091 288 573 017 021 263

32 009 008 008 5056 2917 161 1357 1737 051 058 1273

64 6909 025 026 017 10 340 4626 356 3505 3325 135 147 3253

128 051 053 036 21 102 7838 790 15634 54692 295 328 7696

256 095 098 067 43 652 13770 1630 76459 145187 755 702 30070

Averages normalized with respect to RB times (prefix-sum time not included)

16 083 094 100 27 863 18919 2033 2583 6950 500 589 2439

32 110 106 100 23 169 12156 1847 4737 7282 583 646 3902

64 130 135 100 22 636 9291 1788 8281 14519 700 781 5381

128 135 139 100 22 506 7553 2046 16836 48119 860 931 8681

256 135 139 100 24 731 6880 5910 39025 77299 1955 1051 28677

Averages normalized wrt RB times ( prefix-sum time included )

16 100 100 100 32 23 108 104 109 101 102 103

32 100 100 100 66 36 115 121 129 104 105 113

64 100 100 100 135 58 127 197 290 111 112 155

128 101 101 100 275 94 162 475 1053 128 130 325

256 102 102 100 528 142 798 1464 1371 295 155 2673

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for parametric-search algorithms to narrow separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating lower and upper bounds into realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurc, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


has only one task and the last processor contains the remaining tasks. The MS algorithm moves each separator index at most N − P times, so that the total number of separator-index moves is O(P(N − P)). A max-heap is maintained for the processor loads to find a bottleneck processor at the beginning of each repeat-loop iteration. The cost of each separator-index move is O(log P), since it necessitates one decrease-key and one increase-key operation. Thus the complexity of the MS algorithm is O(P(N − P) log P).

3.4. Parametric search

The parametric-search approach relies on repeated probing for a partition P with a bottleneck value no greater than a given B value. Probe algorithms exploit the greedy-choice property for existence and construction of P. The greedy choice here is to minimize the remaining work after loading processor P_p, subject to L_p ≤ B, for p = 1, …, P − 1, in order. The PROBE(B) functions given in Fig. 3 exploit this greedy property as follows. PROBE finds the largest index s_1 so that W_{1,s_1} ≤ B and assigns subchain T_{1,s_1} to processor P_1 with load L_1 = W_{1,s_1}. Hence, the first task in the second processor is t_{s_1+1}. PROBE then similarly finds the largest index s_2 so that W_{s_1+1,s_2} ≤ B and assigns the subchain T_{s_1+1,s_2} to processor P_2. This process continues until either all tasks are assigned or all processors are exhausted. The former and latter cases indicate feasibility and infeasibility of B, respectively.

Fig. 3(a) illustrates the standard probe algorithm. The indices s_1, s_2, …, s_{P−1} are efficiently found through binary search (BINSRCH) on the prefix-summed W array. In this figure, BINSRCH(W, i, N, Bsum) searches W in the index range [i, N] to compute the index i ≤ j ≤ N such that W[j] ≤ Bsum and W[j + 1] > Bsum. The complexity of the standard probe algorithm is O(P log N). Han et al. [12] proposed an O(P log(N/P))-time probe algorithm (see Fig. 3(b)) exploiting P repeated binary searches on the same W array with increasing search values. Their algorithm divides the chain into P subchains of equal length. At each probe, a linear search is performed on the weights of the last tasks of these P subchains to find out in which subchain the search value could be, and then binary search is performed on the respective subchain of length N/P. Note that since the probe search values always increase, the linear search can be performed incrementally, that is, the search continues to the right from the last subchain that was searched, with O(P) total cost. This gives a total cost of O(P log(N/P)) for P binary searches, and thus for the probe function.

Fig. 2. Iterative refinement algorithm proposed by Manne and Sørevik [23].

Fig. 3. (a) Standard probe algorithm with O(P log N) complexity. (b) O(P log(N/P))-time probe algorithm proposed by Han et al. [12].
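As an illustration of the greedy probe, the following is a sketch in Python (not the authors' C implementation); `W` is assumed to be the prefix-summed weight array with `W[0] = 0`, and `bisect_right` plays the role of BINSRCH:

```python
from bisect import bisect_right

def probe(W, P, B):
    """Greedy feasibility test: can the chain be split into P consecutive
    parts, each of load at most B?  W is the prefix-sum array of the task
    weights, with W[0] = 0 and W[N] = Wtot."""
    N = len(W) - 1
    s = 0  # separator: tasks 1..s are already assigned
    for _ in range(P):
        # largest index j with W[j] - W[s] <= B (binary search on prefix sums)
        s = bisect_right(W, W[s] + B) - 1
        if s == N:          # all tasks assigned: B is feasible
            return True
    return False            # processors exhausted: B is infeasible
```

For example, for task weights (2, 3, 1, 4) and P = 2, `probe([0, 2, 5, 6, 10], 2, 5)` succeeds while `probe([0, 2, 5, 6, 10], 2, 4)` fails, so the optimal bottleneck value is 5.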


3.4.1. Bisection as an approximation algorithm

Let f(B) be the binary-valued function where f(B) = 1 if PROBE(B) is true and f(B) = 0 if PROBE(B) is false. Clearly, f(B) is nondecreasing in B, and Bopt lies between LB = B* = Wtot/P and UB = Wtot. These observations are exploited in the bisection algorithm, leading to an efficient ε-approximate algorithm, where ε is the desired precision. The interval [Wtot/P, Wtot] is conceptually discretized into (Wtot − Wtot/P)/ε bottleneck values, and binary search is used in this range to find the minimum feasible bottleneck value Bopt. The bisection algorithm, as illustrated in Fig. 4, performs O(log(Wtot/ε)) PROBE calls, and each PROBE call costs O(P log(N/P)). Hence, the bisection algorithm runs in O(N + P log(N/P) log(Wtot/ε)) time, where the O(N) cost comes from the initial prefix-sum operation on W. The performance of this algorithm deteriorates when log(Wtot/ε) is comparable with N.
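The bisection scheme can be sketched as follows (again an illustrative Python sketch, with the greedy probe redefined locally so the block is self-contained; `eps` stands for the desired precision ε):

```python
from bisect import bisect_right

def probe(W, P, B):
    # greedy left-to-right probe on the prefix-sum array W (W[0] = 0)
    N, s = len(W) - 1, 0
    for _ in range(P):
        s = bisect_right(W, W[s] + B) - 1
        if s == N:
            return True
    return False

def bisect_ccp(weights, P, eps=1e-7):
    """epsilon-approximate bisection on [Wtot/P, Wtot]: halve the
    undetermined interval until it is narrower than eps."""
    W = [0]
    for w in weights:
        W.append(W[-1] + w)
    lo, hi = W[-1] / P, float(W[-1])   # LB = B* = Wtot/P, UB = Wtot
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(W, P, mid):
            hi = mid                   # feasible: Bopt <= mid
        else:
            lo = mid                   # infeasible: Bopt > mid
    return hi                          # feasible value within eps of Bopt
```

Note that `hi` is always a value that has been verified feasible (Wtot trivially so), which is what makes the returned value ε-close to Bopt.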

3.4.2. Nicol's algorithm

Nicol's algorithm [27] exploits the fact that any candidate B value is equal to the weight W_{i,j} of some subchain. A naive solution is to generate all subchain weights of the form W_{i,j}, sort them, and then use binary search to find the minimum W_{a,b} value for which PROBE(W_{a,b}) = TRUE. Nicol's algorithm efficiently searches for the earliest range W_{a,b} for which Bopt = W_{a,b}, by considering each processor in order as a candidate bottleneck processor in an optimal mapping. Let Popt be the optimal mapping constructed by greedy PROBE(Bopt), and let processor P_b be the first bottleneck processor, with load Bopt, in Popt = ⟨s_0, s_1, …, s_b, …, s_P⟩. Under these assumptions, this greedy construction of Popt ensures that each processor P_p preceding P_b is loaded as much as possible, with L_p < Bopt, for p = 1, 2, …, b − 1 in Popt. Here, PROBE(L_p) = FALSE since L_p < Bopt, and PROBE(L_p + w_{s_p+1}) = TRUE since adding one more task to processor P_p increases its load to L_p + w_{s_p+1} ≥ Bopt. Hence, if b = 1, then s_1 is equal to the smallest index i_1 such that PROBE(W_{1,i_1}) = TRUE, and Bopt = B_1 = W_{1,s_1}.

Fig. 4. Bisection as an ε-approximation algorithm.

However, if b > 1, then because of the greedy-choice property, P_1 should be loaded as much as possible without exceeding Bopt = B_b < B_1, which implies that s_1 = i_1 − 1 and hence L_1 = W_{1,i_1−1}. If b = 2, then s_2 is equal to the smallest index i_2 such that PROBE(W_{i_1,i_2}) = TRUE, and Bopt = B_2 = W_{i_1,i_2}. If b > 2, then s_2 = i_2 − 1. We iterate for b = 1, 2, …, P − 1, computing i_b as the smallest index for which PROBE(W_{i_{b−1},i_b}) = TRUE, and save B_b = W_{i_{b−1},i_b}, with i_P = N. Finally, the optimal bottleneck value is selected as Bopt = min_{1≤b≤P} B_b.

Fig. 5 illustrates Nicol's algorithm. As seen in this figure, given i_{b−1}, i_b is found by performing a binary search over all subchain weights of the form W_{i_{b−1},j} for i_{b−1} ≤ j ≤ N in the bth iteration of the for-loop. Hence, Nicol's algorithm performs O(log N) PROBE calls to find i_b at iteration b, and each probe call costs O(P log(N/P)). Thus the cost of computing an individual B_b value is O(P log N log(N/P)). Since P − 1 such B_b values are computed, the overall complexity of Nicol's algorithm is O(N + P² log N log(N/P)), where the O(N) cost comes from the initial prefix-sum operation on W.

Two possible implementations of Nicol's algorithm are presented in Fig. 5. Fig. 5(a) illustrates a straightforward implementation, whereas Fig. 5(b) illustrates a careful implementation that maintains and uses the information from previous probes to answer some probes without calling the PROBE function. As seen in Fig. 5(b), this information is efficiently maintained as an undetermined bottleneck-value range (LB, UB), which is dynamically refined after each probe. Any bottleneck value encountered outside the current range is immediately accepted or rejected. Although this simple scheme does not improve the asymptotic complexity of the algorithm, it drastically reduces the number of probes, as discussed in Section 6.
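Nicol's search can be sketched as follows (an illustrative Python version that, unlike the careful implementation of Fig. 5(b), re-runs the full probe for every candidate instead of maintaining a dynamic (LB, UB) range):

```python
from bisect import bisect_right

def probe(W, P, B):
    # greedy left-to-right probe on the prefix-sum array W (W[0] = 0)
    N, s = len(W) - 1, 0
    for _ in range(P):
        s = bisect_right(W, W[s] + B) - 1
        if s == N:
            return True
    return False

def nicol(weights, P):
    """Search for Bopt among realizable subchain weights W[j] - W[i]:
    for each candidate bottleneck processor b, binary-search the smallest
    i_b whose subchain weight is feasible, record it as B_b, and keep the
    processor one task short (s_b = i_b - 1) before moving on."""
    W = [0]
    for w in weights:
        W.append(W[-1] + w)
    N = len(weights)
    i, best = 0, W[N]                  # Wtot is trivially feasible
    for b in range(1, P):
        lo, hi = i, N
        while lo < hi:                 # smallest j with W[j]-W[i] feasible
            mid = (lo + hi) // 2
            if probe(W, P, W[mid] - W[i]):
                hi = mid
            else:
                lo = mid + 1
        best = min(best, W[lo] - W[i]) # candidate B_b
        i = lo - 1                     # greedy: stay one task short
    return min(best, W[N] - W[i])      # B_P = load of the last processor
```

For the weights (2, 3, 1, 4) with P = 2, the first binary search finds B_1 = 5 and the final candidate is B_2 = 8, so the minimum, 5, is returned.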

4. Proposed CCP algorithms

In this section, we present the proposed methods. First, we describe how to bound the separator indices for an optimal solution to reduce the search space. Then we show how this technique can be used to improve the performance of the dynamic-programming algorithm. We continue with our discussion on improving the MS algorithm, and propose a novel refinement algorithm, which we call the bidding algorithm. Finally, we discuss parametric-search methods, proposing improvements for the bisection and Nicol's algorithms.

4.1. Restricting the search space

Our proposed CCP algorithms exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator values as a preprocessing step. Natural lower and upper bounds for the optimal bottleneck value Bopt of a given CCP problem instance (W, N, P) are LB = max{B*, wmax} and UB = B* + wmax, respectively, where B* = Wtot/P. Since wmax < B* in coarse-grain parallelization of most real-world applications, our presentation will be for wmax < B* = LB, even though all results are valid when B* is replaced with max{B*, wmax}. The following lemma describes how to use these natural bounds on Bopt to restrict the search space for the separator values.

Fig. 5. Nicol's algorithm [27]: (a) straightforward implementation; (b) careful implementation with dynamic bottleneck-value bounding.

Lemma 4.1. For a given CCP problem instance (W, N, P), if Bf is a feasible bottleneck value in the range [B*, B* + wmax], then there exists a partition P = ⟨s_0, s_1, …, s_P⟩ of cost C(P) ≤ Bf with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p and SH_p are, respectively, the smallest and largest indices such that

W_{1,SL_p} ≥ p(B* − wmax(P − p)/P)  and
W_{1,SH_p} ≤ p(B* + wmax(P − p)/P).

Proof. Let Bf = B* + w, where 0 ≤ w < wmax. Partition P can be constructed by PROBE(Bf), which loads the first p processors as much as possible, subject to L_q ≤ Bf for q = 1, 2, …, p. In the worst case, w_{s_p+1} = wmax for each of the first p processors. Thus we have W_{1,s_p} ≥ f(w) = p(B* + w − wmax) for p = 1, 2, …, P − 1. However, it should be possible to divide the remaining subchain T_{s_p+1,N} into P − p parts without exceeding Bf, i.e., W_{s_p+1,N} ≤ (P − p)(B* + w). Thus we also have W_{1,s_p} ≥ g(w) = Wtot − (P − p)(B* + w). Note that f(w) is an increasing function of w, whereas g(w) is a decreasing function of w. The minimum of max{f(w), g(w)} is at the intersection of f(w) and g(w), so that W_{1,s_p} ≥ p(B* − wmax(P − p)/P).

To prove the upper bounds, we can start with W_{1,s_p} ≤ f(w) = p(B* + w), which holds when L_q = B* + w for q = 1, …, p. The condition W_{s_p+1,N} ≥ (P − p)(B* + w − wmax), however, ensures the feasibility of Bf = B* + w, since PROBE(Bf) can always load each of the remaining (P − p) processors with B* + w − wmax. Thus we also have W_{1,s_p} ≤ g(w) = Wtot − (P − p)(B* + w − wmax). Here f(w) is an increasing function of w, whereas g(w) is a decreasing function of w, which yields W_{1,s_p} ≤ p(B* + wmax(P − p)/P). □

Corollary 4.2. The separator range weights are ΔW_p = Σ_{i=SL_p}^{SH_p} w_i = W_{1,SH_p} − W_{1,SL_p} = 2 wmax p(P − p)/P, with a maximum value P wmax/2 at p = P/2.

Applying this corollary requires finding wmax, which entails an overhead equivalent to that of the prefix-sum operation and hence should be avoided. In this work, we adopt a practical scheme to construct the bounds on separator indices. We run the RB heuristic to find a hopefully good bottleneck value BRB and use BRB as an upper bound on bottleneck values, i.e., UB = BRB. Then we run LR-PROBE(BRB) and RL-PROBE(BRB) to construct two mappings P1 = ⟨h1_0, h1_1, …, h1_P⟩ and P2 = ⟨c2_0, c2_1, …, c2_P⟩ with C(P1), C(P2) ≤ BRB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered as the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left; that is, RL-PROBE assigns subchains from the right end towards the left end of the task chain to processors in the order P_P, P_{P−1}, …, P_1. From these two mappings, lower and upper bound values for the s_p separator indices are constructed as SL_p = c2_p and SH_p = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings P3 = ⟨c3_0, c3_1, …, c3_P⟩ and P4 = ⟨h4_0, h4_1, …, h4_P⟩, and then defining SL_p = max{SL_p, c3_p} and SH_p = min{SH_p, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.
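The practical bounding scheme can be sketched as follows (illustrative Python; `lr_probe`/`rl_probe` return the separator-index sequences corresponding to h1, c2, c3, h4 above, and `B_rb` stands for the feasible bottleneck value returned by the RB heuristic):

```python
from bisect import bisect_left, bisect_right

def lr_probe(W, P, B):
    """Left-to-right probe on the prefix-sum array W (W[0] = 0): returns
    separator indices s_1..s_{P-1} of the greedy partition that loads each
    processor maximally subject to B."""
    s, seps = 0, []
    for _ in range(P - 1):
        s = bisect_right(W, W[s] + B) - 1
        seps.append(s)
    return seps

def rl_probe(W, P, B):
    """Right-to-left dual: assigns maximal subchains from the right end,
    yielding the smallest realizable separator indices."""
    s, seps = len(W) - 1, []
    for _ in range(P - 1):
        s = bisect_left(W, W[s] - B)   # smallest j with W[s] - W[j] <= B
        seps.append(s)
    seps.reverse()
    return seps

def separator_bounds(weights, P, B_rb):
    """SL_p = max(c2_p, c3_p), SH_p = min(h1_p, h4_p), following the
    practical scheme above."""
    W = [0]
    for w in weights:
        W.append(W[-1] + w)
    Bstar = W[-1] / P                  # ideal bottleneck value B*
    h1 = lr_probe(W, P, B_rb)          # upper bounds from LR-PROBE(B_RB)
    c2 = rl_probe(W, P, B_rb)          # lower bounds from RL-PROBE(B_RB)
    c3 = lr_probe(W, P, Bstar)         # refinement with LR-PROBE(B*)
    h4 = rl_probe(W, P, Bstar)         # refinement with RL-PROBE(B*)
    SL = [max(a, b) for a, b in zip(c2, c3)]
    SH = [min(a, b) for a, b in zip(h1, h4)]
    return SL, SH
```

For eight unit-weight tasks and P = 4 with B_rb = 3, for instance, the bounds collapse to the single optimal separator set {2, 4, 6}.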

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value Bf, let P1 = ⟨h1_0, h1_1, …, h1_P⟩ and P2 = ⟨c2_0, c2_1, …, c2_P⟩ be the partitions constructed by LR-PROBE(Bf) and RL-PROBE(Bf), respectively. Then any partition P = ⟨s_0, s_1, …, s_P⟩ of cost C(P) = B ≤ Bf satisfies c2_p ≤ s_p ≤ h1_p.

Proof. By the property of LR-PROBE(Bf), h1_p is the largest index such that T_{1,h1_p} can be partitioned into p parts without exceeding Bf. If s_p > h1_p, then the bottleneck value will exceed Bf, and thus B. By the property of RL-PROBE(Bf), c2_p is the smallest index such that T_{c2_p,N} can be partitioned into P − p parts without exceeding Bf. If s_p < c2_p, then the bottleneck value will exceed Bf, and thus B. □

Lemma 4.4. For a given CCP problem instance (W, N, P), let P3 = ⟨c3_0, c3_1, …, c3_P⟩ and P4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(B*) and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value Bf, there exists a partition P = ⟨s_0, s_1, …, s_P⟩ of cost C(P) ≤ Bf that satisfies c3_p ≤ s_p ≤ h4_p.

Proof. Consider the partition P = ⟨s_0, s_1, …, s_P⟩ constructed by LR-PROBE(Bf). It is clear that this partition already satisfies the lower bounds, i.e., s_p ≥ c3_p. Assume s_p > h4_p; then the partition P′ obtained by moving s_p back to h4_p also yields a partition with cost C(P′) ≤ Bf, since T_{h4_p+1,N} can be partitioned into P − p parts without exceeding B*. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions with cost ≤ Bf within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance (W, N, P) and a feasible bottleneck value Bf, let P1 = ⟨h1_0, h1_1, …, h1_P⟩, P2 = ⟨c2_0, c2_1, …, c2_P⟩, P3 = ⟨c3_0, c3_1, …, c3_P⟩, and P4 = ⟨h4_0, h4_1, …, h4_P⟩ be the partitions constructed by LR-PROBE(Bf), RL-PROBE(Bf), LR-PROBE(B*), and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B in the range [B*, Bf], there exists a partition P = ⟨s_0, s_1, …, s_P⟩ of cost C(P) ≤ B with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p = max{c2_p, c3_p} and SH_p = min{h1_p, h4_p}.

Corollary 4.6. The separator range weights become ΔW_p = 2 min{p, P − p} wmax in the worst case, with a maximum value P wmax at p = P/2.

Lemma 4.1 and Corollary 4.5 imply the following theorem, since B* ≤ Bopt ≤ B* + wmax.

Theorem 4.7. For a given CCP problem instance (W, N, P), with SL_p and SH_p index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition Popt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the separator range weights produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however, and the practical scheme normally finds much better bounds, since order in the chain usually prevents the worst-case behavior and BRB < B* + wmax. Experimental results in Section 6 justify this expectation.

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size ΔS_p = SH_p − SL_p + 1 denotes the number of tasks within the pth range [SL_p, SH_p]. Miguet and Pierson [24] propose the model w_i = Θ(wavg) for i = 1, 2, …, N to prove that their H1 and H2 heuristics allocate Θ(N/P) tasks to each processor, where wavg = Wtot/N is the average task weight. This assumption means that the weight of each task is not too far away from the average task weight. Using Corollaries 4.2 and 4.5, this model induces ΔS_p = O(P wmax/wavg). Moreover, this model can be exploited to induce the optimistic bound ΔS_p = O(P). However, we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from wavg. Here we establish a looser and more realistic model on task weights, so that for any subchain T_{i,j} with weight W_{i,j} sufficiently larger than wmax, the average task weight within subchain T_{i,j} is O(wavg). That is, Δ_{i,j} = j − i + 1 = O(W_{i,j}/wavg). This model, referred to here as model M, directly induces ΔS_p = O(P wmax/wavg), since ΔW_p ≤ ΔW_{P/2} = P wmax/2 for p = 1, 2, …, P − 1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index bound arrays, each of size P, computed according to Corollary 4.5 with Bf = BRB. Note that SL_P = SH_P = N, since only B[P, N] needs to be computed in the last row. As seen in Fig. 6, only the B^p_j values for j = SL_p, SL_p + 1, …, SH_p are computed at each row p by exploiting Corollary 4.5, which ensures the existence of an optimal partition Popt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p. Thus these B^p_j values suffice for correct computation of the B^{p+1}_i values for i = SL_{p+1}, SL_{p+1} + 1, …, SH_{p+1} at the next row p + 1.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SH_p to SH_p + 1 within the repeat–until-loop while computing B^{p+1}_i with SL_{p+1} ≤ i ≤ SH_{p+1} in two cases. In both cases, the functions W_{j+1,i} and B^p_j intersect in the open interval (SH_p, SH_p + 1), so that B^p_{SH_p} < W_{SH_p+1,i} and B^p_{SH_p+1} ≥ W_{SH_p+2,i}. In the first case, i = SH_p + 1, so that W_{j+1,i} and B^p_j intersect in (i − 1, i), which implies that B^{p+1}_i = W_{i−1,i} with j^{p+1}_i = SL_p, since W_{i−1,i} < B^p_i, as mentioned in Section 3.2. In the second case, i > SH_p + 1, for which Corollary 4.5 guarantees that B^{p+1}_i = W_{SL_p+1,i} ≤ B^p_{SH_p+1}, and thus we can safely select j^{p+1}_i = SL_p. Note that W_{SL_p+1,i} = B^p_{SH_p+1} may correspond to a case leading to another optimal partition with j^{p+1}_i = s_{p+1} = SH_p + 1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ to B^p_{SH_p+1} as a sentinel. Hence, in such cases, the condition W_{SH_p+1,i} < B^p_{SH_p+1} = ∞ in the if–then statement following the repeat–until-loop is always true, so that the j-index automatically moves back to SH_p. The scheme of computing B^p_{SH_p+1} for each row p, which seems to be a natural solution, does not work, since correct computation of B^{p+1}_{SH_{p+1}+1} may necessitate more than one B^p_j value beyond the SH_p index bound.

A nice feature of the DP approach is that it can be used to generate all optimal partitions by maintaining a P × N matrix to store the minimum j^p_i index values defining the B^p_i values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space for at least one optimal solution. The index bounds can be computed according to Lemma 4.4 for this purpose, since the search space restricted by Lemma 4.4 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔS_p). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔS_p = O(P wmax/wavg), and hence the complexity is O(N + P log N + P² wmax/wavg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition wmax = O(2 Wtot/P²).

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.
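For contrast with DP+, the underlying DP recurrence without any separator-index bounding can be written directly (an illustrative Python sketch, O(N²P); the DP+ algorithm of Fig. 6 instead computes each row p only for j in [SL_p, SH_p]):

```python
def dp_ccp(weights, P):
    """Reference DP for chains-on-chains: B[p][j] = min over i of
    max(B[p-1][i], W[j] - W[i]) is the cost of the best p-way split of
    the first j tasks."""
    N = len(weights)
    W = [0]
    for w in weights:
        W.append(W[-1] + w)
    INF = float("inf")
    B = [INF] + [W[j] for j in range(1, N + 1)]   # p = 1: one processor
    for p in range(2, P + 1):
        newB = [INF] * (N + 1)
        for j in range(p, N + 1):
            # try every last separator i; W[j] - W[i] is the last part's load
            newB[j] = min(max(B[i], W[j] - W[i]) for i in range(p - 1, j))
        B = newB
    return B[N]
```

This quadratic scan over i is exactly what the static bounding eliminates when the ranges [SL_p, SH_p] are narrow.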

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for small-to-medium numbers of processors. The main difference between the MS and the bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as an initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

Fig. 7. Bidding algorithm.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B* = Wtot/P, until it finds a feasible partition, which is also optimal. Consider a partition Π_t = ⟨s_0, s_1, …, s_P⟩ constructed by PROBE(B_t) for an infeasible B_t. After detecting the infeasibility of this B_t value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > B_t will never be to the left of the respective separator indices of Π_t. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current B_t value (i.e., L_P > B_t). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P}. Here, we call the L_p + w_{s_p+1} value the bid of processor P_p, which refers to the load of P_p if the first task t_{s_p+1} of the next processor is augmented to P_p. The bid of the last processor P_P is equal to the load of the remaining tasks. If the smallest bid B comes from processor P_b, probing with the new B is performed only for the remaining processors ⟨P_b, P_{b+1}, …, P_P⟩ in the suffix W_{s_{b−1}+1, N} of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while-loop implements a linear probing scheme such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index s_p is set for processor P_p during linear probing, the repeat-until loop terminates if it is not possible to partition the remaining subchain T_{s_p+1, N} onto the remaining P − p processors without exceeding the current B value, i.e., rbid = L_r/(P − p) > B, where L_r denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value of the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P·wmax + P²(wmax/wavg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial setting of the separators through binary search. The B value is increased at most B_opt − B* < wmax times, and each time the next B value can be computed in O(P) time, which induces the O(P·wmax) cost. The total area scanned by the separators is at most O(P²(wmax/wavg)). For noninteger task weights, the complexity can reach O(P log N + P³(wmax/wavg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
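For illustration, the bidding scheme can be sketched as follows. This is a simplified Python sketch, not the pseudocode of Fig. 7: it restarts each probe from the first processor and recomputes all bids at every step, whereas Fig. 7 resumes only from the defining processor and maintains the prefix-minimum BIDS array; the function and variable names are ours.

```python
def bid_partition(w, P):
    """Simplified sketch of the bidding scheme: grow the bottleneck value B
    from the ideal value B* = Wtot/P; after each infeasible probe, set B to
    the smallest bid, i.e., the smallest processor load obtainable by moving
    one part's end-index right by one position."""
    N = len(w)
    W = [0] * (N + 1)                       # prefix sums: W[i] = w_1 + ... + w_i
    for i, wi in enumerate(w):
        W[i + 1] = W[i] + wi
    B = W[N] / P                            # ideal bottleneck value B*
    while True:
        s, loads = [0], []                  # separator indices and part loads
        for p in range(P - 1):
            j = s[-1]
            while j < N and W[j + 1] - W[s[-1]] <= B:
                j += 1                      # linear probing: extend part p while load <= B
            loads.append(W[j] - W[s[-1]])
            s.append(j)
        last = W[N] - W[s[-1]]              # load of the last processor
        if last <= B:                       # feasible, hence optimal for this scheme
            return B, s + [N]
        # infeasible: bids of the first P-1 processors, plus the last load
        bids = [loads[p] + w[s[p + 1]] for p in range(P - 1) if s[p + 1] < N]
        B = min(bids + [last])              # smallest candidate value > B
```

Since every bid strictly exceeds the current infeasible B, the bottleneck value increases monotonically through realizable candidates only, so the first feasible value reached is optimal.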

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

4.4. Parametric search algorithms

In this work, we apply the theoretical findings given in Section 4.1 for an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with B_f = B_RB) to restrict the search space for the s_p separator values during the binary searches on the W array. That is, BINSRCH(W, SL_p, SH_p, Bsum) in RPROBE searches W in the index range [SL_p, SH_p] to find the index SL_p ≤ s_p ≤ SH_p such that W[s_p] ≤ Bsum and W[s_p + 1] > Bsum, via binary search. This scheme and Corollaries 4.2 and 4.6 reduce the complexity of an individual probe to Σ_{p=1}^{P} Θ(log Δ_p) = O(P log P + P log(wmax/wavg)). Note that this complexity reduces to O(P log P) for sufficiently large P, where P = Ω(wmax/wavg). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, B_RB]

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

as opposed to [B*, Wtot]. In this algorithm, if PROBE(B_t) = TRUE, then the search space is restricted to the values B ≤ B_t, and if PROBE(B_t) = FALSE, then the search space is restricted to the values B > B_t. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator-index bounds depending on the success or failure of the probes. Let Π_t = ⟨t_0, t_1, …, t_P⟩ be the partition constructed by PROBE(B_t). Any future PROBE(B) call with B ≤ B_t will set the s_p indices with s_p ≤ t_p; thus, the search space for s_p can be restricted to the indices ≤ t_p. Similarly, any future PROBE(B) call with B ≥ B_t will set the s_p indices ≥ t_p; thus, the search space for s_p can be restricted to the indices ≥ t_p.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for-loop over the SL or SH arrays, depending on the failure or success, respectively, of RPROBE(B_t). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π or SH ← Π, depending on the failure or success of RPROBE(B_t).
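For concreteness, the bisection driver with the O(1) bound update can be sketched as follows (Python; the probe is passed in as a function with the signature probe(W, P, B, SL, SH) returning (feasible, separators); the variable names are ours, and the initial upper bound is taken conservatively as Wtot rather than B_RB).

```python
def eps_bisect(W, P, probe, eps):
    """Sketch of the epsilon-approximation bisection with dynamic
    separator-index bounding: a successful probe makes its partition the new
    upper-bound array SH, a failed probe makes it the new lower-bound array
    SL, via an O(1) reference rebinding rather than a Theta(P) copy."""
    N = len(W) - 1
    LB = W[N] / P                           # ideal bottleneck value B*
    UB = float(W[N])                        # conservative upper bound (B_RB in the paper)
    SL = [0] * (P + 1)
    SH = [N] * (P + 1)
    while UB - LB > eps:
        Bt = (LB + UB) / 2
        ok, s = probe(W, P, Bt, SL, SH)
        if ok:
            UB, SH = Bt, s                  # future B <= Bt: separators only move left
        else:
            LB, SL = Bt, s                  # future B > Bt: separators only move right
    return UB
```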




Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, B_RB < B* + wmax. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log wmax, and thus the overall complexity is O(N + P log N + log(wmax)(P log P + P log(wmax/wavg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value, i.e., the total weight of a subchain of W. This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps to find the optimal bottleneck value.

After a probe RPROBE(B_t), the current upper bound value UB is modified if RPROBE(B_t) succeeds. Note that RPROBE(B_t) not only determines the feasibility of B_t, but also constructs a partition Π_t with cost(Π_t) ≤ B_t. Instead of reducing the upper bound UB to B_t, we can further reduce UB to the bottleneck value B = cost(Π_t) ≤ B_t of the partition Π_t constructed by RPROBE(B_t). Similarly, the current lower bound LB is modified when RPROBE(B_t) fails. In this case, instead of increasing the bound LB to B_t, we can exploit the partition Π_t constructed by RPROBE(B_t) to increase LB further, to the smallest realizable bottleneck value B greater than B_t. Our bidding algorithm already describes how to compute

B = min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P},

where L_p denotes the load of processor P_p in Π_t. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(wmax/wavg)),

which has the solution O(P log P log N + P log N log(wmax/wavg)). Here, O(P log P + P log(wmax/wavg)) is the cost of a probe operation and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(wmax/wavg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
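A compact sketch of this exact bisection idea is given below (illustrative Python with a plain greedy probe over the full index range and the naive initial range [B*, Wtot]; the paper's EBS instead uses RPROBE with separator-index bounds and starts from [B*, B_RB]; the names are ours).

```python
import bisect

def exact_bisect(w, P):
    """Sketch of the exact bisection scheme: each probe snaps the bounds to
    *realizable* bottleneck values (subchain weights), so the search ends at
    the optimal value after finitely many steps."""
    N = len(w)
    W = [0] * (N + 1)                       # prefix sums: W[i] = w_1 + ... + w_i
    for i, wi in enumerate(w):
        W[i + 1] = W[i] + wi

    def probe(B):
        # greedy probe: returns (feasible, separators, part loads)
        s, loads = [0], []
        for p in range(P - 1):
            j = max(bisect.bisect_right(W, W[s[-1]] + B) - 1, s[-1])
            loads.append(W[j] - W[s[-1]])
            s.append(j)
        loads.append(W[N] - W[s[-1]])       # load of the last processor
        s.append(N)
        return loads[-1] <= B, s, loads

    LB, UB = W[N] / P, W[N]                 # naive range [B*, Wtot]
    while LB < UB:
        Bt = (LB + UB) / 2
        ok, s, loads = probe(Bt)
        if ok:
            UB = max(loads)                 # snap UB down to cost of the constructed partition
        else:
            # snap LB up to the smallest realizable value > Bt (the best bid)
            bids = [loads[p] + w[s[p + 1]] for p in range(P - 1) if s[p + 1] < N]
            LB = min(bids + [loads[-1]])
    return UB
```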

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = Wtot), the successes and failures of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π_t = ⟨0, t_1, t_2, …, t_{P−1}, N⟩ to be constructed by any PROBE(B_t) with LB_1 < B_t = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < B_t < W_{1,i_1}. Hence, probes with LB_1 < B_t < UB_1 for processor P_2 can be restricted to be performed in W^{i_1,N}, where W^{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T_p^i denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T_{i,N} = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P_{p,P} = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T_{i_1,N}. That is, B_opt = min{B_1 = W_{1,i_1}, C(T_1^{i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, then B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(T_p^{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i_{b−1}, the index i_b is found in the inner while-loop by conducting probes on W^{i_{b−1},N} to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1}+1,j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + wmax(P log P + P log(wmax/wavg))) for integer task weights, because the number of probes cannot exceed wmax, since there are at most wmax distinct bound values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + wmax(P log P)² + wmax P² log P log(wmax/wavg)), since the algorithm makes O(wmax P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.
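The divide-and-conquer recursion can be sketched as follows (illustrative Python with plain greedy probes; the dynamic separator-index and bottleneck-value bounding of Fig. 10 are omitted for brevity, and the names are ours).

```python
import bisect

def nicol(w, P):
    """Sketch of the divide-and-conquer search: for each processor in turn,
    binary-search the smallest end-index whose subchain weight is a feasible
    bottleneck for the remaining chain, then recurse on the suffix with one
    processor less."""
    N = len(w)
    W = [0] * (N + 1)                       # prefix sums
    for i, wi in enumerate(w):
        W[i + 1] = W[i] + wi

    def feasible(i, parts, B):
        # greedy probe: can tasks i+1..N be covered by `parts` parts of load <= B?
        j = i
        for _ in range(parts):
            j = max(bisect.bisect_right(W, W[j] + B) - 1, j)
        return j == N

    def solve(i, p):
        # optimal bottleneck for partitioning tasks i+1..N onto p processors
        if p == 1:
            return W[N] - W[i]
        lo, hi = i + 1, N                   # smallest j with W_{i+1,j} feasible overall
        while lo < hi:
            mid = (lo + hi) // 2
            if feasible(i, p, W[mid] - W[i]):
                hi = mid
            else:
                lo = mid + 1
        # either the first part ends at lo, or it ends earlier and the
        # suffix starting at task lo determines the bottleneck
        return min(W[lo] - W[i], solve(lo - 1, p - 1))

    return solve(0, P)
```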

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A^r_p of A and is responsible for computing y_p = A^r_p x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A^c_p of A and is responsible for computing y^p = A^c_p x_p, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing the communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements due to the irregular mapping table and the extra level of indirection in locating the distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = Wtot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{ij} can be efficiently computed as W_{ij} = ROW[j + 1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS, with rows and columns interchanged; thus, W_{ij} is computed as W_{ij} = COL[j + 1] − COL[i].
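As an illustration (0-based row indexing; the function name is ours), the constant-time subchain-weight computation from the CRS row-pointer array can be written as:

```python
def subchain_weight(ROW, i, j):
    """Number of nonzeros in rows i..j (0-based, inclusive) of a CRS matrix,
    i.e., the load W_{i,j} of assigning those rows to one processor, read off
    the row-pointer array in O(1) time with no prefix-sum preprocessing."""
    return ROW[j + 1] - ROW[i]
```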

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas the inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, the DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{ij} as W_{ij} = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, the screen space is decomposed

into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to the uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting such primitives multiple times while adding the mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.
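A sketch of the tallying step follows (our function name and data layout; each primitive is represented simply by the list of coarse-mesh cells its bounding box intersects):

```python
def tally_inverse_area(primitive_cells, mesh_shape):
    """Sketch of the inverse-area heuristic: each primitive adds 1/k to every
    coarse-mesh cell it intersects, where k is the number of cells it
    intersects, so summing the cell weights over a region approximates the
    region's primitive count without multiple counting."""
    rows, cols = mesh_shape
    weight = [[0.0] * cols for _ in range(rows)]
    for cells in primitive_cells:
        share = 1.0 / len(cells)            # inverse of the intersected-cell count
        for (r, c) in cells:
            weight[r][c] += share
    return weight
```

Summing all cell weights recovers the total number of primitives exactly; the per-region sums are approximate only where primitives are shared between regions.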

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. The triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 × 256, 512 × 512, and 1024 × 1024 superimposed on a screen of resolution 1024 × 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because the zero-weight tasks are compressed away to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC− and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2
Properties of the test set

Sparse-matrix (SpM) dataset:

  Name     No. of tasks N   Total Wtot    wavg   wmin   wmax   SpMxV time (ms)
  NL            7,039         105,089    14.93     1     361       22.55
  cre-d         8,926         372,266    41.71     1     845       72.20
  CQ9           9,278         221,590    23.88     1     702       45.90
  ken-11       14,694          82,454     5.61     2     243       19.65
  mod2         34,774         604,910    17.40     1     941      124.05
  world        34,506         582,064    16.87     1     972      119.45

Direct volume rendering (DVR) dataset:

  Name       No. of tasks N   Total Wtot    wavg    wmin      wmax
  blunt256       17,303          303 K     17.54   0.020   1,590.97
  blunt512       93,231          314 K      3.36   0.004     661.50
  blunt1024     372,824          352 K      0.94   0.001     411.04
  post256        19,653          495 K     25.19   0.077   3,245.50
  post512       134,950          569 K      4.22   0.015   1,092.00
  post1024      539,994          802 K      1.49   0.004   1,546.78

(wavg, wmin, and wmax are per-task weights, i.e., per row/column for the SpM dataset; the SpMxV time is the execution time of a single sequential SpMxV.)


Figs. 5(a) and (b), respectively. Abbreviations ending with "+" represent our improved versions of these algorithms. That is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms, with ε = 1, on the integer-valued workload arrays of the SpM dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of the heuristics and the exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between the exact algorithms and the heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges, normalized with respect to N, i.e., Σ_{p=1}^{P−1} ΔS_p / N, where ΔS_p = SH_p − SL_p + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for the separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of the decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range size below N for each CCP instance with P ≤ 64, in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. The column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B_RB instead of Wtot


Table 3
Percent load-imbalance values

Sparse-matrix dataset:

  Name     P       H1      H2      RB     OPT
  NL       16     2.60    2.44    1.20    0.35
           32     5.02    5.75    3.44    0.95
           64     8.95    9.01    5.60    2.37
          128    33.13   27.16   22.78    4.99
          256    69.55   69.55   60.78   14.25
  cre-d    16     2.27    0.98    0.53    0.45
           32     4.19    4.42    3.74    1.03
           64     7.12    4.92    4.34    1.73
          128    25.57   18.73   16.70    2.88
          256    37.54   26.81   35.20   10.95
  CQ9      16     1.85    1.85    0.58    0.58
           32     5.65    2.88    2.24    0.90
           64    13.25   11.49    7.64    1.43
          128    33.96   32.22   22.34    3.51
          256    58.62   58.62   58.62   14.72
  ken-11   16     3.74    2.01    0.98    0.21
           32     3.74    4.67    3.74    1.18
           64    13.17   13.17   13.17    1.29
          128    13.17   16.89   13.17    6.80
          256    50.99   50.99   50.99    7.11
  mod2     16     0.06    0.06    0.06    0.03
           32     0.19    0.19    0.19    0.07
           64     7.42    2.72    2.18    0.18
          128    16.15    6.29    2.46    0.41
          256    19.47   19.47   18.92    1.23
  world    16     0.27    0.18    0.09    0.04
           32     1.03    0.37    0.27    0.08
           64     4.73    4.73    4.73    0.28
          128     6.37    6.37    6.37    0.76
          256    27.99   27.41   27.41    1.11
  Averages over the problems:
           16     1.76    1.15    0.74    0.36
           32     3.99    3.39    2.88    0.76
           64     9.01    6.78    5.69    1.43
          128    19.97   17.66   14.09    3.36
          256    43.89   38.91   36.04    9.18

DVR dataset:

  Name       P       H1      H2      RB     OPT
  blunt256   16     4.60    1.20    0.49    0.34
             32     6.93    2.61    1.94    1.12
             64    14.52    9.44    9.44    2.31
            128    38.25   24.39   16.67    4.82
            256    96.72   37.03   34.21   34.21
  blunt512   16     0.95    0.98    0.98    0.16
             32     1.38    1.38    1.18    0.33
             64     2.87    2.69    1.66    0.53
            128     5.62    8.45    4.62    0.97
            256    14.18   14.33    9.34    2.28
  blunt1024  16     0.94    0.57    0.95    0.10
             32     1.89    1.21    0.97    0.14
             64     4.99    2.16    1.44    0.26
            128    10.06    4.25    2.47    0.57
            256    19.65    9.68    9.68    0.98
  post256    16     1.10    1.43    0.76    0.56
             32     3.23    3.98    3.23    1.11
             64    17.28   11.04   10.90    3.10
            128    45.35   29.09   29.09    8.29
            256    67.86   67.86   67.86   67.86
  post512    16     0.49    1.25    0.33    0.33
             32     0.94    1.61    0.90    0.58
             64     4.85    5.33    4.85    0.94
            128    18.03   14.55   10.15    1.72
            256    30.03   25.29   25.29    3.73
  post1024   16     0.70    0.70    0.53    0.20
             32     1.49    1.49    1.41    0.54
             64     2.85    2.85    1.49    0.91
            128    13.15   11.79    9.10    1.11
            256    40.50   13.37   14.19    2.54
  Averages over the problems:
             16     1.44    1.01    1.04    0.28
             32     2.64    2.05    1.61    0.64
             64     7.89    5.58    4.96    1.31
            128    21.74   15.42   12.02    2.91
            256    44.82   27.93   26.76   18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only a minor improvement on the SpM dataset, because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percentages of single SpMxV times. For the DVR dataset, actual execution times (in milliseconds) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification, so that the improvements can be seen easily. In both tables, BID is listed separately, since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS as


Table 4
Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset | DVR dataset
Name P (N/P)P L1 L3 C5 | Name P (N/P)P L1 L3 C5
NL 16 1597 032 013 0036 | blunt256 16 1599 038 007 0003
32 3186 123 058 0198 | 32 3194 122 030 0144
64 6343 497 224 0964 | 64 6377 510 381 0791
128 12569 1981 1960 4305 | 128 12706 2166 1288 2784
256 24673 7796 8283 22942 | 256 25223 10168 5048 10098
cre-d 16 1597 016 004 0019 | blunt512 16 1600 009 017 0006
32 3189 076 090 0137 | 32 3199 035 026 0043
64 6355 289 185 0610 | 64 6396 164 069 0144
128 12618 1159 1512 2833 | 128 12783 665 459 0571
256 24869 4666 5754 16510 | 256 25530 2805 1671 2262
CQ9 16 1597 030 003 0023 | blunt1024 16 1600 005 009 0010
32 3189 121 034 0187 | 32 3200 018 018 0018
64 6357 484 296 0737 | 64 6399 077 074 0068
128 12625 1920 1866 3121 | 128 12796 343 253 0300
256 24896 7484 8680 23432 | 256 25582 1392 2036 1648
ken-11 16 1598 028 011 0024 | post256 16 1599 035 004 0021
32 3193 110 113 0262 | 32 3195 150 052 0162
64 6373 442 734 0655 | 64 6379 612 426 0932
128 12689 1760 1456 4865 | 128 12717 2648 2368 4279
256 25156 6918 8544 13076 | 256 25268 12476 9048 16485
mod2 16 1599 016 000 0002 | post512 16 1600 013 002 0003
32 3197 055 004 0013 | 32 3199 056 014 0063
64 6388 214 122 0109 | 64 6397 233 241 0413
128 12753 862 259 0411 | 128 12788 933 1015 1714
256 25412 3449 3902 2938 | 256 25552 3708 4628 6515
world 16 1599 015 001 0004 | post1024 16 1600 017 007 0020
32 3197 059 006 0020 | 32 3200 054 027 0087
64 6388 230 281 0158 | 64 6399 219 061 0210
128 12753 923 719 0850 | 128 12797 877 948 0918
256 25411 3697 5315 2635 | 256 25588 3497 2728 4479
Averages over P
16 1597 021 006 0018 | 16 1600 020 008 0011
32 3188 086 068 0136 | 32 3198 073 028 0096
64 6352 343 263 0516 | 64 6391 303 209 0426
128 12607 1368 1274 2731 | 128 12765 1272 1055 1428
256 24826 5428 5755 14089 | 256 25457 5674 4193 6915

L1, L3, and C5 denote the separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5, respectively.


an exact algorithm for general workload arrays is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, a relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5
Number of probes performed by the parametric search algorithms

Sparse-matrix dataset | DVR dataset
Name P NC NC NC+ εBS εBS+ EBS | Name P NC NC NC+ εBS+&BID EBS
NL 16 177 21 7 17 6 7 | blunt256 16 202 18 6 16+1 9
32 352 19 7 17 7 7 | 32 407 19 7 19+1 10
64 720 27 5 16 7 6 | 64 835 18 10 21+1 12
128 1450 40 14 17 8 8 | 128 1686 25 15 22+1 13
256 2824 51 14 16 8 8 | 256 3556 203 62 23+1 11
cre-d 16 190 20 6 19 7 5 | blunt512 16 245 20 8 17+1 9
32 393 25 11 19 9 8 | 32 502 24 13 18+1 10
64 790 24 10 19 8 6 | 64 1003 23 11 18+1 12
128 1579 27 10 19 9 9 | 128 2023 26 12 20+1 13
256 3183 37 13 18 9 9 | 256 4051 23 12 21+1 13
CQ9 16 182 19 3 18 6 5 | blunt1024 16 271 21 10 16+1 13
32 373 22 8 18 8 7 | 32 557 24 12 18+1 12
64 740 29 12 18 8 8 | 64 1127 21 11 18+1 12
128 1492 40 12 17 8 8 | 128 2276 28 13 19+1 13
256 2971 50 14 18 9 9 | 256 4564 27 16 21+1 15
ken-11 16 185 21 6 17 6 6 | post256 16 194 22 9 18+1 7
32 364 20 6 17 6 6 | 32 409 19 8 20+1 12
64 721 48 9 17 7 7 | 64 823 17 11 22+1 12
128 1402 61 8 16 7 7 | 128 1653 19 13 22+1 14
256 2783 96 7 17 8 8 | 256 3345 191 102 23+1 13
mod2 16 210 20 6 20 5 4 | post512 16 235 24 9 16+1 10
32 432 26 8 19 5 5 | 32 485 23 12 18+1 11
64 867 24 6 19 8 8 | 64 975 22 13 20+1 14
128 1727 38 7 20 7 7 | 128 1975 25 17 21+1 14
256 3444 47 9 20 9 9 | 256 3947 29 20 22+1 15
world 16 211 22 6 19 5 4 | post1024 16 261 27 10 17+1 13
32 424 26 5 19 6 6 | 32 538 23 13 19+1 14
64 865 30 12 19 9 9 | 64 1090 22 12 17+1 14
128 1730 44 10 19 8 8 | 128 2201 25 15 21+1 16
256 3441 44 11 19 10 10 | 256 4428 28 16 22+1 16
Averages over P
16 193 205 57 185 58 52 | 16 235 220 87 167+1 102
32 390 230 75 183 68 65 | 32 483 220 108 187+1 115
64 784 303 90 180 78 73 | 64 976 205 113 193+1 127
128 1564 417 102 180 78 78 | 128 1969 247 142 208+1 138
256 3108 542 113 180 88 88 | 256 3982 835 380 220+1 138


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing number of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6
Partitioning times for the sparse-matrix dataset, as percents of SpMxV times

Heuristics: H1 H2 RB; existing exact algorithms: DP MS εBS NC; proposed exact algorithms: DP+ MS+ εBS+ NC+ BID
Name P H1 H2 RB DP MS εBS NC DP+ MS+ εBS+ NC+ BID
NL 16 009 009 009 93 119 093 120 040 044 040 040 022
32 018 018 013 177 194 177 217 173 173 080 080 044
64 035 035 031 367 307 315 585 581 412 186 177 182
128 089 084 071 748 485 630 1712 1987 2692 426 732 421
256 151 155 137 1461 757 1109 4031 9180 9641 980 1570 2106
cre-d 16 003 001 001 33 36 033 037 012 006 010 011 003
32 004 006 006 71 61 065 093 047 120 029 033 010
64 012 012 008 147 98 122 169 166 183 061 071 021
128 028 029 014 283 150 241 359 547 1094 168 168 061
256 053 054 025 571 221 420 1022 2705 2602 330 446 120
CQ9 16 004 004 004 59 73 048 059 020 013 015 013 017
32 009 009 007 127 120 094 131 094 072 041 041 028
64 020 017 015 248 195 185 318 301 447 100 144 068
128 044 044 031 485 303 318 852 965 1882 244 305 251
256 092 092 063 982 469 651 2033 6956 722 532 745 1492
ken-11 16 010 010 010 238 344 127 188 056 061 046 041 025
32 020 020 015 501 522 229 310 382 478 122 117 163
64 046 046 036 993 778 448 1308 941 2478 310 346 229
128 117 117 071 1876 1139 921 3959 5013 5003 641 662 174
256 214 214 142 3827 1706 1776 11913 12901 24148 1491 1450 2911
mod2 16 002 002 001 92 119 027 034 007 005 006 006 003
32 004 004 002 188 192 051 078 019 018 016 015 007
64 011 010 006 378 307 104 191 086 218 051 055 023
128 024 023 011 751 482 228 592 261 372 106 123 053
256 048 048 023 1486 756 426 1114 1211 5004 291 343 366
world 16 002 002 002 99 122 029 033 008 005 007 008 003
32 004 004 004 198 196 059 088 026 021 018 019 008
64 012 011 009 385 306 116 170 109 484 064 054 035
128 024 023 019 769 496 219 532 448 1015 131 123 177
256 047 047 036 1513 764 406 1183 1154 6999 346 350 248
Averages over P
16 005 005 005 102 136 060 078 024 022 021 020 012
32 010 010 008 210 214 112 153 123 147 051 051 043
64 023 022 018 420 332 215 457 364 704 129 141 093
128 054 053 036 819 509 426 1334 1537 201 286 352 451
256 101 101 071 1640 779 798 3549 5684 9269 662 817 1207


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial settings of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, a relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small to moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in msecs) for the DVR dataset

Heuristics: H1 H2 RB; existing exact algorithms: DP MS NC; proposed exact algorithms: DP+ MS+ NC+ EBS BID (the prefix-sum time is listed once per CCP instance)
Name P Prefix-sum H1 H2 RB DP MS NC DP+ MS+ NC+ EBS BID
blunt256 16 002 002 003 68 49 036 021 053 010 014 003
32 005 005 005 141 77 077 076 076 021 027 024
64 195 011 012 009 286 134 139 386 872 064 071 078
128 024 025 020 581 206 366 1237 2995 178 184 233
256 047 050 037 1139 296 5253 4289 078 1396 311 5669
blunt512 16 003 003 003 356 200 055 025 091 014 016 009
32 007 007 007 792 353 131 123 213 044 043 022
64 1345 018 019 014 1688 593 265 368 928 094 110 045
128 039 040 028 3469 979 584 1501 4672 229 263 165
256 074 075 054 7040 1637 970 5750 16449 559 595 1072
blunt1024 16 003 003 003 1455 780 075 093 477 025 028 019
32 012 011 008 3251 1432 181 202 892 060 062 031
64 5912 027 027 019 6976 2353 350 735 2228 127 137 139
128 050 051 037 14337 3911 914 3301 6938 336 356 1621
256 097 100 068 29417 6567 1659 18009 53810 848 898 1629
post256 16 002 003 003 79 61 046 018 013 010 011 024
32 005 005 005 157 100 077 105 174 026 031 076
64 216 010 011 010 336 167 137 526 972 061 070 467
128 025 026 018 672 278 290 2294 5510 165 177 2396
256 047 048 037 1371 338 4516 8478 077 1304 362 29932
post512 16 002 002 003 612 454 063 020 044 014 016 121
32 007 007 007 1254 699 130 249 195 038 042 336
64 2055 018 018 013 2559 1139 258 1693 3873 097 107 853
128 038 039 028 5221 1936 584 6825 15615 227 244 3154
256 070 073 053 10683 3354 1267 25648 72654 544 539 14088
post1024 16 003 004 003 2446 1863 091 288 573 017 021 263
32 009 008 008 5056 2917 161 1357 1737 051 058 1273
64 6909 025 026 017 10340 4626 356 3505 3325 135 147 3253
128 051 053 036 21102 7838 790 15634 54692 295 328 7696
256 095 098 067 43652 13770 1630 76459 145187 755 702 30070
Averages normalized with respect to RB times (prefix-sum time not included)
16 083 094 100 27863 18919 2033 2583 6950 500 589 2439
32 110 106 100 23169 12156 1847 4737 7282 583 646 3902
64 130 135 100 22636 9291 1788 8281 14519 700 781 5381
128 135 139 100 22506 7553 2046 16836 48119 860 931 8681
256 135 139 100 24731 6880 5910 39025 77299 1955 1051 28677
Averages normalized with respect to RB times (prefix-sum time included)
16 100 100 100 32 23 108 104 109 101 102 103
32 100 100 100 66 36 115 121 129 104 105 113
64 100 100 100 135 58 127 197 290 111 112 155
128 101 101 100 275 94 162 475 1053 128 130 325
256 102 102 100 528 142 798 1464 1371 295 155 2673


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a pre-processing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds into realizable values after each probe. We proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that the exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


3.4.1. Bisection as an approximation algorithm

Let f(B) be the binary-valued function where f(B) = 1 if PROBE(B) is true and f(B) = 0 if PROBE(B) is false. Clearly, f(B) is nondecreasing in B, and Bopt lies between LB = B* = Wtot/P and UB = Wtot. These observations are exploited in the bisection algorithm, leading to an efficient ε-approximate algorithm, where ε is the desired precision. The interval [Wtot/P, Wtot] is conceptually discretized into (Wtot − Wtot/P)/ε bottleneck values, and binary search is used in this range to find the minimum feasible bottleneck value Bopt. The bisection algorithm, as illustrated in Fig. 4, performs O(log(Wtot/ε)) PROBE calls, and each PROBE call costs O(P log(N/P)). Hence, the bisection algorithm runs in O(N + P log(N/P) log(Wtot/ε)) time, where the O(N) cost comes from the initial prefix-sum operation on W. The performance of this algorithm deteriorates when log(Wtot/ε) is comparable with N.
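The probe-and-bisect scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `probe` plays the role of the greedy PROBE over a prefix-sum array Wsum (Wsum[0] = 0, Wsum[i] = w1 + … + wi), and `bisect` performs the ε-approximate binary search on [Wtot/P, Wtot].

```c
/* Greedy PROBE(B): can the chain be split into at most P consecutive parts,
   each of weight <= B?  Each processor takes the longest prefix that fits,
   found by binary search on the prefix-sum array.  (This simple search over
   the full range costs O(P log N) per probe; restricting each search to a
   window of size N/P gives the O(P log(N/P)) bound quoted in the text.) */
int probe(const double *Wsum, int N, int P, double B) {
    int s = 0;                            /* index of the last task assigned */
    for (int p = 0; p < P && s < N; p++) {
        double target = Wsum[s] + B;      /* largest admissible prefix sum */
        int lo = s, hi = N;
        while (lo < hi) {                 /* largest j with Wsum[j] <= target */
            int mid = (lo + hi + 1) / 2;
            if (Wsum[mid] <= target) lo = mid; else hi = mid - 1;
        }
        if (lo == s) return 0;            /* a single task exceeds B */
        s = lo;
    }
    return s == N;                        /* feasible iff all N tasks placed */
}

/* epsilon-approximate bisection over the interval [Wtot/P, Wtot]. */
double bisect(const double *Wsum, int N, int P, double eps) {
    double LB = Wsum[N] / P, UB = Wsum[N];
    while (UB - LB > eps) {
        double B = (LB + UB) / 2;
        if (probe(Wsum, N, P, B)) UB = B; /* feasible: tighten from above */
        else LB = B;                      /* infeasible: tighten from below */
    }
    return UB;                            /* feasible, within eps of Bopt */
}
```

For example, for the weight chain ⟨2, 2, 3, 3⟩ and P = 2, Bopt = 6, and `bisect` converges to a feasible bottleneck value within ε of 6.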

Fig. 4. Bisection as an ε-approximation algorithm.

3.4.2. Nicol's algorithm

Nicol's algorithm [27] exploits the fact that any candidate B value is equal to the weight Wij of a subchain. A naive solution is to generate all subchain weights of the form Wij, sort them, and then use binary search to find the minimum Wab value for which PROBE(Wab) = TRUE. Nicol's algorithm efficiently searches for the earliest range Wab for which Bopt = Wab by considering each processor in order as a candidate bottleneck processor in an optimal mapping. Let Πopt be the optimal mapping constructed by greedy PROBE(Bopt), and let processor Pb be the first bottleneck processor, with load Bopt, in Πopt = ⟨s0, s1, …, sb, …, sP⟩. Under these assumptions, this greedy construction of Πopt ensures that each processor Pp preceding Pb is loaded as much as possible, with Lp < Bopt, for p = 1, 2, …, b − 1 in Πopt. Here, PROBE(Lp) = FALSE since Lp < Bopt, and PROBE(Lp + w_{sp+1}) = TRUE since adding one more task to processor Pp increases its load to Lp + w_{sp+1} > Bopt. Hence, if b = 1, then s1 is equal to the smallest index i1 such that PROBE(W_{1,i1}) = TRUE, and Bopt = B1 = W_{1,s1}. However, if b > 1, then because of the greedy choice property, P1 should be loaded as much as possible without exceeding Bopt = Bb < B1, which implies that s1 = i1 − 1 and hence L1 = W_{1,i1−1}. If b = 2, then s2 is equal to the smallest index i2 such that PROBE(W_{i1,i2}) = TRUE, and Bopt = B2 = W_{i1,i2}. If b > 2, then s2 = i2 − 1. We iterate for b = 1, 2, …, P − 1, computing ib as the smallest index for which PROBE(W_{ib−1,ib}) = TRUE, and save Bb = W_{ib−1,ib}, with iP = N. Finally, the optimal bottleneck value is selected as Bopt = min_{1≤b≤P} Bb.

Fig. 5 illustrates Nicol's algorithm. As seen in this figure, given ib−1, ib is found by performing a binary search over all subchain weights of the form W_{ib−1,j}, for ib−1 ≤ j ≤ N, in the bth iteration of the for-loop. Hence, Nicol's algorithm performs O(log N) PROBE calls to find ib at iteration b, and each probe call costs O(P log(N/P)). Thus, the cost of computing an individual Bb value is O(P log N log(N/P)). Since P − 1 such Bb values are computed, the overall complexity of Nicol's algorithm is O(N + P^2 log N log(N/P)), where the O(N) cost comes from the initial prefix-sum operation on W.

Two possible implementations of Nicol's algorithm are presented in Fig. 5. Fig. 5(a) illustrates a straightforward implementation, whereas Fig. 5(b) illustrates a careful implementation, which maintains and uses the information from previous probes to answer probes without calling the PROBE function. As seen in Fig. 5(b), this information is efficiently maintained as an undetermined bottleneck-value range (LB, UB), which is dynamically refined after each probe. Any bottleneck value encountered outside the current range is immediately accepted or rejected. Although this simple scheme does not improve the asymptotic complexity of the algorithm, it drastically reduces the number of probes, as discussed in Section 6.
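The careful implementation of Fig. 5(b) can be sketched as follows; again this is an illustrative reconstruction rather than the authors' code (it repeats the greedy probe from the bisection sketch so that it is self-contained). Probes with B ≥ UB are accepted and probes with B ≤ LB (or B < Wtot/P) are rejected without calling PROBE, and the undetermined range (LB, UB) is refined after each real probe.

```c
/* Greedy probe, as in the bisection sketch: feasible iff the chain splits
   into at most P consecutive parts of weight <= B (binary search on the
   prefix-sum array Wsum, with Wsum[0] = 0). */
static int probe(const double *Wsum, int N, int P, double B) {
    int s = 0;
    for (int p = 0; p < P && s < N; p++) {
        double target = Wsum[s] + B;
        int lo = s, hi = N;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (Wsum[mid] <= target) lo = mid; else hi = mid - 1;
        }
        if (lo == s) return 0;
        s = lo;
    }
    return s == N;
}

/* Nicol's algorithm with dynamic bottleneck-value bounding. */
double nicol(const double *Wsum, int N, int P) {
    double LB = 0.0;          /* largest bottleneck value known infeasible */
    double UB = Wsum[N];      /* smallest bottleneck value known feasible  */
    double Bopt = Wsum[N];
    int s = 0;                /* prefix index: processor b starts at task s+1 */
    for (int b = 1; b < P; b++) {
        /* binary search for the smallest i_b with W_{s+1..i_b} feasible */
        int lo = s + 1, hi = N;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            double B = Wsum[mid] - Wsum[s];
            int feasible;
            if (B >= UB) feasible = 1;                 /* answered for free */
            else if (B <= LB || B < Wsum[N] / P) feasible = 0;
            else if ((feasible = probe(Wsum, N, P, B))) UB = B;
            else LB = B;
            if (feasible) hi = mid; else lo = mid + 1;
        }
        double Bb = Wsum[lo] - Wsum[s];                /* candidate B_b */
        if (Bb < Bopt) Bopt = Bb;
        if (Bb < UB) UB = Bb;                          /* B_b is feasible */
        s = lo - 1;                                    /* s_b = i_b - 1 */
    }
    double BP = Wsum[N] - Wsum[s];                     /* B_P, with i_P = N */
    return BP < Bopt ? BP : Bopt;
}
```

For the weight chain ⟨2, 2, 3, 3⟩ with P = 2 this returns 6 (partition ⟨2, 2⟩ | ⟨3, 3⟩), and for ⟨1, 2, 3, 4, 5⟩ with P = 3 it returns 6 (partition ⟨1, 2, 3⟩ | ⟨4⟩ | ⟨5⟩).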

4. Proposed CCP algorithms

In this section, we present the proposed methods. First, we describe how to bound the separator indices for an optimal solution to reduce the search space. Then, we show how this technique can be used to improve the performance of the dynamic-programming algorithm. We continue with our discussion on improving the MS algorithm, and propose a novel refinement algorithm, which we call the bidding algorithm. Finally, we discuss the parametric search methods, proposing improvements for the bisection and Nicol's algorithms.

4.1. Restricting the search space

Our proposed CCP algorithms exploit lower and upper bounds on the optimal bottleneck value to restrict


Fig. 5. Nicol's [27] algorithm: (a) straightforward implementation; (b) careful implementation with dynamic bottleneck-value bounding.


the search space for the separator values as a preprocessing step. Natural lower and upper bounds for the optimal bottleneck value Bopt of a given CCP problem instance (W, N, P) are LB = max{B*, wmax} and UB = B* + wmax, respectively, where B* = Wtot/P. Since wmax < B* in coarse-grain parallelization of most real-world applications, our presentation will be for wmax < B* = LB, even though all results are valid when B* is replaced with max{B*, wmax}. The following lemma describes how to use these natural bounds on Bopt to restrict the search space for the separator values.

Lemma 4.1. For a given CCP problem instance (W, N, P), if Bf is a feasible bottleneck value in the range [B*, B* + wmax], then there exists a partition Π = ⟨s0, s1, …, sP⟩ of cost C(Π) ≤ Bf with SLp ≤ sp ≤ SHp for 1 ≤ p < P, where SLp and SHp are, respectively, the smallest and largest indices such that

W_{1,SLp} ≥ p(B* − wmax(P − p)/P) and
W_{1,SHp} ≤ p(B* + wmax(P − p)/P).

Proof. Let Bf = B* + w, where 0 ≤ w < wmax. Partition Π can be constructed by PROBE(Bf), which loads the first p processors as much as possible, subject to Lq ≤ Bf for q = 1, 2, …, p. In the worst case, w_{sp+1} = wmax for each of the first p processors. Thus, we have W_{1,sp} ≥ f(w) = p(B* + w − wmax) for p = 1, 2, …, P − 1. However, it should be possible to divide the remaining subchain T_{sp+1,N} into P − p parts without exceeding Bf, i.e., W_{sp+1,N} ≤ (P − p)(B* + w). Thus, we also have W_{1,sp} ≥ g(w) = Wtot − (P − p)(B* + w). Note that f(w) is an increasing function of w, whereas g(w) is a decreasing function of w. The minimum of max{f(w), g(w)} is at the intersection of f(w) and g(w), so that W_{1,sp} ≥ p(B* − wmax(P − p)/P).

To prove the upper bounds, we can start with W_{1,sp} ≤ f(w) = p(B* + w), which holds when Lq = B* + w for q = 1, …, p. The condition W_{sp+1,N} ≥ (P − p)(B* + w − wmax), however, ensures the feasibility of Bf = B* + w, since PROBE(Bf) can always load each of the remaining (P − p) processors with B* + w − wmax. Thus, we also have W_{1,sp} ≤ g(w) = Wtot − (P − p)(B* + w − wmax). Here, f(w) is an increasing function of w, whereas g(w) is a decreasing function of w, which yields W_{1,sp} ≤ p(B* + wmax(P − p)/P). □
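For completeness, the intersection point used in the first part of the proof can be worked out explicitly (using Wtot = P B*):

```latex
\begin{align*}
f(w) &= p\,(B^{*} + w - w_{\max}), \\
g(w) &= W_{\mathrm{tot}} - (P-p)(B^{*} + w) = p\,B^{*} - (P-p)\,w .
\end{align*}
Setting $f(w) = g(w)$ gives
\begin{align*}
p\,w - p\,w_{\max} = -(P-p)\,w
\;\Longrightarrow\;
P\,w = p\,w_{\max}
\;\Longrightarrow\;
w = \frac{p\,w_{\max}}{P},
\end{align*}
and substituting back,
\begin{align*}
f\!\left(\frac{p\,w_{\max}}{P}\right)
 = p\left(B^{*} - w_{\max}\,\frac{P-p}{P}\right),
\end{align*}
which is exactly the lower bound on $W_{1,SL_p}$ stated in Lemma~4.1.
```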

Corollary 4.2. The separator range weights are ΔWp = Σ_{i=SLp}^{SHp} wi = W_{1,SHp} − W_{1,SLp} = 2 wmax p(P − p)/P, with a maximum value of Pwmax/2 at p = P/2.

Applying this corollary requires finding wmax, which entails an overhead equivalent to that of the prefix-sum operation and hence should be avoided. In this work, we adopt a practical scheme to construct the bounds on the separator indices. We run the RB heuristic to find a


hopefully good bottleneck value BRB, and use BRB as an upper bound on bottleneck values, i.e., UB = BRB. Then, we run LR-PROBE(BRB) and RL-PROBE(BRB) to construct two mappings Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨c2_0, c2_1, …, c2_P⟩ with C(Π1), C(Π2) ≤ BRB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered as the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left. That is, RL-PROBE assigns subchains from the right end towards the left end of the task chain to processors in the order PP, PP−1, …, P1. From these two mappings, lower and upper bound values for the sp separator indices are constructed as SLp = c2_p and SHp = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings Π3 = ⟨c3_0, c3_1, …, c3_P⟩ and Π4 = ⟨h4_0, h4_1, …, h4_P⟩, and then defining SLp = max{SLp, c3_p} and SHp = min{SHp, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.
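This bound construction can be sketched as follows. The helper names are hypothetical and this is an illustrative reconstruction, not the authors' code: LR-PROBE loads maximal prefixes left to right and records the separators it produces, RL-PROBE is its right-to-left dual, and the four resulting partitions give SLp = max{c2_p, c3_p} and SHp = min{h1_p, h4_p}, as in Corollary 4.5.

```c
/* Largest j in [lo, hi] with Wsum[j] <= target (Wsum = prefix sums, Wsum[0] = 0). */
static int last_le(const double *Wsum, int lo, int hi, double target) {
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (Wsum[mid] <= target) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Smallest j in [lo, hi] with Wsum[j] >= target. */
static int first_ge(const double *Wsum, int lo, int hi, double target) {
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (Wsum[mid] >= target) hi = mid; else lo = mid + 1;
    }
    return lo;
}

/* LR-PROBE(B): sep[p] = largest s_p such that each of the first p processors
   stays within B; sep[0] = 0, sep[P] = N. */
static void lr_probe(const double *Wsum, int N, int P, double B, int *sep) {
    sep[0] = 0;
    for (int p = 1; p < P; p++)
        sep[p] = last_le(Wsum, sep[p - 1], N, Wsum[sep[p - 1]] + B);
    sep[P] = N;
}

/* RL-PROBE(B): the right-to-left dual; sep[p] = smallest admissible s_p. */
static void rl_probe(const double *Wsum, int N, int P, double B, int *sep) {
    sep[P] = N;
    for (int p = P - 1; p >= 1; p--)
        sep[p] = first_ge(Wsum, 0, sep[p + 1], Wsum[sep[p + 1]] - B);
    sep[0] = 0;
}

/* Separator-index bounds of Corollary 4.5; Bf is a feasible upper bound on
   Bopt (e.g. the RB heuristic's bottleneck value) and B* = Wtot / P.  B* may
   be infeasible, but the greedy probes still produce partitions. */
void separator_bounds(const double *Wsum, int N, int P, double Bf,
                      int *SL, int *SH) {
    int h1[P + 1], c2[P + 1], c3[P + 1], h4[P + 1];   /* C99 VLAs */
    double Bstar = Wsum[N] / P;
    lr_probe(Wsum, N, P, Bf, h1);      /* Pi_1 */
    rl_probe(Wsum, N, P, Bf, c2);      /* Pi_2 */
    lr_probe(Wsum, N, P, Bstar, c3);   /* Pi_3 */
    rl_probe(Wsum, N, P, Bstar, h4);   /* Pi_4 */
    for (int p = 1; p < P; p++) {
        SL[p] = c2[p] > c3[p] ? c2[p] : c3[p];
        SH[p] = h1[p] < h4[p] ? h1[p] : h4[p];
    }
}
```

For the weight chain ⟨2, 2, 3, 3⟩ with P = 2 and Bf = 6, the bounds pin the single separator down to SL1 = SH1 = 2, which is indeed the optimal split ⟨2, 2⟩ | ⟨3, 3⟩.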

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value Bf, let Π1 = ⟨h1_0, h1_1, …, h1_P⟩ and Π2 = ⟨c2_0, c2_1, …, c2_P⟩ be the partitions constructed by LR-PROBE(Bf) and RL-PROBE(Bf), respectively. Then, any partition Π = ⟨s0, s1, …, sP⟩ of cost C(Π) = B ≤ Bf satisfies c2_p ≤ sp ≤ h1_p.

Proof. By the property of LR-PROBE(Bf), h1_p is the largest index such that T_{1,h1_p} can be partitioned into p parts without exceeding Bf. If sp > h1_p, then the bottleneck value will exceed Bf, and thus B. By the property of RL-PROBE(Bf), c2_p is the smallest index such that T_{c2_p,N} can be partitioned into P − p parts without exceeding Bf. If sp < c2_p, then the bottleneck value will exceed Bf, and thus B. □

Lemma 4.4. For a given CCP problem instance $(\mathcal{W}, N, P)$, let $\Pi_3 = \langle c^3_0, c^3_1, \ldots, c^3_P \rangle$ and $\Pi_4 = \langle h^4_0, h^4_1, \ldots, h^4_P \rangle$ be the partitions constructed by LR-PROBE($B^*$) and RL-PROBE($B^*$), respectively. Then, for any feasible bottleneck value $B_f$, there exists a partition $\Pi = \langle s_0, s_1, \ldots, s_P \rangle$ of cost $C(\Pi) \le B_f$ that satisfies $c^3_p \le s_p \le h^4_p$.

Proof. Consider the partition $\Pi = \langle s_0, s_1, \ldots, s_P \rangle$ constructed by LR-PROBE($B_f$). This partition already satisfies the lower bounds, i.e., $s_p \ge c^3_p$. Assume $s_p > h^4_p$; then the partition $\Pi'$ obtained by moving $s_p$ back to $h^4_p$ also yields a partition of cost $C(\Pi') \le B_f$, since $T_{h^4_p+1,N}$ can be partitioned into $P-p$ parts without exceeding $B^*$. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions of cost $\le B_f$ within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance $(\mathcal{W}, N, P)$ and a feasible bottleneck value $B_f$, let $\Pi_1 = \langle h^1_0, h^1_1, \ldots, h^1_P \rangle$, $\Pi_2 = \langle c^2_0, c^2_1, \ldots, c^2_P \rangle$, $\Pi_3 = \langle c^3_0, c^3_1, \ldots, c^3_P \rangle$, and $\Pi_4 = \langle h^4_0, h^4_1, \ldots, h^4_P \rangle$ be the partitions constructed by LR-PROBE($B_f$), RL-PROBE($B_f$), LR-PROBE($B^*$), and RL-PROBE($B^*$), respectively. Then, for any feasible bottleneck value $B$ in the range $[B^*, B_f]$, there exists a partition $\Pi = \langle s_0, s_1, \ldots, s_P \rangle$ of cost $C(\Pi) \le B$ with $SL_p \le s_p \le SH_p$ for $1 \le p < P$, where $SL_p = \max\{c^2_p, c^3_p\}$ and $SH_p = \min\{h^1_p, h^4_p\}$.

Corollary 4.6. The separator range weights become $\Delta W_p = 2 \min\{p, P-p\}\, w_{max}$ in the worst case, with a maximum value of $P\, w_{max}$ at $p = P/2$.

Lemma 4.1 and Corollary 4.5 lead to the following theorem, since $B^* \le B_{opt} \le B^* + w_{max}$.

Theorem 4.7. For a given CCP problem instance $(\mathcal{W}, N, P)$ and $SL_p$ and $SH_p$ index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition $\Pi_{opt} = \langle s_0, s_1, \ldots, s_P \rangle$ with $SL_p \le s_p \le SH_p$ for $1 \le p < P$.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the ranges produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however; the practical scheme normally finds much better bounds, since order in the chain usually prevents the worst-case behavior and $B_{RB} < B^* + w_{max}$. Experimental results in Section 6 justify this expectation.

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size $\Delta S_p = SH_p - SL_p + 1$ denotes the number of tasks within the $p$th range $[SL_p, SH_p]$. Miguet and Pierson [24] propose the model $w_i = \Theta(w_{avg})$ for $i = 1, 2, \ldots, N$ to prove that their H1 and H2 heuristics allocate $\Theta(N/P)$ tasks to each processor, where $w_{avg} = W_{tot}/N$ is the average task weight. This assumption means that the weight of each task is not too far from the average task weight. Using Corollaries 4.2 and 4.5, this model induces $\Delta S_p = O(P\, w_{max}/w_{avg})$. Moreover, this model can be exploited to induce the optimistic bound $\Delta S_p = O(P)$. However,


we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from $w_{avg}$. Here, we establish a looser and more realistic model on task weights: for any subchain $T_{i,j}$ with weight $W_{i,j}$ sufficiently larger than $w_{max}$, the average task weight within $T_{i,j}$ is $O(w_{avg})$; that is, $\Delta_{i,j} = j - i + 1 = O(W_{i,j}/w_{avg})$. This model, referred to here as model M, directly induces $\Delta S_p = O(P\, w_{max}/w_{avg})$, since $\Delta W_p \le \Delta W_{P/2} = P\, w_{max}/2$ for $p = 1, 2, \ldots, P-1$.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits the bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index-bound arrays, each of size $P$, computed according to Corollary 4.5 with $B_f = B_{RB}$. Note that $SL_P = SH_P = N$, since only $B[P,N]$ needs to be computed in the last row. As seen in Fig. 6, only the $B^p_j$ values for $j = SL_p, SL_p+1, \ldots, SH_p$ are computed at each row $p$, by exploiting Corollary 4.5, which ensures the existence of an optimal partition $\Pi_{opt} = \langle s_0, s_1, \ldots, s_P \rangle$ with $SL_p \le s_p \le SH_p$. These $B^p_j$ values thus suffice for the correct computation of the $B^{p+1}_i$ values for $i = SL_{p+1}, SL_{p+1}+1, \ldots, SH_{p+1}$ at the next row $p+1$.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the $j$-index may proceed beyond $SH_p$ to $SH_p+1$ within the repeat–until loop while computing $B^{p+1}_i$ with $SL_{p+1} \le i \le SH_{p+1}$ in two cases. In both cases, the functions $W_{j+1,i}$ and $B^p_j$ intersect in the open interval $(SH_p, SH_p+1)$, so that $B^p_{SH_p} < W_{SH_p+1,i}$ and $B^p_{SH_p+1} \ge W_{SH_p+2,i}$. In the first case, $i = SH_p+1$, so that $W_{j+1,i}$ and $B^p_j$ intersect in $(i-1, i)$, which implies that $B^{p+1}_i = W_{i-1,i}$ with $j^{p+1}_i = SL_p$, since $W_{i-1,i} < B^p_i$, as mentioned in Section 3.2. In the second case, $i > SH_p+1$, for which Corollary 4.5 guarantees that $B^{p+1}_i = W_{SL_p+1,i} \le B^p_{SH_p+1}$, and thus we can safely select $j^{p+1}_i = SL_p$. Note that $W_{SL_p+1,i} = B^p_{SH_p+1}$ may correspond to a case leading to another optimal partition with $j^{p+1}_i = s_{p+1} = SH_p+1$. As seen in Fig. 6, both cases are efficiently resolved simply by storing $\infty$ in $B^p_{SH_p+1}$ as a sentinel. Hence, in such cases, the condition $W_{SH_p+1,i} < B^p_{SH_p+1} = \infty$ in the if–then statement following the repeat–until loop is always true, so that the $j$-index automatically moves back to $SH_p$. The seemingly natural alternative of computing $B^p_{SH_p+1}$ for each row $p$ does not work, since the correct computation of $B^{p+1}_{SH_{p+1}+1}$ may necessitate more than one $B^p_j$ value beyond the $SH_p$ index bound.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a $P \times N$ matrix that stores the minimum $j^p_i$ index values defining the $B^p_i$ values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space to one containing at least one optimal solution. For generating all optimal partitions, the index bounds should instead be computed according to Lemma 4.3, since the search space restricted by Lemma 4.3 includes all optimal solutions.

The running time of the proposed DP+ algorithm is $O(N + P\log N) + \sum_{p=1}^{P} \Theta(\Delta S_p)$. Here, the $O(N)$ cost comes from the initial prefix-sum operation on the $\mathcal{W}$ array, and the $O(P\log N)$ cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, $\Delta S_p = O(P\, w_{max}/w_{avg})$, and hence the complexity is $O(N + P\log N + P^2\, w_{max}/w_{avg})$. The algorithm becomes linear in $N$ when the separator-index ranges do not overlap, which is guaranteed by the condition $w_{max} = O(2\, W_{tot}/P^2)$.

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for small-to-medium numbers of processors. The main difference between the MS and the bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value it encounters becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist-partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE($B^*$) as the initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

Fig. 7. Bidding algorithm.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value $B^*$, until it finds a feasible partition, which is also optimal. Consider a partition $\Pi_t = \langle s_0, s_1, \ldots, s_P \rangle$ constructed by PROBE($B_t$) for an infeasible $B_t$. After detecting the infeasibility of this $B_t$ value, the point is to determine the next larger bottleneck value $B$ to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE($B$) calls with $B > B_t$ will never be to the left of the respective separator indices of $\Pi_t$. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current $B_t$ value (i.e., $L_P > B_t$). To avoid missing the smallest feasible bottleneck value, the next larger $B$ value is selected as the minimum of the processor loads that would result from moving the end-index of each processor one position to the right. That is, the next larger $B$ value equals $\min\{\min_{1\le p<P}\{L_p + w_{s_p+1}\}, L_P\}$. Here, we call the value $L_p + w_{s_p+1}$ the bid of processor $\mathcal{P}_p$, which refers to the load of $\mathcal{P}_p$ if the first task $t_{s_p+1}$ of the next processor is augmented to $\mathcal{P}_p$. The bid of the last processor $\mathcal{P}_P$ is equal to the load of the remaining tasks. If the smallest bid $B$ comes from processor $\mathcal{P}_b$, probing with the new $B$ is performed only for the remaining processors $\langle \mathcal{P}_b, \mathcal{P}_{b+1}, \ldots, \mathcal{P}_P \rangle$ on the suffix $W_{s_{b-1}+1,N}$ of the $\mathcal{W}$ array.

The bidding algorithm is presented in Fig. 7. The innermost while loop implements a linear probing scheme, in which the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index $s_p$ is set for processor $\mathcal{P}_p$ during linear probing, the repeat–until loop terminates if it is not possible to partition the remaining subchain $T_{s_p+1,N}$ over the remaining $P-p$ processors without exceeding the current $B$ value, i.e., $rbid = L_r/(P-p) > B$, where $L_r$ denotes the weight of the remaining subchain. In this case, the next larger $B$ value is determined by considering the best bid among the first $p$ processors and $rbid$.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger $B$ value. Here, BIDS is an array of records of length $P$, where BIDS[$p$].B and BIDS[$p$].b store the best bid value of the first $p$ processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is $O(N + P\log N + P\, w_{max} + P^2 (w_{max}/w_{avg}))$. Here, the $O(N)$ cost comes from the initial prefix-sum operation on the $\mathcal{W}$ array, and the $O(P\log N)$ cost comes from the initial setting of the separators through binary search. The $B$ value is increased at most $B_{opt} - B^* < w_{max}$ times, and each time the next $B$ value can be computed in $O(P)$ time, which induces the $O(P\, w_{max})$ cost. The total area scanned by the separators is at most $O(P^2 (w_{max}/w_{avg}))$. For noninteger task weights, the complexity can reach $O(P\log N + P^3 (w_{max}/w_{avg}))$ in the worst case, which occurs when only one separator index moves one position to the right at each $B$ value. We should note here that using a min-heap for finding the next $B$ value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the $O(\log P)$ cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well but observed increased execution times.


4.4. Parametric search algorithms

In this work, we apply the theoretical findings of Section 4.1 to obtain an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with $B_f = B_{RB}$) to restrict the search space for the $s_p$ separator values during the binary searches on the $\mathcal{W}$ array. That is, BINSRCH($\mathcal{W}, SL_p, SH_p, Bsum$) in RPROBE searches $\mathcal{W}$ in the index range $[SL_p, SH_p]$ to find, via binary search, the index $SL_p \le s_p \le SH_p$ such that $\mathcal{W}[s_p] \le Bsum$ and $\mathcal{W}[s_p+1] > Bsum$. This scheme, together with Corollaries 4.2 and 4.6, reduces the complexity of an individual probe to $\sum_{p=1}^{P} \Theta(\log \Delta S_p) = O(P\log P + P\log(w_{max}/w_{avg}))$. Note that this complexity reduces to $O(P\log P)$ for sufficiently large $P$, where $P = \Omega(w_{max}/w_{avg})$. Figs. 8–10 illustrate the RPROBE algorithms tailored to the respective parametric-search algorithms.

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range $[B^*, B_{RB}]$, as opposed to $[B^*, W_{tot}]$. In this algorithm, if PROBE($B_t$) = TRUE, then the search space is restricted to values $B \le B_t$, and if PROBE($B_t$) = FALSE, then the search space is restricted to values $B > B_t$. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator-index bounds depending on the success or failure of the probes. Let $\Pi_t = \langle t_0, t_1, \ldots, t_P \rangle$ be the partition constructed by PROBE($B_t$). Any future PROBE($B$) call with $B \le B_t$ will set the $s_p$ indices with $s_p \le t_p$; thus the search space for $s_p$ can be restricted to the indices $\le t_p$. Similarly, any future PROBE($B$) call with $B \ge B_t$ will set the $s_p$ indices $\ge t_p$; thus the search space for $s_p$ can be restricted to the indices $\ge t_p$.

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in $\Theta(P)$ time by a for loop over the SL or SH array, depending on the failure or success, respectively, of RPROBE($B_t$). In our implementation, however, this update is achieved in $O(1)$ time through the pointer assignment $SL \leftarrow \Pi_t$ or $SH \leftarrow \Pi_t$, depending on the failure or success of RPROBE($B_t$).

Fig. 10. Nicol's algorithm with dynamic separator-index bounding.

Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting $\epsilon = 1$. As shown in Lemma 3.1, $B_{RB} < B^* + w_{max}$. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is $\log w_{max}$, and thus the overall complexity is $O(N + P\log N + \log(w_{max})(P\log P + P\log(w_{max}/w_{avg})))$ under model M. Here, the $O(N)$ cost comes from the initial prefix-sum operation on $\mathcal{W}$, and the $O(P\log N)$ cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm into an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to realizable bottleneck values, i.e., total weights of subchains of $\mathcal{W}$. This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range $[LB, UB]$. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps, finding the optimal bottleneck value.

After a probe RPROBE($B_t$), the current upper bound value UB is modified if RPROBE($B_t$) succeeds. Note that RPROBE($B_t$) not only determines the feasibility of $B_t$ but also constructs a partition $\Pi_t$ with $cost(\Pi_t) \le B_t$. Instead of reducing the upper bound UB to $B_t$, we can further reduce UB to the bottleneck value $B = cost(\Pi_t) \le B_t$ of the partition $\Pi_t$ constructed by RPROBE($B_t$). Similarly, the current lower bound LB is modified when RPROBE($B_t$) fails. In this case, instead of increasing the bound LB to $B_t$, we can exploit the partition $\Pi_t$ constructed by RPROBE($B_t$) to increase LB further, to the smallest realizable bottleneck value $B$ greater than $B_t$. Our bidding algorithm already describes how to compute

$$B = \min\Big\{\min_{1\le p<P}\{L_p + w_{s_p+1}\},\; L_P\Big\},$$

where $L_p$ denotes the load of processor $\mathcal{P}_p$ in $\Pi_t$. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and $N^2$. Assuming the size of the eliminated set can be anything between 1 and $N^2$, the expected complexity of


the algorithm is

$$T(N) = \frac{1}{N^2}\sum_{i=1}^{N^2} T(i) + O(P\log P + P\log(w_{max}/w_{avg})),$$

which has the solution $O(P\log P\log N + P\log N\log(w_{max}/w_{avg}))$. Here, $O(P\log P + P\log(w_{max}/w_{avg}))$ is the cost of a probe operation, and $\log N$ is the expected number of probes. Thus, the overall complexity becomes $O(N + P\log P\log N + P\log N\log(w_{max}/w_{avg}))$, where the $O(N)$ cost comes from the initial prefix-sum operation on the $\mathcal{W}$ array.

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE($W_{1,j}$) performed by Nicol's algorithm for processor $\mathcal{P}_1$ to find the smallest index $j = i_1$ such that PROBE($W_{1,j}$) = TRUE. Starting from a naive bottleneck-value range $(LB_0 = 0, UB_0 = W_{tot})$, the successes and failures of these probes narrow this range to $(LB_1, UB_1)$. That is, each PROBE($W_{1,j}$) = TRUE decreases the upper bound to $W_{1,j}$, and each PROBE($W_{1,j}$) = FALSE increases the lower bound to $W_{1,j}$. Clearly, we will have $(LB_1 = W_{1,i_1-1}, UB_1 = W_{1,i_1})$ at the end of this search for processor $\mathcal{P}_1$. Now consider the sequence of probes of the form PROBE($W_{i_1,j}$) performed for processor $\mathcal{P}_2$ to find the smallest index $j = i_2$ such that PROBE($W_{i_1,j}$) = TRUE. Our key observation is that the partition $\Pi_t = \langle 0, t_1, t_2, \ldots, t_{P-1}, N \rangle$ to be constructed by any PROBE($B_t$) with $LB_1 < B_t = W_{i_1,j} < UB_1$ will satisfy $t_1 = i_1 - 1$, since $W_{1,i_1-1} < B_t < W_{1,i_1}$. Hence, probes with $LB_1 < B_t < UB_1$ for processor $\mathcal{P}_2$ can be restricted to be performed on $\mathcal{W}_{i_1,N}$, where $\mathcal{W}_{i_1,N}$ denotes the $(N-i_1+1)$th suffix of the prefix-summed $\mathcal{W}$ array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let $\mathcal{T}^{P-p}_i$ denote the CCP subproblem of $(P-p)$-way partitioning of the $(N-i+1)$th suffix $T_{i,N} = \langle t_i, t_{i+1}, \ldots, t_N \rangle$ of the task chain $T$ onto the $(P-p)$th suffix $\langle \mathcal{P}_{p+1}, \mathcal{P}_{p+2}, \ldots, \mathcal{P}_P \rangle$ of the processor chain. Once the index $i_1$ for processor $\mathcal{P}_1$ is computed, the optimal bottleneck value $B_{opt}$ is defined by either $W_{1,i_1}$ or the bottleneck value of an optimal $(P-1)$-way partitioning of the suffix subchain $T_{i_1,N}$; that is, $B_{opt} = \min\{B_1 = W_{1,i_1}, C(\mathcal{T}^{P-1}_{i_1})\}$. Proceeding in this way, once the indices $\langle i_1, i_2, \ldots, i_p \rangle$ for the first $p$ processors $\langle \mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_p \rangle$ are determined, $B_{opt} = \min\{\min_{1\le b\le p}\{B_b = W_{i_{b-1},i_b}\}, C(\mathcal{T}^{P-p}_{i_p})\}$.

This approach is presented in Fig. 10. At the $b$th iteration of the outer for loop, given $i_{b-1}$, $i_b$ is found in the inner while loop by conducting probes on $\mathcal{W}_{i_{b-1},N}$ to compute $B_b = W_{i_{b-1},i_b}$. The dynamic bounds on the separator indices are exploited in two distinct ways, according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted; that is, given $i_{b-1}$, the binary search for $i_b$ over all subchain weights of the form $W_{i_{b-1}+1,j}$ for $i_{b-1} < j \le N$ is restricted to the $W_{i_{b-1}+1,j}$ values with $SL_b \le j \le SH_b$.

Under model M, the complexity of this algorithm is $O(N + P\log N + w_{max}(P\log P + P\log(w_{max}/w_{avg})))$ for integer task weights, because the number of probes cannot exceed $w_{max}$, since there are at most $w_{max}$ distinct bound values in the range $[B^*, B_{RB}]$. For noninteger task weights, the complexity can be given as $O(N + P\log N + w_{max}(P\log P)^2 + w_{max} P^2 \log P \log(w_{max}/w_{avg}))$, since the algorithm makes $O(w_{max} P\log P)$ probes. Here, the $O(N)$ cost comes from the initial prefix-sum operation on $\mathcal{W}$, and the $O(P\log N)$ cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form $y = Ax$ in an iterative solver, where $A$ is an $N \times N$ sparse matrix and $y$ and $x$ are $N$-vectors. In RS, processor $\mathcal{P}_p$ owns the $p$th row stripe $A^r_p$ of $A$ and is responsible for computing $y_p = A^r_p x$, where $y_p$ is the $p$th stripe of vector $y$. In CS, processor $\mathcal{P}_p$ owns the $p$th column stripe $A^c_p$ of $A$ and is responsible for computing $y^p = A^c_p x$, where $y = \sum_{p=1}^{P} y^p$. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme, to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus they can also be considered as pre- and post-communication schemes, respectively. In RS, each task $t_i \in T$ corresponds to the


atomic task of computing the inner product of row $i$ of matrix $A$ with the column vector $x$. In CS, each task $t_i \in T$ corresponds to the atomic task of computing the sparse DAXPY operation $y = y + x_i a_{*i}$, where $a_{*i}$ is the $i$th column of $A$. Each nonzero entry in a row (column) of $A$ incurs a multiply-and-add operation; thus the computational load $w_i$ of task $t_i$ is the number of nonzero entries in row $i$ (column $i$) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing the communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements, due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix $A$ in row-major order, where $NZ = W_{tot}$ denotes the total number of nonzeros of $A$. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length $N+1$ stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight $W_{i,j}$ can be efficiently computed as $W_{i,j} = ROW[j+1] - ROW[i]$ in $O(1)$ time, without any preprocessing overhead. CCS is similar to CRS, with rows and columns interchanged; thus $W_{i,j}$ is computed as $W_{i,j} = COL[j+1] - COL[i]$.

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight $w_i$ of atomic task $t_i$ representing column $i$ of matrix $A$. The initial prefix-sum operation can still be avoided by computing a subchain weight $W_{i,j}$ as $W_{i,j} = COL[j+1] - COL[i] + 3(j-i+1)$ in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing the pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image-space and object-space parallelism, also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first


parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, the screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids, due to the uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting shared primitives multiple times while adding up the mesh-cell weights.

the decomposition is expected to minimize the commu-nication overhead due to the shared primitives 1Ddecomposition ie horizontal or vertical striping of thescreen suffers from unscalability A Hilbert space-fillingcurve [31] is widely used for 2D decomposition of 2Dnonuniform workloads In this scheme [20] the 2Dcoarse mesh superimposed on the screen is traversedaccording to the Hilbert curve to map the 2D coarsemesh to a 1D chain of mesh cells The load-balancingproblem in this decomposition scheme then reduces tothe CCP problem Using a Hilbert curve as the space-

filling curve is an implicit effort towards reducing thetotal perimeter since the Hilbert curve avoids jumpsduring the traversal of the 2D coarse mesh Note thatthe 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic usedfor computing weights of the coarse-mesh cells

6 Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedrons. Triangular faces of tetrahedrons constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed both for blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 x 256, 512 x 512, and 1024 x 1024 superimposed on a screen of resolution 1024 x 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution, because the zero-weight tasks can be compressed to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP

algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N - P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sorevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in

ARTICLE IN PRESS

Table 2

Properties of the test set

Sparse-matrix (SpM) dataset:

Name     No. of tasks N   Total Wtot    wavg   wmin   wmax   SpMxV time (ms)
NL             7,039        105,089    14.93      1    361      22.55
cre-d          8,926        372,266    41.71      1    845      72.20
CQ9            9,278        221,590    23.88      1    702      45.90
ken-11        14,694         82,454     5.61      2    243      19.65
mod2          34,774        604,910    17.40      1    941     124.05
world         34,506        582,064    16.87      1    972     119.45

Direct volume rendering (DVR) dataset:

Name       No. of tasks N   Total Wtot    wavg    wmin      wmax
blunt256       17,303          303 K     17.54   0.020   1590.97
blunt512       93,231          314 K      3.36   0.004    661.50
blunt1024     372,824          352 K      0.94   0.001    411.04
post256        19,653          495 K     25.19   0.077   3245.50
post512       134,950          569 K      4.22   0.015   1092.00
post1024      539,994          802 K      1.49   0.004   1546.78

Task weights are per row/column for the SpM dataset and per task for the DVR dataset; the SpMxV column gives the execution time of a single sparse-matrix vector multiplication.

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974-996

Figs. 5(a) and (b), respectively. Abbreviations ending with "+" represent our improved versions of these algorithms; that is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms for the SpM dataset, with ε = 1, because the workload arrays in the SpM dataset are integer valued. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms for the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics

and exact algorithms. In this table, percent load-imbalance values are computed as 100 (B - B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between the exact algorithms and the heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges, normalized with respect to N, i.e., (1/N) Sum_{p=1}^{P-1} dS_p, where dS_p = SH_p - SL_p + 1. For each CCP instance, (N - P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index

bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N - P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for the separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P <= 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms

depends on two factors: the number of probes and the cost of each probe. The dynamic index-bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. Column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns

shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B_RB instead of Wtot


Table 3

Percent load-imbalance values

Sparse-matrix dataset:

Name     P      H1      H2      RB     OPT
NL       16    2.60    2.44    1.20    0.35
         32    5.02    5.75    3.44    0.95
         64    8.95    9.01    5.60    2.37
        128   33.13   27.16   22.78    4.99
        256   69.55   69.55   60.78   14.25
cre-d    16    2.27    0.98    0.53    0.45
         32    4.19    4.42    3.74    1.03
         64    7.12    4.92    4.34    1.73
        128   25.57   18.73   16.70    2.88
        256   37.54   26.81   35.20   10.95
CQ9      16    1.85    1.85    0.58    0.58
         32    5.65    2.88    2.24    0.90
         64   13.25   11.49    7.64    1.43
        128   33.96   32.22   22.34    3.51
        256   58.62   58.62   58.62   14.72
ken-11   16    3.74    2.01    0.98    0.21
         32    3.74    4.67    3.74    1.18
         64   13.17   13.17   13.17    1.29
        128   13.17   16.89   13.17    6.80
        256   50.99   50.99   50.99    7.11
mod2     16    0.06    0.06    0.06    0.03
         32    0.19    0.19    0.19    0.07
         64    7.42    2.72    2.18    0.18
        128   16.15    6.29    2.46    0.41
        256   19.47   19.47   18.92    1.23
world    16    0.27    0.18    0.09    0.04
         32    1.03    0.37    0.27    0.08
         64    4.73    4.73    4.73    0.28
        128    6.37    6.37    6.37    0.76
        256   27.99   27.41   27.41    1.11
Averages over P:
         16    1.76    1.15    0.74    0.36
         32    3.99    3.39    2.88    0.76
         64    9.01    6.78    5.69    1.43
        128   19.97   17.66   14.09    3.36
        256   43.89   38.91   36.04    9.18

DVR dataset:

Name       P      H1      H2      RB     OPT
blunt256   16    4.60    1.20    0.49    0.34
           32    6.93    2.61    1.94    1.12
           64   14.52    9.44    9.44    2.31
          128   38.25   24.39   16.67    4.82
          256   96.72   37.03   34.21   34.21
blunt512   16    0.95    0.98    0.98    0.16
           32    1.38    1.38    1.18    0.33
           64    2.87    2.69    1.66    0.53
          128    5.62    8.45    4.62    0.97
          256   14.18   14.33    9.34    2.28
blunt1024  16    0.94    0.57    0.95    0.10
           32    1.89    1.21    0.97    0.14
           64    4.99    2.16    1.44    0.26
          128   10.06    4.25    2.47    0.57
          256   19.65    9.68    9.68    0.98
post256    16    1.10    1.43    0.76    0.56
           32    3.23    3.98    3.23    1.11
           64   17.28   11.04   10.90    3.10
          128   45.35   29.09   29.09    8.29
          256   67.86   67.86   67.86   67.86
post512    16    0.49    1.25    0.33    0.33
           32    0.94    1.61    0.90    0.58
           64    4.85    5.33    4.85    0.94
          128   18.03   14.55   10.15    1.72
          256   30.03   25.29   25.29    3.73
post1024   16    0.70    0.70    0.53    0.20
           32    1.49    1.49    1.41    0.54
           64    2.85    2.85    1.49    0.91
          128   13.15   11.79    9.10    1.11
          256   40.50   13.37   14.19    2.54
Averages over P:
           16    1.44    1.01    1.04    0.28
           32    2.64    2.05    1.61    0.64
           64    7.89    5.58    4.96    1.31
          128   21.74   15.42   12.02    2.91
          256   44.82   27.93   26.76   18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only a minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percentages of single SpMxV times. For the DVR dataset, actual execution times (in ms) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification, so that the improvements can be seen easily. In both tables, BID is listed separately, since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS as


Table 4

Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset:

Name     P    (N-P)P/N      L1      L3      C5
NL       16     15.97      0.32    0.13   0.036
         32     31.86      1.23    0.58   0.198
         64     63.43      4.97    2.24   0.964
        128    125.69     19.81   19.60   4.305
        256    246.73     77.96   82.83  22.942
cre-d    16     15.97      0.16    0.04   0.019
         32     31.89      0.76    0.90   0.137
         64     63.55      2.89    1.85   0.610
        128    126.18     11.59   15.12   2.833
        256    248.69     46.66   57.54  16.510
CQ9      16     15.97      0.30    0.03   0.023
         32     31.89      1.21    0.34   0.187
         64     63.57      4.84    2.96   0.737
        128    126.25     19.20   18.66   3.121
        256    248.96     74.84   86.80  23.432
ken-11   16     15.98      0.28    0.11   0.024
         32     31.93      1.10    1.13   0.262
         64     63.73      4.42    7.34   0.655
        128    126.89     17.60   14.56   4.865
        256    251.56     69.18   85.44  13.076
mod2     16     15.99      0.16    0.00   0.002
         32     31.97      0.55    0.04   0.013
         64     63.88      2.14    1.22   0.109
        128    127.53      8.62    2.59   0.411
        256    254.12     34.49   39.02   2.938
world    16     15.99      0.15    0.01   0.004
         32     31.97      0.59    0.06   0.020
         64     63.88      2.30    2.81   0.158
        128    127.53      9.23    7.19   0.850
        256    254.11     36.97   53.15   2.635
Averages over P:
         16     15.97      0.21    0.06   0.018
         32     31.88      0.86    0.68   0.136
         64     63.52      3.43    2.63   0.516
        128    126.07     13.68   12.74   2.731
        256    248.26     54.28   57.55  14.089

DVR dataset:

Name       P    (N-P)P/N      L1      L3      C5
blunt256   16     15.99      0.38    0.07   0.003
           32     31.94      1.22    0.30   0.144
           64     63.77      5.10    3.81   0.791
          128    127.06     21.66   12.88   2.784
          256    252.23    101.68   50.48  10.098
blunt512   16     16.00      0.09    0.17   0.006
           32     31.99      0.35    0.26   0.043
           64     63.96      1.64    0.69   0.144
          128    127.83      6.65    4.59   0.571
          256    255.30     28.05   16.71   2.262
blunt1024  16     16.00      0.05    0.09   0.010
           32     32.00      0.18    0.18   0.018
           64     63.99      0.77    0.74   0.068
          128    127.96      3.43    2.53   0.300
          256    255.82     13.92   20.36   1.648
post256    16     15.99      0.35    0.04   0.021
           32     31.95      1.50    0.52   0.162
           64     63.79      6.12    4.26   0.932
          128    127.17     26.48   23.68   4.279
          256    252.68    124.76   90.48  16.485
post512    16     16.00      0.13    0.02   0.003
           32     31.99      0.56    0.14   0.063
           64     63.97      2.33    2.41   0.413
          128    127.88      9.33   10.15   1.714
          256    255.52     37.08   46.28   6.515
post1024   16     16.00      0.17    0.07   0.020
           32     32.00      0.54    0.27   0.087
           64     63.99      2.19    0.61   0.210
          128    127.97      8.77    9.48   0.918
          256    255.88     34.97   27.28   4.479
Averages over P:
           16     16.00      0.20    0.08   0.011
           32     31.98      0.73    0.28   0.096
           64     63.91      3.03    2.09   0.426
          128    127.65     12.72   10.55   1.428
          256    254.57     56.74   41.93   6.915

L1, L3, and C5 denote the separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5, respectively.


an exact algorithm for general workload arrays is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, a relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and that εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5

Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset:

Name     P     NC-    NC   NC+   εBS  εBS+   EBS
NL       16    177    21     7    17     6     7
         32    352    19     7    17     7     7
         64    720    27     5    16     7     6
        128   1450    40    14    17     8     8
        256   2824    51    14    16     8     8
cre-d    16    190    20     6    19     7     5
         32    393    25    11    19     9     8
         64    790    24    10    19     8     6
        128   1579    27    10    19     9     9
        256   3183    37    13    18     9     9
CQ9      16    182    19     3    18     6     5
         32    373    22     8    18     8     7
         64    740    29    12    18     8     8
        128   1492    40    12    17     8     8
        256   2971    50    14    18     9     9
ken-11   16    185    21     6    17     6     6
         32    364    20     6    17     6     6
         64    721    48     9    17     7     7
        128   1402    61     8    16     7     7
        256   2783    96     7    17     8     8
mod2     16    210    20     6    20     5     4
         32    432    26     8    19     5     5
         64    867    24     6    19     8     8
        128   1727    38     7    20     7     7
        256   3444    47     9    20     9     9
world    16    211    22     6    19     5     4
         32    424    26     5    19     6     6
         64    865    30    12    19     9     9
        128   1730    44    10    19     8     8
        256   3441    44    11    19    10    10
Averages over P:
         16    193   20.5   5.7  18.5   5.8   5.2
         32    390   23.0   7.5  18.3   6.8   6.5
         64    784   30.3   9.0  18.0   7.8   7.3
        128   1564   41.7  10.2  18.0   7.8   7.8
        256   3108   54.2  11.3  18.0   8.8   8.8

DVR dataset:

Name       P     NC-    NC   NC+   εBS+&BID    EBS
blunt256   16    202    18     6    16 + 1       9
           32    407    19     7    19 + 1      10
           64    835    18    10    21 + 1      12
          128   1686    25    15    22 + 1      13
          256   3556   203    62    23 + 1      11
blunt512   16    245    20     8    17 + 1       9
           32    502    24    13    18 + 1      10
           64   1003    23    11    18 + 1      12
          128   2023    26    12    20 + 1      13
          256   4051    23    12    21 + 1      13
blunt1024  16    271    21    10    16 + 1      13
           32    557    24    12    18 + 1      12
           64   1127    21    11    18 + 1      12
          128   2276    28    13    19 + 1      13
          256   4564    27    16    21 + 1      15
post256    16    194    22     9    18 + 1       7
           32    409    19     8    20 + 1      12
           64    823    17    11    22 + 1      12
          128   1653    19    13    22 + 1      14
          256   3345   191   102    23 + 1      13
post512    16    235    24     9    16 + 1      10
           32    485    23    12    18 + 1      11
           64    975    22    13    20 + 1      14
          128   1975    25    17    21 + 1      14
          256   3947    29    20    22 + 1      15
post1024   16    261    27    10    17 + 1      13
           32    538    23    13    19 + 1      14
           64   1090    22    12    17 + 1      14
          128   2201    25    15    21 + 1      16
          256   4428    28    16    22 + 1      16
Averages over P:
           16    235   22.0   8.7  16.7 + 1    10.2
           32    483   22.0  10.8  18.7 + 1    11.5
           64    976   20.5  11.3  19.3 + 1    12.7
          128   1969   24.7  14.2  20.8 + 1    13.8
          256   3982   83.5  38.0  22.0 + 1    13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix-sum time is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6

Partitioning times for the sparse-matrix dataset, as percentages of SpMxV times

Existing exact algorithms: DP, MS, εBS, NC. Proposed versions: DP+, MS+, εBS+, NC+. BID is the new bidding algorithm.

Name     P     H1    H2    RB |    DP    MS    εBS     NC |    DP+    MS+   εBS+    NC+ |    BID
NL       16  0.09  0.09  0.09 |    93   119   0.93   1.20 |   0.40   0.44   0.40   0.40 |   0.22
         32  0.18  0.18  0.13 |   177   194   1.77   2.17 |   1.73   1.73   0.80   0.80 |   0.44
         64  0.35  0.35  0.31 |   367   307   3.15   5.85 |   5.81   4.12   1.86   1.77 |   1.82
        128  0.89  0.84  0.71 |   748   485   6.30  17.12 |  19.87  26.92   4.26   7.32 |   4.21
        256  1.51  1.55  1.37 |  1461   757  11.09  40.31 |  91.80  96.41   9.80  15.70 |  21.06
cre-d    16  0.03  0.01  0.01 |    33    36   0.33   0.37 |   0.12   0.06   0.10   0.11 |   0.03
         32  0.04  0.06  0.06 |    71    61   0.65   0.93 |   0.47   1.20   0.29   0.33 |   0.10
         64  0.12  0.12  0.08 |   147    98   1.22   1.69 |   1.66   1.83   0.61   0.71 |   0.21
        128  0.28  0.29  0.14 |   283   150   2.41   3.59 |   5.47  10.94   1.68   1.68 |   0.61
        256  0.53  0.54  0.25 |   571   221   4.20  10.22 |  27.05  26.02   3.30   4.46 |   1.20
CQ9      16  0.04  0.04  0.04 |    59    73   0.48   0.59 |   0.20   0.13   0.15   0.13 |   0.17
         32  0.09  0.09  0.07 |   127   120   0.94   1.31 |   0.94   0.72   0.41   0.41 |   0.28
         64  0.20  0.17  0.15 |   248   195   1.85   3.18 |   3.01   4.47   1.00   1.44 |   0.68
        128  0.44  0.44  0.31 |   485   303   3.18   8.52 |   9.65  18.82   2.44   3.05 |   2.51
        256  0.92  0.92  0.63 |   982   469   6.51  20.33 |  69.56  72.20   5.32   7.45 |  14.92
ken-11   16  0.10  0.10  0.10 |   238   344   1.27   1.88 |   0.56   0.61   0.46   0.41 |   0.25
         32  0.20  0.20  0.15 |   501   522   2.29   3.10 |   3.82   4.78   1.22   1.17 |   1.63
         64  0.46  0.46  0.36 |   993   778   4.48  13.08 |   9.41  24.78   3.10   3.46 |   2.29
        128  1.17  1.17  0.71 |  1876  1139   9.21  39.59 |  50.13  50.03   6.41   6.62 |  17.40
        256  2.14  2.14  1.42 |  3827  1706  17.76 119.13 | 129.01 241.48  14.91  14.50 |  29.11
mod2     16  0.02  0.02  0.01 |    92   119   0.27   0.34 |   0.07   0.05   0.06   0.06 |   0.03
         32  0.04  0.04  0.02 |   188   192   0.51   0.78 |   0.19   0.18   0.16   0.15 |   0.07
         64  0.11  0.10  0.06 |   378   307   1.04   1.91 |   0.86   2.18   0.51   0.55 |   0.23
        128  0.24  0.23  0.11 |   751   482   2.28   5.92 |   2.61   3.72   1.06   1.23 |   0.53
        256  0.48  0.48  0.23 |  1486   756   4.26  11.14 |  12.11  50.04   2.91   3.43 |   3.66
world    16  0.02  0.02  0.02 |    99   122   0.29   0.33 |   0.08   0.05   0.07   0.08 |   0.03
         32  0.04  0.04  0.04 |   198   196   0.59   0.88 |   0.26   0.21   0.18   0.19 |   0.08
         64  0.12  0.11  0.09 |   385   306   1.16   1.70 |   1.09   4.84   0.64   0.54 |   0.35
        128  0.24  0.23  0.19 |   769   496   2.19   5.32 |   4.48  10.15   1.31   1.23 |   1.77
        256  0.47  0.47  0.36 |  1513   764   4.06  11.83 |  11.54  69.99   3.46   3.50 |   2.48
Averages over P:
         16  0.05  0.05  0.05 |   102   136   0.60   0.78 |   0.24   0.22   0.21   0.20 |   0.12
         32  0.10  0.10  0.08 |   210   214   1.12   1.53 |   1.23   1.47   0.51   0.51 |   0.43
         64  0.23  0.22  0.18 |   420   332   2.15   4.57 |   3.64   7.04   1.29   1.41 |   0.93
        128  0.54  0.53  0.36 |   819   509   4.26  13.34 |  15.37  20.10   2.86   3.52 |   4.51
        256  1.01  1.01  0.71 |  1640   779   7.98  35.49 |  56.84  92.69   6.62   8.17 |  12.07


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, a relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small to moderate P values (i.e., P <= 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P >= 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P <= 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7

Partitioning times (in ms) for the DVR dataset

Existing exact algorithms: DP, MS, NC. Proposed versions: DP+, MS+, NC+, EBS. BID is the new bidding algorithm. The prefix-sum time, given once per instance, is independent of P.

Name        P     H1    H2    RB |    DP     MS     NC |    DP+     MS+    NC+    EBS |    BID
blunt256 (prefix-sum 1.95)
           16   0.02  0.02  0.03 |    68     49   0.36 |   0.21    0.53   0.10   0.14 |   0.03
           32   0.05  0.05  0.05 |   141     77   0.77 |   0.76    0.76   0.21   0.27 |   0.24
           64   0.11  0.12  0.09 |   286    134   1.39 |   3.86    8.72   0.64   0.71 |   0.78
          128   0.24  0.25  0.20 |   581    206   3.66 |  12.37   29.95   1.78   1.84 |   2.33
          256   0.47  0.50  0.37 |  1139    296  52.53 |  42.89    0.78  13.96   3.11 |  56.69
blunt512 (prefix-sum 13.45)
           16   0.03  0.03  0.03 |   356    200   0.55 |   0.25    0.91   0.14   0.16 |   0.09
           32   0.07  0.07  0.07 |   792    353   1.31 |   1.23    2.13   0.44   0.43 |   0.22
           64   0.18  0.19  0.14 |  1688    593   2.65 |   3.68    9.28   0.94   1.10 |   0.45
          128   0.39  0.40  0.28 |  3469    979   5.84 |  15.01   46.72   2.29   2.63 |   1.65
          256   0.74  0.75  0.54 |  7040   1637   9.70 |  57.50  164.49   5.59   5.95 |  10.72
blunt1024 (prefix-sum 59.12)
           16   0.03  0.03  0.03 |  1455    780   0.75 |   0.93    4.77   0.25   0.28 |   0.19
           32   0.12  0.11  0.08 |  3251   1432   1.81 |   2.02    8.92   0.60   0.62 |   0.31
           64   0.27  0.27  0.19 |  6976   2353   3.50 |   7.35   22.28   1.27   1.37 |   1.39
          128   0.50  0.51  0.37 | 14337   3911   9.14 |  33.01   69.38   3.36   3.56 |  16.21
          256   0.97  1.00  0.68 | 29417   6567  16.59 | 180.09  538.10   8.48   8.98 |  16.29
post256 (prefix-sum 2.16)
           16   0.02  0.03  0.03 |    79     61   0.46 |   0.18    0.13   0.10   0.11 |   0.24
           32   0.05  0.05  0.05 |   157    100   0.77 |   1.05    1.74   0.26   0.31 |   0.76
           64   0.10  0.11  0.10 |   336    167   1.37 |   5.26    9.72   0.61   0.70 |   4.67
          128   0.25  0.26  0.18 |   672    278   2.90 |  22.94   55.10   1.65   1.77 |  23.96
          256   0.47  0.48  0.37 |  1371    338  45.16 |  84.78    0.77  13.04   3.62 | 299.32
post512 (prefix-sum 20.55)
           16   0.02  0.02  0.03 |   612    454   0.63 |   0.20    0.44   0.14   0.16 |   1.21
           32   0.07  0.07  0.07 |  1254    699   1.30 |   2.49    1.95   0.38   0.42 |   3.36
           64   0.18  0.18  0.13 |  2559   1139   2.58 |  16.93   38.73   0.97   1.07 |   8.53
          128   0.38  0.39  0.28 |  5221   1936   5.84 |  68.25  156.15   2.27   2.44 |  31.54
          256   0.70  0.73  0.53 | 10683   3354  12.67 | 256.48  726.54   5.44   5.39 | 140.88
post1024 (prefix-sum 69.09)
           16   0.03  0.04  0.03 |  2446   1863   0.91 |   2.88    5.73   0.17   0.21 |   2.63
           32   0.09  0.08  0.08 |  5056   2917   1.61 |  13.57   17.37   0.51   0.58 |  12.73
           64   0.25  0.26  0.17 | 10340   4626   3.56 |  35.05   33.25   1.35   1.47 |  32.53
          128   0.51  0.53  0.36 | 21102   7838   7.90 | 156.34  546.92   2.95   3.28 |  76.96
          256   0.95  0.98  0.67 | 43652  13770  16.30 | 764.59 1451.87   7.55   7.02 | 300.70

Averages normalized with respect to RB times (prefix-sum time not included):
           16   0.83  0.94  1.00 | 27863  18919  20.33 |  25.83   69.50   5.00   5.89 |  24.39
           32   1.10  1.06  1.00 | 23169  12156  18.47 |  47.37   72.82   5.83   6.46 |  39.02
           64   1.30  1.35  1.00 | 22636   9291  17.88 |  82.81  145.19   7.00   7.81 |  53.81
          128   1.35  1.39  1.00 | 22506   7553  20.46 | 168.36  481.19   8.60   9.31 |  86.81
          256   1.35  1.39  1.00 | 24731   6880  59.10 | 390.25  772.99  19.55  10.51 | 286.77

Averages normalized with respect to RB times (prefix-sum time included):
           16   1.00  1.00  1.00 |    32     23   1.08 |   1.04    1.09   1.01   1.02 |   1.03
           32   1.00  1.00  1.00 |    66     36   1.15 |   1.21    1.29   1.04   1.05 |   1.13
           64   1.00  1.00  1.00 |   135     58   1.27 |   1.97    2.90   1.11   1.12 |   1.55
          128   1.01  1.01  1.00 |   275     94   1.62 |   4.75   10.53   1.28   1.30 |   3.25
          256   1.02  1.02  1.00 |   528    142   7.98 |  14.64   13.71   2.95   1.55 |  26.73


for P <= 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for the separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130-149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641-645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554-1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48-57.
[5] U.V. Catalyurek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673-693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625-I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237-267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141-148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769-771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048-2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123-127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341-361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170-175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040-1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurc, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51-93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sorevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235-249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, in: Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550-564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23-32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75-84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119-134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295-306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322-1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288-299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865-881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068-1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256-266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78-86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).

ARTICLE IN PRESS

Fig. 5. Nicol's [27] algorithm: (a) straightforward implementation; (b) careful implementation with dynamic bottleneck-value bounding.

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

the search space for separator values as a preprocessing step. Natural lower and upper bounds for the optimal bottleneck value B_opt of a given CCP problem instance (W, N, P) are LB = max{B*, w_max} and UB = B* + w_max, respectively, where B* = W_tot/P. Since w_max < B* in the coarse-grain parallelization of most real-world applications, our presentation assumes w_max < B* = LB, even though all results remain valid when B* is replaced with max{B*, w_max}. The following lemma describes how to use these natural bounds on B_opt to restrict the search space for the separator values.
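As a concrete illustration of these bounds (our own sketch, not code from the paper; the function name and the toy weights are hypothetical), B*, LB, and UB follow directly from one prefix-sum pass:

```python
# Sketch (ours, not the paper's code) of the natural bounds on B_opt:
# B* = W_tot / P, LB = max{B*, w_max}, UB = B* + w_max. Toy weights.
from itertools import accumulate

def natural_bounds(w, P):
    """Return (B_star, LB, UB) for a task-weight list w and P processors."""
    W = [0] + list(accumulate(w))   # prefix sums: W[j] = w_1 + ... + w_j
    B_star = W[-1] / P              # ideal bottleneck value W_tot / P
    w_max = max(w)
    return B_star, max(B_star, w_max), B_star + w_max

print(natural_bounds([4, 1, 3, 2, 5, 2, 1], 3))  # (6.0, 6.0, 11.0)
```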

Lemma 4.1. For a given CCP problem instance (W, N, P), if B_f is a feasible bottleneck value in the range [B*, B* + w_max], then there exists a partition Π = ⟨s_0, s_1, ..., s_P⟩ of cost C(Π) ≤ B_f with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p and SH_p are, respectively, the smallest and largest indices such that

W_{1,SL_p} ≥ p(B* − w_max(P − p)/P)  and
W_{1,SH_p} ≤ p(B* + w_max(P − p)/P).

Proof. Let B_f = B* + w, where 0 ≤ w < w_max. Partition Π can be constructed by PROBE(B_f), which loads the first p processors as much as possible, subject to L_q ≤ B_f, for q = 1, 2, ..., p. In the worst case, w_{s_p+1} = w_max for each of the first p processors. Thus we have W_{1,s_p} ≥ f(w) = p(B* + w − w_max) for p = 1, 2, ..., P − 1. However, it should be possible to divide the remaining subchain T_{s_p+1,N} into P − p parts without exceeding B_f, i.e., W_{s_p+1,N} ≤ (P − p)(B* + w). Thus we also have W_{1,s_p} ≥ g(w) = W_tot − (P − p)(B* + w). Note that f(w) is an increasing function of w, whereas g(w) is a decreasing function of w. The minimum of max{f(w), g(w)} is at the intersection of f(w) and g(w), so that W_{1,s_p} ≥ p(B* − w_max(P − p)/P).

To prove the upper bounds, we can start with W_{1,s_p} ≤ f(w) = p(B* + w), which holds when L_q = B* + w for q = 1, ..., p. The condition W_{s_p+1,N} ≥ (P − p)(B* + w − w_max), however, ensures the feasibility of B_f = B* + w, since PROBE(B_f) can always load each of the remaining (P − p) processors with B* + w − w_max. Thus we also have W_{1,s_p} ≤ g(w) = W_tot − (P − p)(B* + w − w_max). Here, f(w) is an increasing function of w, whereas g(w) is a decreasing function of w, which yields W_{1,s_p} ≤ p(B* + w_max(P − p)/P). □

Corollary 4.2. The separator range weights are ΔW_p = Σ_{i=SL_p}^{SH_p} w_i = W_{1,SH_p} − W_{1,SL_p} = 2 w_max p(P − p)/P, with a maximum value of P w_max/2 at p = P/2.

Applying this corollary requires finding w_max, which entails an overhead equivalent to that of the prefix-sum operation and hence should be avoided. In this work, we adopt a practical scheme to construct the bounds on the separator indices. We run the RB heuristic to find a hopefully good bottleneck value B_RB and use B_RB as an upper bound for bottleneck values, i.e., UB = B_RB. Then, we run LR-PROBE(B_RB) and RL-PROBE(B_RB) to construct two mappings Π1 = ⟨h1_0, h1_1, ..., h1_P⟩ and Π2 = ⟨ℓ2_0, ℓ2_1, ..., ℓ2_P⟩ with C(Π1), C(Π2) ≤ B_RB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered as the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left; that is, RL-PROBE assigns subchains from the right end towards the left end of the task chain to processors in the order P_P, P_{P−1}, ..., P_1. From these two mappings, lower and upper bound values for the s_p separator indices are constructed as SL_p = ℓ2_p and SH_p = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings Π3 = ⟨ℓ3_0, ℓ3_1, ..., ℓ3_P⟩ and Π4 = ⟨h4_0, h4_1, ..., h4_P⟩, and then defining SL_p = max{SL_p, ℓ3_p} and SH_p = min{SH_p, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.
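The four probe sweeps of this practical scheme can be sketched as follows (a hedged Python rendering of ours, not the paper's pseudocode; `lr_probe`, `rl_probe`, and `separator_bounds` are our names, `B_rb` stands in for the RB heuristic's bottleneck value, and the toy weights in the usage line are hypothetical):

```python
# Our rendering of the practical separator-index bounding scheme.
from bisect import bisect_left, bisect_right
from itertools import accumulate

def lr_probe(W, P, B):
    """LR-PROBE: processors P_1..P_P take maximal subchains of weight <= B
    from the left; W is the prefix-sum array (W[0] = 0)."""
    s = [0]
    for _ in range(P):
        s.append(bisect_right(W, W[s[-1]] + B) - 1)
    return s

def rl_probe(W, P, B):
    """RL-PROBE (the dual): processors P_P..P_1 take maximal subchains
    of weight <= B from the right end."""
    s = [len(W) - 1]                       # s_P = N
    for _ in range(P - 1):
        s.insert(0, bisect_left(W, W[s[0]] - B))
    return [0] + s                         # s_0 = 0 by convention

def separator_bounds(w, P, B_rb):
    """SL_p = max{l2_p, l3_p}, SH_p = min{h1_p, h4_p}."""
    W = [0] + list(accumulate(w))
    B_star = W[-1] / P
    h1, l2 = lr_probe(W, P, B_rb), rl_probe(W, P, B_rb)
    l3, h4 = lr_probe(W, P, B_star), rl_probe(W, P, B_star)
    SL = [max(a, b) for a, b in zip(l2, l3)]
    SH = [min(a, b) for a, b in zip(h1, h4)]
    return SL, SH

print(separator_bounds([4, 1, 3, 2, 5, 2, 1], 3, 8))
# ([0, 2, 4, 7], [0, 3, 5, 7])
```

On this toy chain the bounds shrink the search for s_1 to [2, 3] and for s_2 to [4, 5].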

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π1 = ⟨h1_0, h1_1, ..., h1_P⟩ and Π2 = ⟨ℓ2_0, ℓ2_1, ..., ℓ2_P⟩ be the partitions constructed by LR-PROBE(B_f) and RL-PROBE(B_f), respectively. Then any partition Π = ⟨s_0, s_1, ..., s_P⟩ of cost C(Π) = B ≤ B_f satisfies ℓ2_p ≤ s_p ≤ h1_p.

Proof. By the property of LR-PROBE(B_f), h1_p is the largest index such that T_{1,h1_p} can be partitioned into p parts without exceeding B_f. If s_p > h1_p, then the bottleneck value will exceed B_f, and thus B. By the property of RL-PROBE(B_f), ℓ2_p is the smallest index such that T_{ℓ2_p+1,N} can be partitioned into P − p parts without exceeding B_f. If s_p < ℓ2_p, then the bottleneck value will exceed B_f, and thus B. □

Lemma 4.4. For a given CCP problem instance (W, N, P), let Π3 = ⟨ℓ3_0, ℓ3_1, ..., ℓ3_P⟩ and Π4 = ⟨h4_0, h4_1, ..., h4_P⟩ be the partitions constructed by LR-PROBE(B*) and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B_f, there exists a partition Π = ⟨s_0, s_1, ..., s_P⟩ of cost C(Π) ≤ B_f that satisfies ℓ3_p ≤ s_p ≤ h4_p.

Proof. Consider the partition Π = ⟨s_0, s_1, ..., s_P⟩ constructed by LR-PROBE(B_f). It is clear that this partition already satisfies the lower bounds, i.e., s_p ≥ ℓ3_p. Assume s_p > h4_p; then the partition Π′ obtained by moving s_p back to h4_p also yields a partition with cost C(Π′) ≤ B_f, since T_{h4_p+1,N} can be partitioned into P − p parts without exceeding B*. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions with cost ≤ B_f within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance (W, N, P) and a feasible bottleneck value B_f, let Π1 = ⟨h1_0, h1_1, ..., h1_P⟩, Π2 = ⟨ℓ2_0, ℓ2_1, ..., ℓ2_P⟩, Π3 = ⟨ℓ3_0, ℓ3_1, ..., ℓ3_P⟩, and Π4 = ⟨h4_0, h4_1, ..., h4_P⟩ be the partitions constructed by LR-PROBE(B_f), RL-PROBE(B_f), LR-PROBE(B*), and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B in the range [B*, B_f], there exists a partition Π = ⟨s_0, s_1, ..., s_P⟩ of cost C(Π) ≤ B with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p = max{ℓ2_p, ℓ3_p} and SH_p = min{h1_p, h4_p}.

Corollary 4.6. The separator range weights become ΔW_p = 2 min{p, (P − p)} w_max in the worst case, with a maximum value of P w_max at p = P/2.

Lemma 4.1 and Corollary 4.5 imply the following theorem, since B* ≤ B_opt ≤ B* + w_max.

Theorem 4.7. For a given CCP problem instance (W, N, P), and SL_p and SH_p index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition Π_opt = ⟨s_0, s_1, ..., s_P⟩ with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the separator range weights produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however; the practical scheme normally finds much better bounds, since order in the chain usually prevents the worst-case behavior, and B_RB < B* + w_max. Experimental results in Section 6 justify this expectation.

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size ΔS_p = SH_p − SL_p + 1 denotes the number of tasks within the pth range [SL_p, SH_p]. Miguet and Pierson [24] propose the model w_i = Θ(w_avg) for i = 1, 2, ..., N to prove that their H1 and H2 heuristics allocate Θ(N/P) tasks to each processor, where w_avg = W_tot/N is the average task weight. This assumption means that the weight of each task is not too far away from the average task weight. Using Corollaries 4.2 and 4.5, this model induces ΔS_p = O(P w_max/w_avg). Moreover, this model can be exploited to induce the optimistic bound ΔS_p = O(P). However, we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from w_avg. Here, we establish a looser and more realistic model on task weights, such that for any subchain T_{i,j} with weight W_{i,j} sufficiently larger than w_max, the average task weight within the subchain T_{i,j} satisfies O(w_avg). That is, Δ_{i,j} = j − i + 1 = O(W_{i,j}/w_avg). This model, referred to here as model M, directly induces ΔS_p = O(P w_max/w_avg), since ΔW_p ≤ ΔW_{P/2} = P w_max/2 for p = 1, 2, ..., P − 1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits the bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index-bound arrays, each of size P, computed according to Corollary 4.5 with B_f = B_RB. Note that SL_P = SH_P = N, since only B[P, N] needs to be computed in the last row. As seen in Fig. 6, only the B^p_j values for j = SL_p, SL_p + 1, ..., SH_p are computed at each row p by exploiting Corollary 4.5, which ensures the existence of an optimal partition Π_opt = ⟨s_0, s_1, ..., s_P⟩ with SL_p ≤ s_p ≤ SH_p. Thus these B^p_j values suffice for the correct computation of the B^{p+1}_i values for i = SL_{p+1}, SL_{p+1} + 1, ..., SH_{p+1} at the next row p + 1.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SH_p to SH_p + 1 within the repeat–until loop while computing B^{p+1}_i with SL_{p+1} ≤ i ≤ SH_{p+1} in two cases. In both cases, the functions W_{j+1,i} and B^p_j intersect in the open interval (SH_p, SH_p + 1), so that B^p_{SH_p} < W_{SH_p+1,i} and B^p_{SH_p+1} ≥ W_{SH_p+2,i}. In the first case, i = SH_p + 1, so that W_{j+1,i} and B^p_j intersect in (i − 1, i), which implies that B^{p+1}_i = W_{i−1,i} with j^{p+1}_i = SL_p, since W_{i−1,i} < B^p_i, as mentioned in Section 3.2. In the second case, i > SH_p + 1, for which Corollary 4.5 guarantees that B^{p+1}_i = W_{SL_p+1,i} ≤ B^p_{SH_p+1}, and thus we can safely select j^{p+1}_i = SL_p. Note that W_{SL_p+1,i} = B^p_{SH_p+1} may correspond to a case leading to another optimal partition with j^{p+1}_i = s_{p+1} = SH_p + 1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ into B^p_{SH_p+1} as a sentinel. Hence, in such cases, the condition W_{SH_p+1,i} < B^p_{SH_p+1} = ∞ in the if–then statement following the repeat–until loop is always true, so that the j-index automatically moves back to SH_p. The scheme of computing B^p_{SH_p+1} for each row p, which seems to be a natural solution, does not work, since the correct computation of B^{p+1}_{SH_{p+1}+1} may necessitate more than one B^p_j value beyond the SH_p index bound.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a P × N matrix to store the minimum j^p_i index values defining the B^p_i values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space for at least one optimal solution. For this purpose, the index bounds should instead be computed according to Lemma 4.3, since the search space restricted by Lemma 4.3 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔS_p). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔS_p = O(P w_max/w_avg), and hence the complexity is O(N + P log N + P² w_max/w_avg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition w_max = O(2 W_tot/P²).
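For contrast with DP+, the unrestricted DP recurrence B^p_j = min_k max{B^{p−1}_k, W_{k+1,j}} that DP+ accelerates can be sketched as a plain O(N²P) baseline (our illustration, not the paper's code; toy data):

```python
# Plain DP baseline for CCP, without separator-index bounding.
from itertools import accumulate

def dp_ccp(w, P):
    """Return the optimal bottleneck value B[P][N] via the recurrence
    B[p][j] = min over k of max(B[p-1][k], W_{k+1,j})."""
    N = len(w)
    W = [0] + list(accumulate(w))
    B = [[float("inf")] * (N + 1) for _ in range(P + 1)]
    B[0][0] = 0
    for p in range(1, P + 1):
        for j in range(N + 1):
            # k ranges over candidate positions of separator s_{p-1}
            B[p][j] = min(max(B[p - 1][k], W[j] - W[k]) for k in range(j + 1))
    return B[P][N]

print(dp_ccp([4, 1, 3, 2, 5, 2, 1], 3))  # 8
```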

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for small-to-medium numbers of processors. The main difference between the MS and the bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value it reaches becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as an initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

Fig. 7. Bidding algorithm.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Π_t = ⟨s_0, s_1, ..., s_P⟩ constructed by PROBE(B_t) for an infeasible B_t. After detecting the infeasibility of this B_t value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > B_t will never be to the left of the respective separator indices of Π_t. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current B_t value (i.e., L_P > B_t). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P}. Here, we call the L_p + w_{s_p+1} value the bid of processor P_p, which refers to the load of P_p if the first task t_{s_p+1} of the next processor is augmented to P_p. The bid of the last processor P_P is equal to the load of the remaining tasks. If the smallest bid B comes from processor P_b, probing with the new B is performed only for the remaining processors ⟨P_b, P_{b+1}, ..., P_P⟩ in the suffix W_{s_{b−1}+1,N} of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while loop implements a linear probing scheme, such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index s_p is set for processor P_p during linear probing, the repeat–until loop terminates if it is not possible to partition the remaining subchain T_{s_p+1,N} into P − p processors without exceeding the current B value, i.e., rbid = L_r/(P − p) > B, where L_r denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value of the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P w_max + P²(w_max/w_avg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial setting of the separators through binary search. The B value is increased at most B_opt − B* < w_max times, and each time the next B value can be computed in O(P) time, which induces the cost O(P w_max). The total area scanned by the separators is at most O(P²(w_max/w_avg)). For noninteger task weights, the complexity can reach O(P log N + P³(w_max/w_avg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
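The bidding idea can be sketched as follows (a simplified, non-incremental rendering of ours: the probe is recomputed from scratch at each B value instead of resuming from the previous separators, and no BIDS prefix-minimum array is kept; toy data):

```python
# Our simplified sketch of the bidding idea.
from bisect import bisect_right
from itertools import accumulate

def bidding(w, P):
    """Raise B from B* to the smallest processor bid until a probe succeeds."""
    W = [0] + list(accumulate(w))
    N, B = len(w), W[-1] / P               # start from the ideal value B*
    while True:
        s, loads = [0], []
        for _ in range(P - 1):             # greedy probe with bound B
            s.append(bisect_right(W, W[s[-1]] + B) - 1)
            loads.append(W[s[-1]] - W[s[-2]])
        loads.append(W[N] - W[s[-1]])      # last processor takes the rest
        if loads[-1] <= B:                 # first feasible B is optimal
            return B
        # bid of P_p: its load plus the first task of the next processor
        bids = [loads[p] + w[s[p + 1]] for p in range(P - 1) if s[p + 1] < N]
        B = min(bids + [loads[-1]])        # smallest realizable value > B

print(bidding([4, 1, 3, 2, 5, 2, 1], 3))  # 8
```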


4.4. Parametric search algorithms

In this work, we apply the theoretical findings given in Section 4.1 for an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with B_f = B_RB) to restrict the search space for the s_p separator values during the binary searches on the W array. That is, BINSRCH(W, SL_p, SH_p, Bsum) in RPROBE searches W in the index range [SL_p, SH_p] to find the index SL_p ≤ s_p ≤ SH_p such that W[s_p] ≤ Bsum and W[s_p + 1] > Bsum, via binary search. This scheme and Corollaries 4.2 and 4.6 reduce the complexity of an individual probe to Σ_{p=1}^{P} Θ(log ΔS_p) = O(P log P + P log(w_max/w_avg)). Note that this complexity reduces to O(P log P) for sufficiently large P, where P = Ω(w_max/w_avg). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.
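A minimal RPROBE sketch (ours, with Python's bisect standing in for BINSRCH; the toy prefix-sum array and index bounds below are hypothetical):

```python
# Our minimal restricted-probe sketch.
from bisect import bisect_right

def rprobe(W, SL, SH, P, B):
    """Probe with each separator s_p binary-searched only inside
    [SL_p, SH_p]; returns True iff bottleneck value B is feasible."""
    s = 0
    for p in range(1, P):
        lo, hi = max(SL[p], s), SH[p]
        # largest j in [lo, hi] with W[j] - W[s] <= B
        j = bisect_right(W, W[s] + B, lo, hi + 1) - 1
        s = max(j, s)
    return W[-1] - W[s] <= B               # last processor takes the rest

W = [0, 4, 5, 8, 10, 15, 17, 18]           # prefix sums of a toy chain
SL, SH = [0, 2, 4, 7], [0, 3, 5, 7]        # separator-index bounds, P = 3
print(rprobe(W, SL, SH, 3, 8), rprobe(W, SL, SH, 3, 7))  # True False
```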

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, B_RB], as opposed to [B*, W_tot]. In this algorithm, if PROBE(B_t) = TRUE, then the search space is restricted to the values B ≤ B_t, and if PROBE(B_t) = FALSE, then the search space is restricted to the values B > B_t. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator index-bounds depending on the success or failure of the probes. Let Π_t = ⟨t_0, t_1, ..., t_P⟩ be the partition constructed by PROBE(B_t). Any future PROBE(B) call with B ≤ B_t will set the s_p indices with s_p ≤ t_p; thus, the search space for s_p can be restricted to the indices ≤ t_p. Similarly, any future PROBE(B) call with B ≥ B_t will set the s_p indices ≥ t_p; thus, the search space for s_p can be restricted to the indices ≥ t_p.

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for loop over the SL or SH arrays, depending on the failure or success, respectively, of RPROBE(B_t). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π_t or SH ← Π_t, depending on the failure or success of RPROBE(B_t).


Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, B_RB < B* + w_max. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log w_max, and thus the overall complexity is O(N + P log N + log(w_max)(P log P + P log(w_max/w_avg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value (i.e., the total weight of a subchain of W). This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps to find the optimal bottleneck value.

After a probe RPROBE(B_t), the current upper bound value UB is modified if RPROBE(B_t) succeeds. Note that RPROBE(B_t) not only determines the feasibility of B_t, but also constructs a partition Π_t with cost(Π_t) ≤ B_t. Instead of reducing the upper bound UB to B_t, we can further reduce UB to the bottleneck value B = cost(Π_t) ≤ B_t of the partition Π_t constructed by RPROBE(B_t). Similarly, the current lower bound LB is modified when RPROBE(B_t) fails. In this case, instead of increasing the bound LB to B_t, we can exploit the partition Π_t constructed by RPROBE(B_t) to increase LB further, to the smallest realizable bottleneck value B greater than B_t. Our bidding algorithm already describes how to compute

B = min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P},

where L_p denotes the load of processor P_p in Π_t. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(w_max/w_avg)),

which has the solution O(P log P log N + P log N log(w_max/w_avg)). Here, O(P log P + P log(w_max/w_avg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(w_max/w_avg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
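The realizable-bound updates can be sketched as follows (our simplified rendering: a plain left-to-right greedy probe is used instead of RPROBE, and no separator-index bounding is applied; toy data):

```python
# Our sketch of bisection over realizable bottleneck values only.
from bisect import bisect_right
from itertools import accumulate

def exact_bisection(w, P):
    """Exact bisection: snap UB down to the realized cost after a
    successful probe, and LB up to the next realizable value after a
    failed one."""
    W = [0] + list(accumulate(w))
    N = len(w)

    def probe(B):
        """Greedy LR probe. Returns (feasible, realized cost, smallest
        realizable bottleneck value greater than B)."""
        s, loads = [0], []
        for _ in range(P - 1):
            s.append(bisect_right(W, W[s[-1]] + B) - 1)
            loads.append(W[s[-1]] - W[s[-2]])
        loads.append(W[N] - W[s[-1]])          # last processor takes the rest
        bids = [loads[p] + w[s[p + 1]] for p in range(P - 1) if s[p + 1] < N]
        return loads[-1] <= B, max(loads), min(bids + [loads[-1]])

    feasible, cost, nxt = probe(W[-1] / P)     # start at the ideal value B*
    if feasible:
        return cost
    LB, UB = nxt, W[-1]                        # both bounds are realizable
    while LB < UB:
        feasible, cost, nxt = probe((LB + UB) / 2)
        if feasible:
            UB = cost                          # snap down to realized cost
        else:
            LB = nxt                           # snap up to next realizable value
    return UB

print(exact_bisection([4, 1, 3, 2, 5, 2, 1], 3))  # 8
```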

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = W_tot), the success and failure of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π_t = ⟨0, t_1, t_2, ..., t_{P−1}, N⟩ to be constructed by any PROBE(B_t) with LB_1 < B_t = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < B_t < W_{1,i_1}. Hence, probes with LB_1 < B_t < UB_1 for processor P_2 can be restricted to be performed on W^{i_1,N}, where W^{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T^{P_p}_i denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T_{i,N} = ⟨t_i, t_{i+1}, ..., t_N⟩ of the task chain T onto the (P − p)th suffix P_{p,P} = ⟨P_{p+1}, P_{p+2}, ..., P_P⟩ of the processor chain P. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T_{i_1,N}. That is, B_opt = min{B_1 = W_{1,i_1}, C(Π^{P_1}_{i_1})}, where Π^{P_1}_{i_1} denotes an optimal partition for subproblem T^{P_1}_{i_1}. Proceeding in this way, once the indices ⟨i_1, i_2, ..., i_p⟩ for the first p processors ⟨P_1, P_2, ..., P_p⟩ are determined, then B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(Π^{P_p}_{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for loop, given i_{b−1}, i_b is found in the inner while loop by conducting probes on W^{i_{b−1},N} to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1}+1,j} values with SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + w_max(P log P + P log(w_max/w_avg))) for integer task weights, because the number of probes cannot exceed w_max, since there are at most w_max distinct bound values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + w_max(P log P)² + w_max P² log P log(w_max/w_avg)), since the algorithm makes O(w_max P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.
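A simplified rendering of the Nicol-style search (ours: it performs the plain binary search over candidate bottleneck values for each processor in turn, without RPROBE or dynamic bounding; toy data):

```python
# Our simplified sketch of Nicol-style divide and conquer for CCP.
from bisect import bisect_right
from itertools import accumulate

def nicol(w, P):
    W = [0] + list(accumulate(w))
    N = len(w)

    def feasible(start, parts, B):
        """Greedy check: can tasks start..N (1-based) be split into
        `parts` consecutive loads, each <= B?"""
        s = start - 1                      # prefix index just before `start`
        for _ in range(parts - 1):
            s = bisect_right(W, W[s] + B) - 1
        return W[N] - W[s] <= B

    best, start = float("inf"), 1
    for b in range(1, P):
        parts, lo, hi = P - b + 1, start, N
        while lo < hi:                     # smallest j with W_{start..j} feasible
            mid = (lo + hi) // 2
            if feasible(start, parts, W[mid] - W[start - 1]):
                hi = mid
            else:
                lo = mid + 1
        best = min(best, W[lo] - W[start - 1])
        start = lo                         # next suffix begins at task i_b
    return min(best, W[N] - W[start - 1])  # last processor takes the suffix

print(nicol([4, 1, 3, 2, 5, 2, 1], 3))  # 8
```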

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix, and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A^r_p of A and is responsible for computing y_p = A^r_p x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A^c_p of A and is responsible for computing y^p = A^c_p x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row (column) of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = W_tot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{i,j} can be efficiently computed as W_{i,j} = ROW[j + 1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS, with rows and columns interchanged; thus, W_{i,j} is computed as W_{i,j} = COL[j + 1] − COL[i].
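The constant-time subchain-weight lookup from CRS can be sketched as follows (our illustration; 0-based indexing, and the toy ROW array is hypothetical):

```python
# Our sketch of the constant-time CRS subchain-weight lookup.
def crs_subchain_weight(ROW, i, j):
    """Nonzeros in rows i..j inclusive: W_{i,j} = ROW[j+1] - ROW[i]."""
    return ROW[j + 1] - ROW[i]

ROW = [0, 2, 3, 6, 8]                    # 4 rows with 2, 1, 3, 2 nonzeros
print(crs_subchain_weight(ROW, 1, 2))    # 4 (rows 1 and 2: 1 + 3 nonzeros)
```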

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated by considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{i,j} as W_{i,j} = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.
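The adjusted CS weight can be sketched the same way (our illustration; 0-based indexing, and the toy COL array is hypothetical):

```python
# Our sketch of the adjusted CS subchain weight with DAXPY costs.
def cs_weight_with_daxpy(COL, i, j):
    """Columns i..j inclusive: nonzeros plus a cost of 3 per column,
    covering the three DAXPY operations of a CG iteration."""
    return COL[j + 1] - COL[i] + 3 * (j - i + 1)

COL = [0, 2, 3, 6, 8]                    # toy CCS column-pointer array
print(cs_weight_with_daxpy(COL, 1, 2))   # 4 nonzeros + 3 * 2 = 10
```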

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (the screen) is a 2D domain containing the pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, the screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighbor pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. With this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those incurred by counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.
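A minimal sketch of this tallying step (in C, assuming a hypothetical compressed primitive-to-cell incidence layout in which primitive p intersects cells[off[p]..off[p+1]-1]):

```c
#include <assert.h>

/* Inverse-area heuristic: each primitive distributes a total weight of 1
   over the coarse-mesh cells it intersects, so that summing cell weights
   over a region approximates the number of primitives in that region
   without over-counting primitives that span several cells. */
void tally_primitives(double *cell_weight, int ncells,
                      const int *cells, const int *off, int nprim)
{
    for (int c = 0; c < ncells; c++)
        cell_weight[c] = 0.0;
    for (int p = 0; p < nprim; p++) {
        int deg = off[p + 1] - off[p];   /* number of cells intersected */
        if (deg == 0)
            continue;                    /* degenerate primitive: skip */
        double w = 1.0 / deg;            /* inversely proportional share */
        for (int k = off[p]; k < off[p + 1]; k++)
            cell_weight[cells[k]] += w;
    }
}
```

Note that the resulting cell weights are real-valued, which is exactly why the 1D workload arrays of the DVR dataset are real-valued.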

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, which represent the results of computational fluid dynamics simulations commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. Triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256×256, 512×512, and 1024×1024 superimposed on a screen of resolution 1024×1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because zero-weight tasks can be compressed to retain only nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N−P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC− and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2
Properties of the test set

Sparse-matrix (SpM) dataset:

Name    No. of tasks N  Total Wtot  wavg   wmin  wmax  SpMxV time (ms)
NL      7039            105 089     14.93  1     361   22.55
cre-d   8926            372 266     41.71  1     845   72.20
CQ9     9278            221 590     23.88  1     702   45.90
ken-11  14 694          82 454      5.61   2     243   19.65
mod2    34 774          604 910     17.40  1     941   124.05
world   34 506          582 064     16.87  1     972   119.45

Direct volume rendering (DVR) dataset:

Name       No. of tasks N  Total Wtot  wavg   wmin   wmax
blunt256   17 303          303 K       17.54  0.020  1590.97
blunt512   93 231          314 K       3.36   0.004  661.50
blunt1024  372 824         352 K       0.94   0.001  411.04
post256    19 653          495 K       25.19  0.077  3245.50
post512    134 950         569 K       4.22   0.015  1092.00
post1024   539 994         802 K       1.49   0.004  1546.78


Figs. 5(a) and (b), respectively. Abbreviations ending with "+" represent our improved versions of these algorithms; that is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms on the SpM dataset, with ε = 1 for the integer-valued workload arrays in that dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms on its real-valued task weights.

Table 3 compares the load-balancing quality of the heuristics and exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics widens with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e., Σ_{p=1}^{P−1} ΔSp / N, where ΔSp = SHp − SLp + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices as well as the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index-bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range size below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. Column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B_RB instead of Wtot


Table 3
Percent load-imbalance values

Sparse-matrix dataset:

Name    P    H1     H2     RB     OPT
NL      16   2.60   2.44   1.20   0.35
        32   5.02   5.75   3.44   0.95
        64   8.95   9.01   5.60   2.37
        128  33.13  27.16  22.78  4.99
        256  69.55  69.55  60.78  14.25
cre-d   16   2.27   0.98   0.53   0.45
        32   4.19   4.42   3.74   1.03
        64   7.12   4.92   4.34   1.73
        128  25.57  18.73  16.70  2.88
        256  37.54  26.81  35.20  10.95
CQ9     16   1.85   1.85   0.58   0.58
        32   5.65   2.88   2.24   0.90
        64   13.25  11.49  7.64   1.43
        128  33.96  32.22  22.34  3.51
        256  58.62  58.62  58.62  14.72
ken-11  16   3.74   2.01   0.98   0.21
        32   3.74   4.67   3.74   1.18
        64   13.17  13.17  13.17  1.29
        128  13.17  16.89  13.17  6.80
        256  50.99  50.99  50.99  7.11
mod2    16   0.06   0.06   0.06   0.03
        32   0.19   0.19   0.19   0.07
        64   7.42   2.72   2.18   0.18
        128  16.15  6.29   2.46   0.41
        256  19.47  19.47  18.92  1.23
world   16   0.27   0.18   0.09   0.04
        32   1.03   0.37   0.27   0.08
        64   4.73   4.73   4.73   0.28
        128  6.37   6.37   6.37   0.76
        256  27.99  27.41  27.41  1.11
Avg     16   1.76   1.15   0.74   0.36
        32   3.99   3.39   2.88   0.76
        64   9.01   6.78   5.69   1.43
        128  19.97  17.66  14.09  3.36
        256  43.89  38.91  36.04  9.18

DVR dataset:

Name       P    H1     H2     RB     OPT
blunt256   16   4.60   1.20   0.49   0.34
           32   6.93   2.61   1.94   1.12
           64   14.52  9.44   9.44   2.31
           128  38.25  24.39  16.67  4.82
           256  96.72  37.03  34.21  34.21
blunt512   16   0.95   0.98   0.98   0.16
           32   1.38   1.38   1.18   0.33
           64   2.87   2.69   1.66   0.53
           128  5.62   8.45   4.62   0.97
           256  14.18  14.33  9.34   2.28
blunt1024  16   0.94   0.57   0.95   0.10
           32   1.89   1.21   0.97   0.14
           64   4.99   2.16   1.44   0.26
           128  10.06  4.25   2.47   0.57
           256  19.65  9.68   9.68   0.98
post256    16   1.10   1.43   0.76   0.56
           32   3.23   3.98   3.23   1.11
           64   17.28  11.04  10.90  3.10
           128  45.35  29.09  29.09  8.29
           256  67.86  67.86  67.86  67.86
post512    16   0.49   1.25   0.33   0.33
           32   0.94   1.61   0.90   0.58
           64   4.85   5.33   4.85   0.94
           128  18.03  14.55  10.15  1.72
           256  30.03  25.29  25.29  3.73
post1024   16   0.70   0.70   0.53   0.20
           32   1.49   1.49   1.41   0.54
           64   2.85   2.85   1.49   0.91
           128  13.15  11.79  9.10   1.11
           256  40.50  13.37  14.19  2.54
Avg        16   1.44   1.01   1.04   0.28
           32   2.64   2.05   1.61   0.64
           64   7.89   5.58   4.96   1.31
           128  21.74  15.42  12.02  2.91
           256  44.82  27.93  26.76  18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in ms) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification so that improvements can be seen easily. In both tables, BID is listed separately, since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as


Table 4
Sizes of separator-index ranges, normalized with respect to N

Sparse-matrix dataset:

Name    P    (N−P)P/N  L1     L3     C5
NL      16   15.97     0.32   0.13   0.036
        32   31.86     1.23   0.58   0.198
        64   63.43     4.97   2.24   0.964
        128  125.69    19.81  19.60  4.305
        256  246.73    77.96  82.83  22.942
cre-d   16   15.97     0.16   0.04   0.019
        32   31.89     0.76   0.90   0.137
        64   63.55     2.89   1.85   0.610
        128  126.18    11.59  15.12  2.833
        256  248.69    46.66  57.54  16.510
CQ9     16   15.97     0.30   0.03   0.023
        32   31.89     1.21   0.34   0.187
        64   63.57     4.84   2.96   0.737
        128  126.25    19.20  18.66  3.121
        256  248.96    74.84  86.80  23.432
ken-11  16   15.98     0.28   0.11   0.024
        32   31.93     1.10   1.13   0.262
        64   63.73     4.42   7.34   0.655
        128  126.89    17.60  14.56  4.865
        256  251.56    69.18  85.44  13.076
mod2    16   15.99     0.16   0.00   0.002
        32   31.97     0.55   0.04   0.013
        64   63.88     2.14   1.22   0.109
        128  127.53    8.62   2.59   0.411
        256  254.12    34.49  39.02  2.938
world   16   15.99     0.15   0.01   0.004
        32   31.97     0.59   0.06   0.020
        64   63.88     2.30   2.81   0.158
        128  127.53    9.23   7.19   0.850
        256  254.11    36.97  53.15  2.635
Avg     16   15.97     0.21   0.06   0.018
        32   31.88     0.86   0.68   0.136
        64   63.52     3.43   2.63   0.516
        128  126.07    13.68  12.74  2.731
        256  248.26    54.28  57.55  14.089

DVR dataset:

Name       P    (N−P)P/N  L1      L3     C5
blunt256   16   15.99     0.38    0.07   0.003
           32   31.94     1.22    0.30   0.144
           64   63.77     5.10    3.81   0.791
           128  127.06    21.66   12.88  2.784
           256  252.23    101.68  50.48  10.098
blunt512   16   16.00     0.09    0.17   0.006
           32   31.99     0.35    0.26   0.043
           64   63.96     1.64    0.69   0.144
           128  127.83    6.65    4.59   0.571
           256  255.30    28.05   16.71  2.262
blunt1024  16   16.00     0.05    0.09   0.010
           32   32.00     0.18    0.18   0.018
           64   63.99     0.77    0.74   0.068
           128  127.96    3.43    2.53   0.300
           256  255.82    13.92   20.36  1.648
post256    16   15.99     0.35    0.04   0.021
           32   31.95     1.50    0.52   0.162
           64   63.79     6.12    4.26   0.932
           128  127.17    26.48   23.68  4.279
           256  252.68    124.76  90.48  16.485
post512    16   16.00     0.13    0.02   0.003
           32   31.99     0.56    0.14   0.063
           64   63.97     2.33    2.41   0.413
           128  127.88    9.33    10.15  1.714
           256  255.52    37.08   46.28  6.515
post1024   16   16.00     0.17    0.07   0.020
           32   32.00     0.54    0.27   0.087
           64   63.99     2.19    0.61   0.210
           128  127.97    8.77    9.48   0.918
           256  255.88    34.97   27.28  4.479
Avg        16   16.00     0.20    0.08   0.011
           32   31.98     0.73    0.28   0.096
           64   63.91     3.03    2.09   0.426
           128  127.65    12.72   10.55  1.428
           256  254.57    56.74   41.93  6.915

L1, L3, and C5 denote the separator-index ranges obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively.


an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset:

Name    P    NC−   NC    NC+   εBS   εBS+  EBS
NL      16   177   21    7     17    6     7
        32   352   19    7     17    7     7
        64   720   27    5     16    7     6
        128  1450  40    14    17    8     8
        256  2824  51    14    16    8     8
cre-d   16   190   20    6     19    7     5
        32   393   25    11    19    9     8
        64   790   24    10    19    8     6
        128  1579  27    10    19    9     9
        256  3183  37    13    18    9     9
CQ9     16   182   19    3     18    6     5
        32   373   22    8     18    8     7
        64   740   29    12    18    8     8
        128  1492  40    12    17    8     8
        256  2971  50    14    18    9     9
ken-11  16   185   21    6     17    6     6
        32   364   20    6     17    6     6
        64   721   48    9     17    7     7
        128  1402  61    8     16    7     7
        256  2783  96    7     17    8     8
mod2    16   210   20    6     20    5     4
        32   432   26    8     19    5     5
        64   867   24    6     19    8     8
        128  1727  38    7     20    7     7
        256  3444  47    9     20    9     9
world   16   211   22    6     19    5     4
        32   424   26    5     19    6     6
        64   865   30    12    19    9     9
        128  1730  44    10    19    8     8
        256  3441  44    11    19    10    10
Avg     16   193   20.5  5.7   18.5  5.8   5.2
        32   390   23.0  7.5   18.3  6.8   6.5
        64   784   30.3  9.0   18.0  7.8   7.3
        128  1564  41.7  10.2  18.0  7.8   7.8
        256  3108  54.2  11.3  18.0  8.8   8.8

DVR dataset:

Name       P    NC−   NC    NC+   εBS+&BID  EBS
blunt256   16   202   18    6     16 + 1    9
           32   407   19    7     19 + 1    10
           64   835   18    10    21 + 1    12
           128  1686  25    15    22 + 1    13
           256  3556  203   62    23 + 1    11
blunt512   16   245   20    8     17 + 1    9
           32   502   24    13    18 + 1    10
           64   1003  23    11    18 + 1    12
           128  2023  26    12    20 + 1    13
           256  4051  23    12    21 + 1    13
blunt1024  16   271   21    10    16 + 1    13
           32   557   24    12    18 + 1    12
           64   1127  21    11    18 + 1    12
           128  2276  28    13    19 + 1    13
           256  4564  27    16    21 + 1    15
post256    16   194   22    9     18 + 1    7
           32   409   19    8     20 + 1    12
           64   823   17    11    22 + 1    12
           128  1653  19    13    22 + 1    14
           256  3345  191   102   23 + 1    13
post512    16   235   24    9     16 + 1    10
           32   485   23    12    18 + 1    11
           64   975   22    13    20 + 1    14
           128  1975  25    17    21 + 1    14
           256  3947  29    20    22 + 1    15
post1024   16   261   27    10    17 + 1    13
           32   538   23    13    19 + 1    14
           64   1090  22    12    17 + 1    14
           128  2201  25    15    21 + 1    16
           256  4428  28    16    22 + 1    16
Avg        16   235   22.0  8.7   16.7 + 1  10.2
           32   483   22.0  10.8  18.7 + 1  11.5
           64   976   20.5  11.3  19.3 + 1  12.7
           128  1969  24.7  14.2  20.8 + 1  13.8
           256  3982  83.5  38.0  22.0 + 1  13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 127.7, 57.8, 33.2, 15.9, and 7.1 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6
Partitioning times for the sparse-matrix dataset, as percents of single SpMxV times

             Heuristics          Existing exact algorithms    Proposed exact algorithms
Name    P    H1    H2    RB      DP    MS    εBS    NC        DP+     MS+     εBS+   NC+    BID
NL      16   0.09  0.09  0.09    93    119   0.93   1.20      0.40    0.44    0.40   0.40   0.22
        32   0.18  0.18  0.13    177   194   1.77   2.17      1.73    1.73    0.80   0.80   0.44
        64   0.35  0.35  0.31    367   307   3.15   5.85      5.81    4.12    1.86   1.77   1.82
        128  0.89  0.84  0.71    748   485   6.30   17.12     19.87   26.92   4.26   7.32   4.21
        256  1.51  1.55  1.37    1461  757   11.09  40.31     91.80   96.41   9.80   15.70  21.06
cre-d   16   0.03  0.01  0.01    33    36    0.33   0.37      0.12    0.06    0.10   0.11   0.03
        32   0.04  0.06  0.06    71    61    0.65   0.93      0.47    1.20    0.29   0.33   0.10
        64   0.12  0.12  0.08    147   98    1.22   1.69      1.66    1.83    0.61   0.71   0.21
        128  0.28  0.29  0.14    283   150   2.41   3.59      5.47    10.94   1.68   1.68   0.61
        256  0.53  0.54  0.25    571   221   4.20   10.22     27.05   26.02   3.30   4.46   1.20
CQ9     16   0.04  0.04  0.04    59    73    0.48   0.59      0.20    0.13    0.15   0.13   0.17
        32   0.09  0.09  0.07    127   120   0.94   1.31      0.94    0.72    0.41   0.41   0.28
        64   0.20  0.17  0.15    248   195   1.85   3.18      3.01    4.47    1.00   1.44   0.68
        128  0.44  0.44  0.31    485   303   3.18   8.52      9.65    18.82   2.44   3.05   2.51
        256  0.92  0.92  0.63    982   469   6.51   20.33     69.56   72.2    5.32   7.45   14.92
ken-11  16   0.10  0.10  0.10    238   344   1.27   1.88      0.56    0.61    0.46   0.41   0.25
        32   0.20  0.20  0.15    501   522   2.29   3.10      3.82    4.78    1.22   1.17   1.63
        64   0.46  0.46  0.36    993   778   4.48   13.08     9.41    24.78   3.10   3.46   2.29
        128  1.17  1.17  0.71    1876  1139  9.21   39.59     50.13   50.03   6.41   6.62   17.4
        256  2.14  2.14  1.42    3827  1706  17.76  119.13    129.01  241.48  14.91  14.50  29.11
mod2    16   0.02  0.02  0.01    92    119   0.27   0.34      0.07    0.05    0.06   0.06   0.03
        32   0.04  0.04  0.02    188   192   0.51   0.78      0.19    0.18    0.16   0.15   0.07
        64   0.11  0.10  0.06    378   307   1.04   1.91      0.86    2.18    0.51   0.55   0.23
        128  0.24  0.23  0.11    751   482   2.28   5.92      2.61    3.72    1.06   1.23   0.53
        256  0.48  0.48  0.23    1486  756   4.26   11.14     12.11   50.04   2.91   3.43   3.66
world   16   0.02  0.02  0.02    99    122   0.29   0.33      0.08    0.05    0.07   0.08   0.03
        32   0.04  0.04  0.04    198   196   0.59   0.88      0.26    0.21    0.18   0.19   0.08
        64   0.12  0.11  0.09    385   306   1.16   1.70      1.09    4.84    0.64   0.54   0.35
        128  0.24  0.23  0.19    769   496   2.19   5.32      4.48    10.15   1.31   1.23   1.77
        256  0.47  0.47  0.36    1513  764   4.06   11.83     11.54   69.99   3.46   3.50   2.48
Avg     16   0.05  0.05  0.05    102   136   0.60   0.78      0.24    0.22    0.21   0.20   0.12
        32   0.10  0.10  0.08    210   214   1.12   1.53      1.23    1.47    0.51   0.51   0.43
        64   0.23  0.22  0.18    420   332   2.15   4.57      3.64    7.04    1.29   1.41   0.93
        128  0.54  0.53  0.36    819   509   4.26   13.34     15.37   20.1    2.86   3.52   4.51
        256  1.01  1.01  0.71    1640  779   7.98   35.49     56.84   92.69   6.62   8.17   12.07


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small-to-moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in ms) for the DVR dataset

           P    H1    H2    RB      DP      MS      NC     DP+     MS+      NC+    EBS    BID
blunt256 (prefix-sum time: 1.95)
           16   0.02  0.02  0.03    6.8     4.9     0.36   0.21    0.53     0.10   0.14   0.03
           32   0.05  0.05  0.05    14.1    7.7     0.77   0.76    0.76     0.21   0.27   0.24
           64   0.11  0.12  0.09    28.6    13.4    1.39   3.86    8.72     0.64   0.71   0.78
           128  0.24  0.25  0.20    58.1    20.6    3.66   12.37   29.95    1.78   1.84   2.33
           256  0.47  0.50  0.37    113.9   29.6    52.53  42.89   0.78     13.96  3.11   56.69
blunt512 (prefix-sum time: 13.45)
           16   0.03  0.03  0.03    35.6    20.0    0.55   0.25    0.91     0.14   0.16   0.09
           32   0.07  0.07  0.07    79.2    35.3    1.31   1.23    2.13     0.44   0.43   0.22
           64   0.18  0.19  0.14    168.8   59.3    2.65   3.68    9.28     0.94   1.10   0.45
           128  0.39  0.40  0.28    346.9   97.9    5.84   15.01   46.72    2.29   2.63   1.65
           256  0.74  0.75  0.54    704.0   163.7   9.70   57.50   164.49   5.59   5.95   10.72
blunt1024 (prefix-sum time: 59.12)
           16   0.03  0.03  0.03    145.5   78.0    0.75   0.93    4.77     0.25   0.28   0.19
           32   0.12  0.11  0.08    325.1   143.2   1.81   2.02    8.92     0.60   0.62   0.31
           64   0.27  0.27  0.19    697.6   235.3   3.50   7.35    22.28    1.27   1.37   1.39
           128  0.50  0.51  0.37    1433.7  391.1   9.14   33.01   69.38    3.36   3.56   16.21
           256  0.97  1.00  0.68    2941.7  656.7   16.59  180.09  538.10   8.48   8.98   16.29
post256 (prefix-sum time: 2.16)
           16   0.02  0.03  0.03    7.9     6.1     0.46   0.18    0.13     0.10   0.11   0.24
           32   0.05  0.05  0.05    15.7    10.0    0.77   1.05    1.74     0.26   0.31   0.76
           64   0.10  0.11  0.10    33.6    16.7    1.37   5.26    9.72     0.61   0.70   4.67
           128  0.25  0.26  0.18    67.2    27.8    2.90   22.94   55.10    1.65   1.77   23.96
           256  0.47  0.48  0.37    137.1   33.8    45.16  84.78   0.77     13.04  3.62   299.32
post512 (prefix-sum time: 20.55)
           16   0.02  0.02  0.03    61.2    45.4    0.63   0.20    0.44     0.14   0.16   1.21
           32   0.07  0.07  0.07    125.4   69.9    1.30   2.49    1.95     0.38   0.42   3.36
           64   0.18  0.18  0.13    255.9   113.9   2.58   16.93   38.73    0.97   1.07   8.53
           128  0.38  0.39  0.28    522.1   193.6   5.84   68.25   156.15   2.27   2.44   31.54
           256  0.70  0.73  0.53    1068.3  335.4   12.67  256.48  726.54   5.44   5.39   140.88
post1024 (prefix-sum time: 69.09)
           16   0.03  0.04  0.03    244.6   186.3   0.91   2.88    5.73     0.17   0.21   2.63
           32   0.09  0.08  0.08    505.6   291.7   1.61   13.57   17.37    0.51   0.58   12.73
           64   0.25  0.26  0.17    1034.0  462.6   3.56   35.05   33.25    1.35   1.47   32.53
           128  0.51  0.53  0.36    2110.2  783.8   7.90   156.34  546.92   2.95   3.28   76.96
           256  0.95  0.98  0.67    4365.2  1377.0  16.30  764.59  1451.87  7.55   7.02   300.70

Averages normalized with respect to RB times (prefix-sum time not included):
           16   0.83  0.94  1.00    2786.3  1891.9  20.33  25.83   69.50    5.00   5.89   24.39
           32   1.10  1.06  1.00    2316.9  1215.6  18.47  47.37   72.82    5.83   6.46   39.02
           64   1.30  1.35  1.00    2263.6  929.1   17.88  82.81   145.19   7.00   7.81   53.81
           128  1.35  1.39  1.00    2250.6  755.3   20.46  168.36  481.19   8.60   9.31   86.81
           256  1.35  1.39  1.00    2473.1  688.0   59.10  390.25  772.99   19.55  10.51  286.77

Averages normalized with respect to RB times (prefix-sum time included):
           16   1.00  1.00  1.00    3.2     2.3     1.08   1.04    1.09     1.01   1.02   1.03
           32   1.00  1.00  1.00    6.6     3.6     1.15   1.21    1.29     1.04   1.05   1.13
           64   1.00  1.00  1.00    13.5    5.8     1.27   1.97    2.90     1.11   1.12   1.55
           128  1.01  1.01  1.00    27.5    9.4     1.62   4.75    10.53    1.28   1.30   3.25
           256  1.02  1.02  1.00    52.8    14.2    7.98   14.64   13.71    2.95   1.55   26.73


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.
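For concreteness, the greedy probe operation whose count and cost drive the parametric-search results above can be sketched as follows (illustrative C, not the paper's pseudocode; `W` is assumed to be the prefix-sum array of task weights, with W[0] = 0 and W[N] = Wtot):

```c
#include <assert.h>
#include <stddef.h>

/* First index in [lo, hi) with W[index] > x (binary search). */
static size_t upper_bound(const double *W, size_t lo, size_t hi, double x)
{
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (W[mid] <= x) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* PROBE(B): can the chain of N tasks be split into at most P consecutive
   parts, each of weight <= B?  Each part is extended as far as the bound
   allows, so a probe costs O(P log N) binary searches. */
int probe(const double *W, size_t N, int P, double B)
{
    size_t s = 0;                       /* current separator index */
    for (int p = 0; p < P && s < N; p++)
        s = upper_bound(W, s + 1, N + 1, W[s] + B) - 1;
    return s == N;                      /* all tasks placed within bound */
}
```

Parametric search then narrows the interval of candidate bottleneck values by repeated probes; the dynamic bounding schemes discussed above reduce both how many probes are needed and how much of `W` each probe must search.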


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix–vector multiplication, and image-space decomposition for sort-first parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] U.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–III-127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, in: Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the

conjugate gradient method SIAM J Sci Statist Comput 6 (5)

(1985) 865ndash881

[33] MU Ujaldon EL Zapata BM Chapman HP Zima Vienna-

Fortranhpf extensions for sparse and irregular problems and

their compilation IEEE Trans Parallel Distributed Systems 8

(10) (1997) 1068ndash1083

[34] MU Ujaldon EL Zapata SD Sharma J Saltz Parallelization

techniques for sparse matrix applications J Parallel Distributed

Computing 38 (1996) 256ndash266

[35] MU Ujaldon EL Zapata SD Sharma J Saltz Experimental

evaluation of efficient sparse matrix distributions in Proceedings

of the ACM International Conference of Supercomputing

Philadelphia PA (1996) pp 78ndash86

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


hopefully good bottleneck value B_RB and use B_RB as an upper bound for bottleneck values, i.e., UB = B_RB. Then, we run LR-PROBE(B_RB) and RL-PROBE(B_RB) to construct two mappings Π1 = ⟨h1_0, h1_1, ..., h1_P⟩ and Π2 = ⟨l2_0, l2_1, ..., l2_P⟩ with C(Π1), C(Π2) ≤ B_RB. Here, LR-PROBE denotes the left-to-right probe given in Fig. 3, whereas RL-PROBE denotes a right-to-left probe function, which can be considered as the dual of LR-PROBE. RL-PROBE exploits the greedy-choice property from right to left; that is, RL-PROBE assigns subchains from the right end towards the left end of the task chain to processors in the order P_P, P_{P-1}, ..., P_1. From these two mappings, lower and upper bound values for the s_p separator indices are constructed as SL_p = l2_p and SH_p = h1_p, respectively. These bounds are further refined by running LR-PROBE(B*) and RL-PROBE(B*) to construct two mappings Π3 = ⟨l3_0, l3_1, ..., l3_P⟩ and Π4 = ⟨h4_0, h4_1, ..., h4_P⟩, and then defining SL_p = max{SL_p, l3_p} and SH_p = min{SH_p, h4_p} for 1 ≤ p < P. Lemmas 4.3 and 4.4 prove the correctness of these bounds.

Lemma 4.3. For a given CCP problem instance (W, N, P) and a feasible bottleneck value Bf, let Π1 = ⟨h1_0, h1_1, ..., h1_P⟩ and Π2 = ⟨l2_0, l2_1, ..., l2_P⟩ be the partitions constructed by LR-PROBE(Bf) and RL-PROBE(Bf), respectively. Then, any partition Π = ⟨s_0, s_1, ..., s_P⟩ of cost C(Π) = B ≤ Bf satisfies l2_p ≤ s_p ≤ h1_p.

Proof. By the property of LR-PROBE(Bf), h1_p is the largest index such that T_{1,h1_p} can be partitioned into p parts without exceeding Bf. If s_p > h1_p, then the bottleneck value will exceed Bf, and thus B. By the property of RL-PROBE(Bf), l2_p is the smallest index such that T_{l2_p,N} can be partitioned into P - p parts without exceeding Bf. If s_p < l2_p, then the bottleneck value will exceed Bf, and thus B. □

Lemma 4.4. For a given CCP problem instance (W, N, P), let Π3 = ⟨l3_0, l3_1, ..., l3_P⟩ and Π4 = ⟨h4_0, h4_1, ..., h4_P⟩ be the partitions constructed by LR-PROBE(B*) and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value Bf, there exists a partition Π = ⟨s_0, s_1, ..., s_P⟩ of cost C(Π) ≤ Bf that satisfies l3_p ≤ s_p ≤ h4_p.

Proof. Consider the partition Π = ⟨s_0, s_1, ..., s_P⟩ constructed by LR-PROBE(Bf). It is clear that this partition already satisfies the lower bounds, i.e., s_p ≥ l3_p. Assume s_p > h4_p; then the partition Π′ obtained by moving s_p back to h4_p also yields a partition with cost C(Π′) ≤ Bf, since T_{h4_p+1,N} can be partitioned into P - p parts without exceeding B*. □

The difference between Lemmas 4.3 and 4.4 is that the former ensures the existence of all partitions with cost ≤ Bf within the given separator-index ranges, whereas the latter ensures only the existence of at least one such partition within the given ranges. The following corollary combines the results of these two lemmas.

Corollary 4.5. For a given CCP problem instance (W, N, P) and a feasible bottleneck value Bf, let Π1 = ⟨h1_0, h1_1, ..., h1_P⟩, Π2 = ⟨l2_0, l2_1, ..., l2_P⟩, Π3 = ⟨l3_0, l3_1, ..., l3_P⟩, and Π4 = ⟨h4_0, h4_1, ..., h4_P⟩ be the partitions constructed by LR-PROBE(Bf), RL-PROBE(Bf), LR-PROBE(B*), and RL-PROBE(B*), respectively. Then, for any feasible bottleneck value B in the range [B*, Bf], there exists a partition Π = ⟨s_0, s_1, ..., s_P⟩ of cost C(Π) ≤ B with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P, where SL_p = max{l2_p, l3_p} and SH_p = min{h1_p, h4_p}.

Corollary 4.6. The separator range weights become ΔW_p = 2 min{p, (P - p)} w_max in the worst case, with a maximum value of P w_max at p = P/2.

Lemma 4.1 and Corollary 4.5 infer the following theorem, since B* ≤ B_opt ≤ B* + w_max.

Theorem 4.7. For a given CCP problem instance (W, N, P) and SL_p and SH_p index bounds constructed according to Lemma 4.1 or Corollary 4.5, there exists an optimal partition Π_opt = ⟨s_0, s_1, ..., s_P⟩ with SL_p ≤ s_p ≤ SH_p for 1 ≤ p < P.

Comparison of the separator range weights in Lemma 4.1 and Corollary 4.5 shows that the separator range weights produced by the practical scheme described in Corollary 4.5 may be worse than those of Lemma 4.1 by a factor of two. This is only the worst-case behavior, however; the practical scheme normally finds much better bounds, since order in the chain usually prevents the worst-case behavior and B_RB < B* + w_max. Experimental results in Section 6 justify this expectation.
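The bound computation described above can be sketched in Python as follows. This is our minimal illustration, not the paper's implementation: the helper names are ours, the LR/RL probes are realized as greedy scans driven by binary search on the prefix-summed workload array, and B_RB is assumed to come from the RB heuristic.

```python
from bisect import bisect_left, bisect_right

def prefix_sums(w):
    """W[j] = total weight of tasks 1..j (W[0] = 0)."""
    W = [0]
    for x in w:
        W.append(W[-1] + x)
    return W

def lr_probe(W, P, B):
    """Left-to-right greedy probe: s[p] is the largest index such that the
    first p parts each carry load <= B.  B is feasible iff s[P] == N."""
    s = [0]
    for _ in range(P):
        s.append(bisect_right(W, W[s[-1]] + B) - 1)
    return s

def rl_probe(W, P, B):
    """Right-to-left dual of lr_probe, packing processors P, P-1, ..., 1."""
    s = [len(W) - 1]
    for _ in range(P):
        s.append(bisect_left(W, W[s[-1]] - B))
    s.reverse()
    return s

def separator_bounds(W, P, B_rb):
    """SL/SH bounds in the spirit of Corollary 4.5, with Bf = B_RB and
    the ideal bottleneck value B* = Wtot / P."""
    B_star = W[-1] / P
    h1, l2 = lr_probe(W, P, B_rb), rl_probe(W, P, B_rb)
    l3, h4 = lr_probe(W, P, B_star), rl_probe(W, P, B_star)
    SL = [max(a, b) for a, b in zip(l2, l3)]
    SH = [min(a, b) for a, b in zip(h1, h4)]
    return SL, SH
```

On the small chain w = (3, 1, 2, 2, 3, 1) with P = 3 and a feasible B_RB = 5, the four probes collapse the ranges to SL = SH = [0, 2, 4, 6], pinning every separator of the optimal partition.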

4.1.1. Complexity analysis models

Corollaries 4.2 and 4.6 give bounds on the weights of the separator-index ranges. However, we need bounds on the sizes of these separator-index ranges for the computational complexity analysis of the proposed CCP algorithms. Here, the size ΔS_p = SH_p - SL_p + 1 denotes the number of tasks within the pth range [SL_p, SH_p]. Miguet and Pierson [24] propose the model w_i = Θ(w_avg) for i = 1, 2, ..., N to prove that their H1 and H2 heuristics allocate Θ(N/P) tasks to each processor, where w_avg = W_tot/N is the average task weight. This assumption means that the weight of each task is not too far from the average task weight. Using Corollaries 4.2 and 4.5, this model induces ΔS_p = O(P w_max/w_avg). Moreover, this model can be exploited to induce the optimistic bound ΔS_p = O(P). However, we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from w_avg. Here, we establish a looser and more realistic model on task weights, such that for any subchain T_{i,j} with weight W_{i,j} sufficiently larger than w_max, the average task weight within subchain T_{i,j} is O(w_avg); that is, Δ_{i,j} = j - i + 1 = O(W_{i,j}/w_avg). This model, referred to here as model M, directly induces ΔS_p = O(P w_max/w_avg), since ΔW_p ≤ ΔW_{P/2} = P w_max/2 for p = 1, 2, ..., P - 1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits the bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index bound arrays, each of size P, computed according to Corollary 4.5 with Bf = B_RB. Note that SL_P = SH_P = N, since only B_{P,N} needs to be computed in the last row. As seen in Fig. 6, only the B_{p,j} values for j = SL_p, SL_p + 1, ..., SH_p are computed at each row p, by exploiting Corollary 4.5, which ensures the existence of an optimal partition Π_opt = ⟨s_0, s_1, ..., s_P⟩ with SL_p ≤ s_p ≤ SH_p. Thus, these B_{p,j} values suffice for the correct computation of the B_{p+1,i} values for i = SL_{p+1}, SL_{p+1} + 1, ..., SH_{p+1} at the next row p + 1.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SH_p to SH_p + 1 within the repeat-until loop while computing B_{p+1,i} with SL_{p+1} ≤ i ≤ SH_{p+1} in two cases. In both cases, the functions W_{j+1,i} and B_{p,j} intersect in the open interval (SH_p, SH_p + 1), so that B_{p,SH_p} < W_{SH_p+1,i} and B_{p,SH_p+1} ≥ W_{SH_p+2,i}. In the first case, i = SH_p + 1, so that W_{j+1,i} and B_{p,j} intersect in (i - 1, i), which implies that B_{p+1,i} = W_{i-1,i} with j^{p+1}_i = SL_p, since W_{i-1,i} < B_{p,i}, as mentioned in Section 3.2. In the second case, i > SH_p + 1, for which Corollary 4.5 guarantees that B_{p+1,i} = W_{SL_p+1,i} ≤ B_{p,SH_p+1}, and thus we can safely select j^{p+1}_i = SL_p. Note that W_{SL_p+1,i} = B_{p,SH_p+1} may correspond to a case leading to another optimal partition with j^{p+1}_i = s_{p+1} = SH_p + 1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ in B_{p,SH_p+1} as a sentinel. Hence, in such cases, the condition W_{SH_p+1,i} < B_{p,SH_p+1} = ∞ in the if-then statement following the repeat-until loop is always true, so that the j-index automatically moves back to SH_p. The scheme of computing B_{p,SH_p+1} for each row p, which seems to be a natural solution, does not work, since the correct computation of B_{p+1,SH_{p+1}+1} may necessitate more than one B_{p,j} value beyond the SH_p index bound.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a P × N matrix to store the minimum j^p_i index values defining the B_{p,i} values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space to contain at least one optimal solution. The index bounds should be computed according to Lemma 4.3 for this purpose, since the search space restricted by Lemma 4.3 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔS_p). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔS_p = O(P w_max/w_avg), and hence the complexity is O(N + P log N + P² w_max/w_avg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition w_max = O(2 W_tot/P²).
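The recurrence restricted to the [SL_p, SH_p] ranges can be illustrated with the following sketch. It is ours, not the paper's Fig. 6: it keeps a quadratic inner minimization instead of the merged linear scan with the sentinel trick, and falls back to trivial ranges when no bounds are supplied.

```python
def dp_bottleneck(w, P, SL=None, SH=None):
    """CCP recurrence B[p][j] = min_{j'} max(B[p-1][j'], W_{j'+1,j}),
    evaluated only for j in [SL[p], SH[p]] at each row p."""
    N = len(w)
    W = [0]
    for x in w:
        W.append(W[-1] + x)
    # trivial default ranges: p <= s_p <= N - (P - p)
    SL = SL or [min(p, N) for p in range(P + 1)]
    SH = SH or [max(N - (P - p), 0) for p in range(P + 1)]
    row = {j: W[j] for j in range(SL[1], SH[1] + 1)}   # one part: B[1][j] = W_{1,j}
    for p in range(2, P + 1):
        lo, hi = (SL[p], SH[p]) if p < P else (N, N)   # last row: only j = N
        row = {j: min(max(row[jp], W[j] - W[jp])       # processor p gets jp+1..j
                      for jp in row if jp <= j)
               for j in range(lo, hi + 1)}
    return row[N]
```

With tight bounds (e.g., those produced from the Corollary 4.5 scheme) each row shrinks to ΔS_p entries, which is where the O(N + P log N + P² w_max/w_avg) behavior comes from.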

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for a small-to-medium number of processors. The main difference between the MS and the bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value found becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as the initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

Fig. 7. Bidding algorithm.

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Π_t = ⟨s_0, s_1, ..., s_P⟩ constructed by PROBE(B_t) for an infeasible B_t. After detecting the infeasibility of this B_t value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > B_t will never be to the left of the respective separator indices of Π_t. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current B_t value (i.e., L_P > B_t). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P}. Here, we call the value L_p + w_{s_p+1} the bid of processor P_p, which refers to the load of P_p if the first task t_{s_p+1} of the next processor is augmented to P_p. The bid of the last processor P_P is equal to the load of the remaining tasks. If the smallest bid B comes from processor P_b, probing with the new B is performed only for the remaining processors ⟨P_b, P_{b+1}, ..., P_P⟩ on the suffix W_{s_{b-1}+1,N} of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while loop implements a linear probing scheme, such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index s_p is set for processor P_p during linear probing, the repeat-until loop terminates if it is not possible to partition the remaining subchain T_{s_p+1,N} among the P - p remaining processors without exceeding the current B value, i.e., rbid = L_r/(P - p) > B, where L_r denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value among the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P w_max + P²(w_max/w_avg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial setting of the separators through binary search. The B value is increased at most B_opt - B* < w_max times, and each time the next B value can be computed in O(P) time, which induces the O(P w_max) cost. The total area scanned by the separators is at most O(P²(w_max/w_avg)). For noninteger task weights, the complexity can reach O(P log N + P³(w_max/w_avg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
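A compact sketch of the bidding idea follows. It is ours and is simplified relative to Fig. 7: it re-probes from scratch at each new B instead of resuming from the winning processor, and it omits the BIDS prefix-minimum array.

```python
from bisect import bisect_right

def bidding(w, P):
    """Optimal bottleneck value: start from B* and repeatedly raise B to the
    smallest processor bid until a greedy probe succeeds."""
    W = [0]
    for x in w:
        W.append(W[-1] + x)
    N = len(w)
    B = W[N] / P                       # ideal bottleneck value B*
    while True:
        s = [0]                        # greedy left-to-right probe
        for _ in range(P):
            s.append(bisect_right(W, W[s[-1]] + B) - 1)
        if s[P] == N:                  # first feasible B is optimal
            return B
        # bid of P_p: its load if it also absorbs the next processor's first task
        bids = [W[s[p] + 1] - W[s[p - 1]] for p in range(1, P) if s[p] < N]
        bids.append(W[N] - W[s[P - 1]])   # last processor bids its remaining load
        B = min(bids)
```

Because the greedy probe packs each processor maximally, every bid strictly exceeds the current infeasible B, so B only ever visits realizable values and never skips the optimum.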


4.4. Parametric search algorithms

In this work, we apply the theoretical findings of Section 4.1 to obtain an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with Bf = B_RB) to restrict the search space for the s_p separator values during the binary searches on the W array. That is, BINSRCH(W, SL_p, SH_p, Bsum) in RPROBE searches W in the index range [SL_p, SH_p] to find, via binary search, the index SL_p ≤ s_p ≤ SH_p such that W[s_p] ≤ Bsum and W[s_p + 1] > Bsum. This scheme, together with Corollaries 4.2 and 4.6, reduces the complexity of an individual probe to Σ_{p=1}^{P} Θ(log ΔS_p) = O(P log P + P log(w_max/w_avg)). Note that this complexity reduces to O(P log P) for sufficiently large P, i.e., for P = Ω(w_max/w_avg). Figs. 8-10 illustrate the RPROBE algorithms tailored to the respective parametric-search algorithms.
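A sketch of the restricted probe (our illustration; BINSRCH is realized with a range-limited binary search on the prefix-summed array):

```python
from bisect import bisect_right

def rprobe(W, P, B, SL, SH):
    """Greedy left-to-right feasibility test in which the binary search for
    separator s_p is confined to [SL[p], SH[p]].  W is the prefix-summed
    workload array; SL/SH are bounds in the spirit of Corollary 4.5."""
    N = len(W) - 1
    s = 0
    for p in range(1, P):
        lo = max(SL[p], s)                          # never move left of s_{p-1}
        s = bisect_right(W, W[s] + B, lo, SH[p] + 1) - 1
    return W[N] - W[s] <= B                         # last processor takes the rest
```

Each of the P - 1 searches now costs O(log ΔS_p) instead of O(log N), which is the source of the per-probe bound quoted above.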

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, B_RB], as opposed to [B*, W_tot]. In this algorithm, if PROBE(B_t) = TRUE, then the search space is restricted to the values B ≤ B_t, and if PROBE(B_t) = FALSE, then the search space is restricted to the values B > B_t. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator index-bounds depending on the success and failure of the probes. Let Π_t = ⟨t_0, t_1, ..., t_P⟩ be the partition constructed by PROBE(B_t). Any future PROBE(B) call with B ≤ B_t will set the s_p indices with s_p ≤ t_p; thus, the search space for s_p can be restricted to the indices ≤ t_p. Similarly, any future PROBE(B) call with B ≥ B_t will set the s_p indices ≥ t_p; thus, the search space for s_p can be restricted to the indices ≥ t_p.

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for loop over the SL or SH array, depending on the failure or success, respectively, of RPROBE(B_t). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π_t or SH ← Π_t, depending on the failure or success of RPROBE(B_t).

Fig. 10. Nicol's algorithm with dynamic separator-index bounding.

Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, B_RB < B* + w_max. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log w_max, and thus the overall complexity is O(N + P log N + log(w_max)(P log P + P log(w_max/w_avg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value, i.e., the total weight of some subchain of W. This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps to find the optimal bottleneck value.

After a probe RPROBE(B_t), the current upper bound value UB is modified if RPROBE(B_t) succeeds. Note that RPROBE(B_t) not only determines the feasibility of B_t but also constructs a partition Π_t with cost(Π_t) ≤ B_t. Instead of reducing the upper bound UB to B_t, we can further reduce UB to the bottleneck value B = cost(Π_t) ≤ B_t of the partition Π_t constructed by RPROBE(B_t). Similarly, the current lower bound LB is modified when RPROBE(B_t) fails. In this case, instead of increasing the bound LB to B_t, we can exploit the partition Π_t constructed by RPROBE(B_t) to increase LB further, to the smallest realizable bottleneck value B greater than B_t. Our bidding algorithm already describes how to compute

B = min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P},

where L_p denotes the load of processor P_p in Π_t. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(w_max/w_avg)),

which has the solution O(P log P log N + P log N log(w_max/w_avg)). Here, O(P log P + P log(w_max/w_avg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(w_max/w_avg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
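The realizable-bound updates can be sketched as follows (our illustration of the Fig. 9 idea, with a plain greedy probe standing in for RPROBE and no separator-index bounding):

```python
from bisect import bisect_right

def exact_bisect(w, P):
    """Exact bisection: after a successful probe, UB drops to the realizable
    cost of the constructed partition; after a failed probe, LB rises to the
    smallest realizable value above Bt (the minimum bid)."""
    W = [0]
    for x in w:
        W.append(W[-1] + x)
    N = len(w)
    LB, UB = W[N] / P, float(W[N])                  # [B*, Wtot]
    while LB < UB:
        Bt = (LB + UB) / 2
        s = [0]                                     # greedy probe at Bt
        for _ in range(P):
            s.append(bisect_right(W, W[s[-1]] + Bt) - 1)
        if s[P] == N:      # success: tighten UB to this partition's cost
            UB = max(W[s[p]] - W[s[p - 1]] for p in range(1, P + 1))
        else:              # failure: tighten LB to the next realizable value
            bids = [W[s[p] + 1] - W[s[p - 1]] for p in range(1, P) if s[p] < N]
            bids.append(W[N] - W[s[P - 1]])
            LB = min(bids)
    return UB
```

Since both bounds always move to weights of actual subchains, the loop walks a finite candidate set and exits with LB = UB equal to the optimal bottleneck value.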

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = W_tot), the success and failure of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1-1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π_t = ⟨0, t_1, t_2, ..., t_{P-1}, N⟩ to be constructed by any PROBE(B_t) with LB_1 < B_t = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 - 1, since W_{1,i_1-1} < B_t < W_{1,i_1}. Hence, probes with LB_1 < B_t < UB_1 for processor P_2 can be restricted to be performed on W_{i_1,N}, where W_{i_1,N} denotes the (N - i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T^{P_p}_i denote the CCP subproblem of the (P - p)-way partitioning of the (N - i + 1)th suffix T_{i,N} = ⟨t_i, t_{i+1}, ..., t_N⟩ of the task chain T onto the (P - p)th suffix P_{p,P} = ⟨P_{p+1}, P_{p+2}, ..., P_P⟩ of the processor chain P. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P - 1)-way partitioning of the suffix subchain T_{i_1,N}; that is, B_opt = min{B_1 = W_{1,i_1}, C(T^{P_1}_{i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, ..., i_p⟩ for the first p processors ⟨P_1, P_2, ..., P_p⟩ are determined, we have B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b-1},i_b}}, C(T^{P_p}_{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for loop, given i_{b-1}, the index i_b is found in the inner while loop by conducting probes on W_{i_{b-1},N} to compute B_b = W_{i_{b-1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted; that is, given i_{b-1}, the binary search for i_b over all subchain weights of the form W_{i_{b-1}+1,j} for i_{b-1} < j ≤ N is restricted to the W_{i_{b-1}+1,j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + w_max(P log P + P log(w_max/w_avg))) for integer task weights, because the number of probes cannot exceed w_max, since there are at most w_max distinct bottleneck values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + w_max(P log P)² + w_max P² log P log(w_max/w_avg)), since the algorithm makes O(w_max P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.
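The divide-and-conquer structure (without the dynamic separator-index bounding of Fig. 10) can be sketched as:

```python
from bisect import bisect_right

def nicol(w, P):
    """Nicol's algorithm, sketched: for each processor in turn, binary-search
    the smallest prefix whose weight is a feasible bottleneck for the
    remaining suffix subproblem; the minimum candidate is optimal."""
    W = [0]
    for x in w:
        W.append(W[-1] + x)
    N = len(w)

    def suffix_feasible(start, parts, B):
        # can tasks start+1..N be split into `parts` parts of load <= B?
        s = start
        for _ in range(parts):
            s = bisect_right(W, W[s] + B) - 1
        return s == N

    start, best = 0, float('inf')
    for b in range(1, P):
        lo, hi = start + 1, N
        while lo < hi:            # smallest j making W[j] - W[start] feasible
            mid = (lo + hi) // 2
            if suffix_feasible(start, P - b + 1, W[mid] - W[start]):
                hi = mid
            else:
                lo = mid + 1
        best = min(best, W[lo] - W[start])   # candidate B_b
        start = lo - 1                       # suffix subproblem begins at task lo
    return min(best, W[N] - W[start])        # last processor takes the rest
```

The improvements described in this section additionally restrict both the probe searches and the binary-search ranges for each i_b, which this plain sketch omits.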

5. Load balancing applications

5.1. Parallel sparse matrix-vector multiplication

Sparse matrix-vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix, and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A^r_p of A and is responsible for computing y_p = A^r_p x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A^c_p of A and is responsible for computing y^p = A^c_p x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-hard [9]. By allowing both row and column reordering, the problem of minimizing the communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements, due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = W_tot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{i,j} can be efficiently computed as W_{i,j} = ROW[j + 1] - ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS with rows and columns interchanged; thus, W_{i,j} is computed as W_{i,j} = COL[j + 1] - COL[i].
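For instance, with 0-based CRS arrays (the ROW pattern below is a hypothetical 4 × 4 matrix with 2, 1, 3, and 1 nonzeros in rows 0 through 3):

```python
# CRS row-pointer array: ROW[i] is the index of the first nonzero of row i.
ROW = [0, 2, 3, 6, 7]

def row_block_nnz(ROW, i, j):
    """Nonzeros in rows i..j inclusive, i.e. the subchain weight W_{i,j},
    read directly off the row-pointer array in O(1) with no prefix sum."""
    return ROW[j + 1] - ROW[i]
```

Here `row_block_nnz(ROW, 1, 2)` counts the nonzeros of rows 1 and 2 without touching the DATA or COL arrays, which is exactly why the prefix-sum step can be skipped.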

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight wi of the atomic task ti representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight Wij as Wij = COL[j+1] - COL[i] + 3(j - i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974-996

In sort-first parallel DVR, screen space is decomposed

into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage of processors generating complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive

[20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workloads, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number

of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. Primitives constituting the volume are tallied onto a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in

the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC processor and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. The triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 x 256, 512 x 512, and 1024 x 1024 superimposed on a screen of resolution 1024 x 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because the zero-weight tasks can be compressed away to retain only nonzero-weight tasks.

The following abbreviations are used for the CCP

algorithms: H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N - P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. eBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in

Table 2
Properties of the test set

Sparse-matrix dataset (workload = nonzeros per row/col (task); ex. time = single SpMxV time):

Name     No. of tasks N   Total Wtot   wavg    wmin   wmax   SpMxV (ms)
NL            7,039          105,089   14.93     1     361      22.55
cre-d         8,926          372,266   41.71     1     845      72.20
CQ9           9,278          221,590   23.88     1     702      45.90
ken-11       14,694           82,454    5.61     2     243      19.65
mod2         34,774          604,910   17.40     1     941     124.05
world        34,506          582,064   16.87     1     972     119.45

Direct volume rendering (DVR) dataset:

Name        No. of tasks N   Total Wtot   wavg    wmin     wmax
blunt256         17,303         303 K     17.54   0.020   159.097
blunt512         93,231         314 K      3.36   0.004    66.150
blunt1024       372,824         352 K      0.94   0.001    41.104
post256          19,653         495 K     25.19   0.077   324.550
post512         134,950         569 K      4.22   0.015   109.200
post1024        539,994         802 K      1.49   0.004   154.678

Figs. 5(a) and (b), respectively. Abbreviations ending with "+" are used to represent our improved versions of these algorithms. That is, DP+, eBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the eBS and eBS+ algorithms are effectively used as exact algorithms for the SpM dataset with ε = 1, since the workload arrays in the SpM dataset are integer valued. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics

and exact algorithms. In this table, percent load-imbalance values are computed as 100 (B - B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e., the sum over p = 1, ..., P-1 of ΔSp/N, where ΔSp = SHp - SLp + 1. For each CCP instance, (N - P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N - P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index-bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P <= 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and eBS+ on the DVR dataset, we forced the eBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. The column eBS+&BID refers to this scheme, and values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the eBS and eBS+ columns in the SpM dataset shows that using B_RB instead of Wtot

Table 3
Percent load imbalance values

Sparse-matrix dataset:

Name     P     H1      H2      RB      OPT
NL       16    2.60    2.44    1.20    0.35
         32    5.02    5.75    3.44    0.95
         64    8.95    9.01    5.60    2.37
         128  33.13   27.16   22.78    4.99
         256  69.55   69.55   60.78   14.25
cre-d    16    2.27    0.98    0.53    0.45
         32    4.19    4.42    3.74    1.03
         64    7.12    4.92    4.34    1.73
         128  25.57   18.73   16.70    2.88
         256  37.54   26.81   35.20   10.95
CQ9      16    1.85    1.85    0.58    0.58
         32    5.65    2.88    2.24    0.90
         64   13.25   11.49    7.64    1.43
         128  33.96   32.22   22.34    3.51
         256  58.62   58.62   58.62   14.72
ken-11   16    3.74    2.01    0.98    0.21
         32    3.74    4.67    3.74    1.18
         64   13.17   13.17   13.17    1.29
         128  13.17   16.89   13.17    6.80
         256  50.99   50.99   50.99    7.11
mod2     16    0.06    0.06    0.06    0.03
         32    0.19    0.19    0.19    0.07
         64    7.42    2.72    2.18    0.18
         128  16.15    6.29    2.46    0.41
         256  19.47   19.47   18.92    1.23
world    16    0.27    0.18    0.09    0.04
         32    1.03    0.37    0.27    0.08
         64    4.73    4.73    4.73    0.28
         128   6.37    6.37    6.37    0.76
         256  27.99   27.41   27.41    1.11

Averages over P:
         16    1.76    1.15    0.74    0.36
         32    3.99    3.39    2.88    0.76
         64    9.01    6.78    5.69    1.43
         128  19.97   17.66   14.09    3.36
         256  43.89   38.91   36.04    9.18

DVR dataset:

Name       P     H1      H2      RB      OPT
blunt256   16    4.60    1.20    0.49    0.34
           32    6.93    2.61    1.94    1.12
           64   14.52    9.44    9.44    2.31
           128  38.25   24.39   16.67    4.82
           256  96.72   37.03   34.21   34.21
blunt512   16    0.95    0.98    0.98    0.16
           32    1.38    1.38    1.18    0.33
           64    2.87    2.69    1.66    0.53
           128   5.62    8.45    4.62    0.97
           256  14.18   14.33    9.34    2.28
blunt1024  16    0.94    0.57    0.95    0.10
           32    1.89    1.21    0.97    0.14
           64    4.99    2.16    1.44    0.26
           128  10.06    4.25    2.47    0.57
           256  19.65    9.68    9.68    0.98
post256    16    1.10    1.43    0.76    0.56
           32    3.23    3.98    3.23    1.11
           64   17.28   11.04   10.90    3.10
           128  45.35   29.09   29.09    8.29
           256  67.86   67.86   67.86   67.86
post512    16    0.49    1.25    0.33    0.33
           32    0.94    1.61    0.90    0.58
           64    4.85    5.33    4.85    0.94
           128  18.03   14.55   10.15    1.72
           256  30.03   25.29   25.29    3.73
post1024   16    0.70    0.70    0.53    0.20
           32    1.49    1.49    1.41    0.54
           64    2.85    2.85    1.49    0.91
           128  13.15   11.79    9.10    1.11
           256  40.50   13.37   14.19    2.54

Averages over P:
           16    1.44    1.01    1.04    0.28
           32    2.64    2.05    1.61    0.64
           64    7.89    5.58    4.96    1.31
           128  21.74   15.42   12.02    2.91
           256  44.82   27.93   26.76   18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the eBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in ms) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of existing algorithms and their improved versions are listed in the same order under each respective classification so that improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both eBS and eBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither eBS nor eBS+ can be used as an exact algorithm on the DVR dataset, EBS, as

Table 4
Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset:

Name     P     (N-P)P    L1       L3      C5
NL       16     15.97     0.32    0.13    0.036
         32     31.86     1.23    0.58    0.198
         64     63.43     4.97    2.24    0.964
         128   125.69    19.81   19.60    4.305
         256   246.73    77.96   82.83   22.942
cre-d    16     15.97     0.16    0.04    0.019
         32     31.89     0.76    0.90    0.137
         64     63.55     2.89    1.85    0.610
         128   126.18    11.59   15.12    2.833
         256   248.69    46.66   57.54   16.510
CQ9      16     15.97     0.30    0.03    0.023
         32     31.89     1.21    0.34    0.187
         64     63.57     4.84    2.96    0.737
         128   126.25    19.20   18.66    3.121
         256   248.96    74.84   86.80   23.432
ken-11   16     15.98     0.28    0.11    0.024
         32     31.93     1.10    1.13    0.262
         64     63.73     4.42    7.34    0.655
         128   126.89    17.60   14.56    4.865
         256   251.56    69.18   85.44   13.076
mod2     16     15.99     0.16    0.00    0.002
         32     31.97     0.55    0.04    0.013
         64     63.88     2.14    1.22    0.109
         128   127.53     8.62    2.59    0.411
         256   254.12    34.49   39.02    2.938
world    16     15.99     0.15    0.01    0.004
         32     31.97     0.59    0.06    0.020
         64     63.88     2.30    2.81    0.158
         128   127.53     9.23    7.19    0.850
         256   254.11    36.97   53.15    2.635

Averages over P:
         16     15.97     0.21    0.06    0.018
         32     31.88     0.86    0.68    0.136
         64     63.52     3.43    2.63    0.516
         128   126.07    13.68   12.74    2.731
         256   248.26    54.28   57.55   14.089

DVR dataset:

Name       P     (N-P)P    L1        L3      C5
blunt256   16     15.99      0.38    0.07    0.003
           32     31.94      1.22    0.30    0.144
           64     63.77      5.10    3.81    0.791
           128   127.06     21.66   12.88    2.784
           256   252.23    101.68   50.48   10.098
blunt512   16     16.00      0.09    0.17    0.006
           32     31.99      0.35    0.26    0.043
           64     63.96      1.64    0.69    0.144
           128   127.83      6.65    4.59    0.571
           256   255.30     28.05   16.71    2.262
blunt1024  16     16.00      0.05    0.09    0.010
           32     32.00      0.18    0.18    0.018
           64     63.99      0.77    0.74    0.068
           128   127.96      3.43    2.53    0.300
           256   255.82     13.92   20.36    1.648
post256    16     15.99      0.35    0.04    0.021
           32     31.95      1.50    0.52    0.162
           64     63.79      6.12    4.26    0.932
           128   127.17     26.48   23.68    4.279
           256   252.68    124.76   90.48   16.485
post512    16     16.00      0.13    0.02    0.003
           32     31.99      0.56    0.14    0.063
           64     63.97      2.33    2.41    0.413
           128   127.88      9.33   10.15    1.714
           256   255.52     37.08   46.28    6.515
post1024   16     16.00      0.17    0.07    0.020
           32     32.00      0.54    0.27    0.087
           64     63.99      2.19    0.61    0.210
           128   127.97      8.77    9.48    0.918
           256   255.88     34.97   27.28    4.479

Averages over P:
           16     16.00      0.20    0.08    0.011
           32     31.98      0.73    0.28    0.096
           64     63.91      3.03    2.09    0.426
           128   127.65     12.72   10.55    1.428
           256   254.57     56.74   41.93    6.915

L1, L3, and C5 denote separator-index ranges obtained according to Lemmas 4.1 and 4.3 and Corollary 4.5.

an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and eBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that

Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset:

Name     P     NC-    NC    NC+   eBS   eBS+   EBS
NL       16    177    21     7    17     6      7
         32    352    19     7    17     7      7
         64    720    27     5    16     7      6
         128  1450    40    14    17     8      8
         256  2824    51    14    16     8      8
cre-d    16    190    20     6    19     7      5
         32    393    25    11    19     9      8
         64    790    24    10    19     8      6
         128  1579    27    10    19     9      9
         256  3183    37    13    18     9      9
CQ9      16    182    19     3    18     6      5
         32    373    22     8    18     8      7
         64    740    29    12    18     8      8
         128  1492    40    12    17     8      8
         256  2971    50    14    18     9      9
ken-11   16    185    21     6    17     6      6
         32    364    20     6    17     6      6
         64    721    48     9    17     7      7
         128  1402    61     8    16     7      7
         256  2783    96     7    17     8      8
mod2     16    210    20     6    20     5      4
         32    432    26     8    19     5      5
         64    867    24     6    19     8      8
         128  1727    38     7    20     7      7
         256  3444    47     9    20     9      9
world    16    211    22     6    19     5      4
         32    424    26     5    19     6      6
         64    865    30    12    19     9      9
         128  1730    44    10    19     8      8
         256  3441    44    11    19    10     10

Averages over P:
         16    193   20.5   5.7  18.5   5.8    5.2
         32    390   23.0   7.5  18.3   6.8    6.5
         64    784   30.3   9.0  18.0   7.8    7.3
         128  1564   41.7  10.2  18.0   7.8    7.8
         256  3108   54.2  11.3  18.0   8.8    8.8

DVR dataset:

Name       P     NC-    NC    NC+   eBS+&BID   EBS
blunt256   16    202    18     6    16 + 1      9
           32    407    19     7    19 + 1     10
           64    835    18    10    21 + 1     12
           128  1686    25    15    22 + 1     13
           256  3556   203    62    23 + 1     11
blunt512   16    245    20     8    17 + 1      9
           32    502    24    13    18 + 1     10
           64   1003    23    11    18 + 1     12
           128  2023    26    12    20 + 1     13
           256  4051    23    12    21 + 1     13
blunt1024  16    271    21    10    16 + 1     13
           32    557    24    12    18 + 1     12
           64   1127    21    11    18 + 1     12
           128  2276    28    13    19 + 1     13
           256  4564    27    16    21 + 1     15
post256    16    194    22     9    18 + 1      7
           32    409    19     8    20 + 1     12
           64    823    17    11    22 + 1     12
           128  1653    19    13    22 + 1     14
           256  3345   191   102    23 + 1     13
post512    16    235    24     9    16 + 1     10
           32    485    23    12    18 + 1     11
           64    975    22    13    20 + 1     14
           128  1975    25    17    21 + 1     14
           256  3947    29    20    22 + 1     15
post1024   16    261    27    10    17 + 1     13
           32    538    23    13    19 + 1     14
           64   1090    22    12    17 + 1     14
           128  2201    25    15    21 + 1     16
           256  4428    28    16    22 + 1     16

Averages over P:
           16    235   22.0   8.7   16.7 + 1   10.2
           32    483   22.0  10.8   18.7 + 1   11.5
           64    976   20.5  11.3   19.3 + 1   12.7
           128  1969   24.7  14.2   20.8 + 1   13.8
           256  3982   83.5  38.0   22.0 + 1   13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average for 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the

Table 6
Partitioning times for the sparse-matrix dataset, as percents of SpMxV times

              Heuristics            Existing exact           Proposed exact
Name    P     H1    H2    RB    |   DP     MS     eBS     NC    |   DP+     MS+     eBS+    NC+   |   BID
NL      16    0.09  0.09  0.09  |    93    119    0.93    1.20  |    0.40    0.44    0.40    0.40 |   0.22
        32    0.18  0.18  0.13  |   177    194    1.77    2.17  |    1.73    1.73    0.80    0.80 |   0.44
        64    0.35  0.35  0.31  |   367    307    3.15    5.85  |    5.81    4.12    1.86    1.77 |   1.82
        128   0.89  0.84  0.71  |   748    485    6.30   17.12  |   19.87   26.92    4.26    7.32 |   4.21
        256   1.51  1.55  1.37  |  1461    757   11.09   40.31  |   91.80   96.41    9.80   15.70 |  21.06
cre-d   16    0.03  0.01  0.01  |    33     36    0.33    0.37  |    0.12    0.06    0.10    0.11 |   0.03
        32    0.04  0.06  0.06  |    71     61    0.65    0.93  |    0.47    1.20    0.29    0.33 |   0.10
        64    0.12  0.12  0.08  |   147     98    1.22    1.69  |    1.66    1.83    0.61    0.71 |   0.21
        128   0.28  0.29  0.14  |   283    150    2.41    3.59  |    5.47   10.94    1.68    1.68 |   0.61
        256   0.53  0.54  0.25  |   571    221    4.20   10.22  |   27.05   26.02    3.30    4.46 |   1.20
CQ9     16    0.04  0.04  0.04  |    59     73    0.48    0.59  |    0.20    0.13    0.15    0.13 |   0.17
        32    0.09  0.09  0.07  |   127    120    0.94    1.31  |    0.94    0.72    0.41    0.41 |   0.28
        64    0.20  0.17  0.15  |   248    195    1.85    3.18  |    3.01    4.47    1.00    1.44 |   0.68
        128   0.44  0.44  0.31  |   485    303    3.18    8.52  |    9.65   18.82    2.44    3.05 |   2.51
        256   0.92  0.92  0.63  |   982    469    6.51   20.33  |   69.56   72.20    5.32    7.45 |  14.92
ken-11  16    0.10  0.10  0.10  |   238    344    1.27    1.88  |    0.56    0.61    0.46    0.41 |   0.25
        32    0.20  0.20  0.15  |   501    522    2.29    3.10  |    3.82    4.78    1.22    1.17 |   1.63
        64    0.46  0.46  0.36  |   993    778    4.48   13.08  |    9.41   24.78    3.10    3.46 |   2.29
        128   1.17  1.17  0.71  |  1876   1139    9.21   39.59  |   50.13   50.03    6.41    6.62 |  17.40
        256   2.14  2.14  1.42  |  3827   1706   17.76  119.13  |  129.01  241.48   14.91   14.50 |  29.11
mod2    16    0.02  0.02  0.01  |    92    119    0.27    0.34  |    0.07    0.05    0.06    0.06 |   0.03
        32    0.04  0.04  0.02  |   188    192    0.51    0.78  |    0.19    0.18    0.16    0.15 |   0.07
        64    0.11  0.10  0.06  |   378    307    1.04    1.91  |    0.86    2.18    0.51    0.55 |   0.23
        128   0.24  0.23  0.11  |   751    482    2.28    5.92  |    2.61    3.72    1.06    1.23 |   0.53
        256   0.48  0.48  0.23  |  1486    756    4.26   11.14  |   12.11   50.04    2.91    3.43 |   3.66
world   16    0.02  0.02  0.02  |    99    122    0.29    0.33  |    0.08    0.05    0.07    0.08 |   0.03
        32    0.04  0.04  0.04  |   198    196    0.59    0.88  |    0.26    0.21    0.18    0.19 |   0.08
        64    0.12  0.11  0.09  |   385    306    1.16    1.70  |    1.09    4.84    0.64    0.54 |   0.35
        128   0.24  0.23  0.19  |   769    496    2.19    5.32  |    4.48   10.15    1.31    1.23 |   1.77
        256   0.47  0.47  0.36  |  1513    764    4.06   11.83  |   11.54   69.99    3.46    3.50 |   2.48

Averages over P:
        16    0.05  0.05  0.05  |   102    136    0.60    0.78  |    0.24    0.22    0.21    0.20 |   0.12
        32    0.10  0.10  0.08  |   210    214    1.12    1.53  |    1.23    1.47    0.51    0.51 |   0.43
        64    0.23  0.22  0.18  |   420    332    2.15    4.57  |    3.64    7.04    1.29    1.41 |   0.93
        128   0.54  0.53  0.36  |   819    509    4.26   13.34  |   15.37   20.10    2.86    3.52 |   4.51
        256   1.01  1.01  0.71  |  1640    779    7.98   35.49  |   56.84   92.69    6.62    8.17 |  12.07

SpM dataset, eBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than eBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and eBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in eBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small to moderate P values (i.e., P <= 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both eBS+ and NC+ begin to run faster than BID for P >= 128 in general, where eBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P <= 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time

Table 7
Partitioning times (in ms) for the DVR dataset

The prefix-sum time is a single per-instance cost, listed next to each dataset name.

                               Heuristics          Existing exact        Proposed exact
Name (prefix sum)        P     H1    H2    RB   |    DP      MS      NC   |    DP+      MS+     NC+    EBS   |    BID
blunt256 (1.95)          16    0.02  0.02  0.03 |     68      49    0.36  |    0.21     0.53    0.10   0.14  |    0.03
                         32    0.05  0.05  0.05 |    141      77    0.77  |    0.76     0.76    0.21   0.27  |    0.24
                         64    0.11  0.12  0.09 |    286     134    1.39  |    3.86     8.72    0.64   0.71  |    0.78
                         128   0.24  0.25  0.20 |    581     206    3.66  |   12.37    29.95    1.78   1.84  |    2.33
                         256   0.47  0.50  0.37 |   1139     296   52.53  |   42.89     0.78   13.96   3.11  |   56.69
blunt512 (13.45)         16    0.03  0.03  0.03 |    356     200    0.55  |    0.25     0.91    0.14   0.16  |    0.09
                         32    0.07  0.07  0.07 |    792     353    1.31  |    1.23     2.13    0.44   0.43  |    0.22
                         64    0.18  0.19  0.14 |   1688     593    2.65  |    3.68     9.28    0.94   1.10  |    0.45
                         128   0.39  0.40  0.28 |   3469     979    5.84  |   15.01    46.72    2.29   2.63  |    1.65
                         256   0.74  0.75  0.54 |   7040    1637    9.70  |   57.50   164.49    5.59   5.95  |   10.72
blunt1024 (59.12)        16    0.03  0.03  0.03 |   1455     780    0.75  |    0.93     4.77    0.25   0.28  |    0.19
                         32    0.12  0.11  0.08 |   3251    1432    1.81  |    2.02     8.92    0.60   0.62  |    0.31
                         64    0.27  0.27  0.19 |   6976    2353    3.50  |    7.35    22.28    1.27   1.37  |    1.39
                         128   0.50  0.51  0.37 |  14337    3911    9.14  |   33.01    69.38    3.36   3.56  |   16.21
                         256   0.97  1.00  0.68 |  29417    6567   16.59  |  180.09   538.10    8.48   8.98  |   16.29
post256 (2.16)           16    0.02  0.03  0.03 |     79      61    0.46  |    0.18     0.13    0.10   0.11  |    0.24
                         32    0.05  0.05  0.05 |    157     100    0.77  |    1.05     1.74    0.26   0.31  |    0.76
                         64    0.10  0.11  0.10 |    336     167    1.37  |    5.26     9.72    0.61   0.70  |    4.67
                         128   0.25  0.26  0.18 |    672     278    2.90  |   22.94    55.10    1.65   1.77  |   23.96
                         256   0.47  0.48  0.37 |   1371     338   45.16  |   84.78     0.77   13.04   3.62  |  299.32
post512 (20.55)          16    0.02  0.02  0.03 |    612     454    0.63  |    0.20     0.44    0.14   0.16  |    1.21
                         32    0.07  0.07  0.07 |   1254     699    1.30  |    2.49     1.95    0.38   0.42  |    3.36
                         64    0.18  0.18  0.13 |   2559    1139    2.58  |   16.93    38.73    0.97   1.07  |    8.53
                         128   0.38  0.39  0.28 |   5221    1936    5.84  |   68.25   156.15    2.27   2.44  |   31.54
                         256   0.70  0.73  0.53 |  10683    3354   12.67  |  256.48   726.54    5.44   5.39  |  140.88
post1024 (69.09)         16    0.03  0.04  0.03 |   2446    1863    0.91  |    2.88     5.73    0.17   0.21  |    2.63
                         32    0.09  0.08  0.08 |   5056    2917    1.61  |   13.57    17.37    0.51   0.58  |   12.73
                         64    0.25  0.26  0.17 |  10340    4626    3.56  |   35.05    33.25    1.35   1.47  |   32.53
                         128   0.51  0.53  0.36 |  21102    7838    7.90  |  156.34   546.92    2.95   3.28  |   76.96
                         256   0.95  0.98  0.67 |  43652   13770   16.30  |  764.59  1451.87    7.55   7.02  |  300.70

Averages normalized with respect to RB times (prefix-sum time not included):
                         16    0.83  0.94  1.00 |  27863   18919   20.33  |   25.83    69.50    5.00   5.89  |   24.39
                         32    1.10  1.06  1.00 |  23169   12156   18.47  |   47.37    72.82    5.83   6.46  |   39.02
                         64    1.30  1.35  1.00 |  22636    9291   17.88  |   82.81   145.19    7.00   7.81  |   53.81
                         128   1.35  1.39  1.00 |  22506    7553   20.46  |  168.36   481.19    8.60   9.31  |   86.81
                         256   1.35  1.39  1.00 |  24731    6880   59.10  |  390.25   772.99   19.55  10.51  |  286.77

Averages normalized with respect to RB times (prefix-sum time included):
                         16    1.00  1.00  1.00 |     32      23    1.08  |    1.04     1.09    1.01   1.02  |    1.03
                         32    1.00  1.00  1.00 |     66      36    1.15  |    1.21     1.29    1.04   1.05  |    1.13
                         64    1.00  1.00  1.00 |    135      58    1.27  |    1.97     2.90    1.11   1.12  |    1.55
                         128   1.01  1.01  1.00 |    275      94    1.62  |    4.75    10.53    1.28   1.30  |    3.25
                         256   1.02  1.02  1.00 |    528     142    7.98  |   14.64    13.71    2.95   1.55  |   26.73

A Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996994

for P <= 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


we find their model too restrictive, since the minimum and maximum task weights can deviate substantially from wavg. Here, we establish a looser and more realistic model on task weights, so that for any subchain T_{i,j} with weight W_{i,j} sufficiently larger than wmax, the average task weight within subchain T_{i,j} is O(wavg). That is, Δ_{i,j} = j − i + 1 = O(W_{i,j}/wavg). This model, referred to here as model M, directly induces ΔS_p = O(P wmax/wavg), since ΔW_p ≤ ΔW_{P/2} = P wmax/2 for p = 1, 2, …, P − 1.

4.2. Dynamic-programming algorithm with static separator-index bounding

The proposed DP algorithm, referred to here as the DP+ algorithm, exploits bounds on the separator indices for an efficient solution. Fig. 6 illustrates the proposed DP+ algorithm, where the input parameters SL and SH denote the index bound arrays, each of size P, computed according to Corollary 4.5 with Bf = BRB. Note that SL_P = SH_P = N, since only B^P_N needs to be computed in the last row. As seen in Fig. 6, only the B^p_j values for j = SL_p, SL_p + 1, …, SH_p are computed at each row p by exploiting Corollary 4.5, which ensures the existence of an optimal partition Popt = ⟨s_0, s_1, …, s_P⟩ with SL_p ≤ s_p ≤ SH_p. Thus, these B^p_j values suffice for correct computation of the B^{p+1}_i values for i = SL_{p+1}, SL_{p+1} + 1, …, SH_{p+1} at the next row p + 1.

Fig. 6. Dynamic-programming algorithm with static separator-index bounding.

As seen in Fig. 6, explicit range checking is avoided in this algorithm for utmost efficiency. However, the j-index may proceed beyond SH_p to SH_p + 1 within the repeat–until loop while computing B^{p+1}_i with SL_{p+1} ≤ i ≤ SH_{p+1} in two cases. In both cases, the functions W_{j+1,i} and B^p_j intersect in the open interval (SH_p, SH_p + 1), so that B^p_{SH_p} < W_{SH_p+1,i} and B^p_{SH_p+1} ≥ W_{SH_p+2,i}. In the first case, i = SH_p + 1, so that W_{j+1,i} and B^p_j intersect in (i − 1, i), which implies that B^{p+1}_i = W_{i,i} with j^{p+1}_i = i − 1, since W_{i,i} < B^p_i, as mentioned in Section 3.2. In the second case, i > SH_p + 1, for which Corollary 4.5 guarantees that B^{p+1}_i = W_{SL_p+1,i} ≤ B^p_{SH_p+1}, and thus we can safely select j^{p+1}_i = SL_p. Note that W_{SL_p+1,i} = B^p_{SH_p+1} may correspond to a case leading to another optimal partition with j^{p+1}_i = s_{p+1} = SH_p + 1. As seen in Fig. 6, both cases are efficiently resolved simply by storing ∞ in B^p_{SH_p+1} as a sentinel. Hence, in such cases, the condition W_{SH_p+1,i} < B^p_{SH_p+1} = ∞ in the if–then statement following the repeat–until loop is always true, so that the j-index automatically moves back to SH_p. The scheme of computing B^p_{SH_p+1} for each row p, which seems to be a natural solution, does not work, since correct computation of B^{p+1}_{SH_{p+1}+1} may necessitate more than one B^p_j value beyond the SH_p index bound.

A nice feature of the DP approach is that it can be used to generate all optimal partitions, by maintaining a P × N matrix to store the minimum j^p_i index values defining the B^p_i values, at the expense of increased execution time and increased asymptotic space requirement. Recall that the index bounds SL and SH computed according to Corollary 4.5 restrict the search space to at least one optimal solution. For this purpose, the index bounds can instead be computed according to Lemma 4.4, since the search space restricted by Lemma 4.4 includes all optimal solutions.

The running time of the proposed DP+ algorithm is O(N + P log N) + Σ_{p=1}^{P} Θ(ΔS_p). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5. Under model M, ΔS_p = O(P wmax/wavg), and hence the complexity is O(N + P log N + P² wmax/wavg). The algorithm becomes linear in N when the separator-index ranges do not overlap, which is guaranteed by the condition wmax = O(2Wtot/P²).
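The recurrence underlying the DP+ table can be sketched as follows. This is an illustrative Python rendering (not the paper's C implementation): the separator-index windows SL/SH, which DP+ derives from Corollary 4.5 via the RB heuristic, are taken here as caller-supplied inputs and default to the trivial full ranges.

```python
# Illustrative sketch of the CCP dynamic program with optional separator-index
# windows SL/SH. B[p][j] is the optimal bottleneck value for assigning tasks
# 1..j to p processors; W(i, j) is a subchain weight read off the prefix-sum
# array. With the trivial full windows this is the plain DP; tighter windows
# shrink the number of table entries computed per row, as in DP+.
from itertools import accumulate

def ccp_dp(w, P, SL=None, SH=None):
    N = len(w)
    pref = [0] + list(accumulate(w))              # pref[j] = w_1 + ... + w_j
    W = lambda i, j: pref[j] - pref[i - 1]        # weight of subchain t_i..t_j
    SL = SL or list(range(P + 1))                 # trivial window lower bounds
    SH = SH or [N - P + p for p in range(P + 1)]  # trivial window upper bounds
    INF = float("inf")
    B = [[INF] * (N + 1) for _ in range(P + 1)]
    for j in range(SL[1], SH[1] + 1):
        B[1][j] = W(1, j)                         # row 1: a single processor
    for p in range(2, P + 1):
        for j in range(SL[p], SH[p] + 1):
            # best last separator s: minimize max(B[p-1][s], W(s+1, j))
            B[p][j] = min(max(B[p - 1][s], W(s + 1, j))
                          for s in range(SL[p - 1], min(SH[p - 1], j - 1) + 1))
    return B[P][N]
```

For example, ccp_dp([4, 1, 3, 2, 5], 2) evaluates to 8, realized by the split ⟨4, 1, 3 | 2, 5⟩.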

4.3. Iterative refinement algorithms

In this work, we improve the MS algorithm and propose a novel CCP algorithm, namely the bidding algorithm, which is run-time efficient for small-to-medium numbers of processors. The main difference between the MS and the bidding algorithms is as follows: the MS algorithm moves along a series of feasible bottleneck values, whereas the bidding algorithm moves along a sequence of infeasible bottleneck values, so that the first feasible bottleneck value it reaches becomes the optimal value.

4.3.1. Improving the MS algorithm

The performance of the MS algorithm strongly depends on the initial partition. The initial partition proposed by Manne and Sørevik [23] satisfies the leftist partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as the initial partition. This partition is also a leftist partition, since moving any separator to the left will not decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.

Fig. 7. Bidding algorithm.
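The PROBE(B*)-based initialization can be sketched as follows (an illustrative Python rendering; the function names are ours, not the paper's). PROBE greedily fills each processor up to the trial bottleneck value by binary search on the prefix-sum array, and the remainder is assigned to the last processor, yielding a leftist initial partition from B* = Wtot/P.

```python
# Hedged sketch: greedy PROBE plus a PROBE(B*)-based initial partition.
# probe() returns separator indices <s_0, ..., s_P>, where part p covers
# tasks s_{p-1}+1 .. s_p; the probed value B is feasible iff s_P == N.
from itertools import accumulate
import bisect

def probe(pref, P, B):
    seps = [0]
    for _ in range(P):
        # largest j with pref[j] <= pref[s_{p-1}] + B, via binary search
        j = bisect.bisect_right(pref, pref[seps[-1]] + B) - 1
        seps.append(max(j, seps[-1]))
    return seps

def initial_partition(w, P):
    pref = [0] + list(accumulate(w))
    B_star = pref[-1] / P           # ideal (perfectly balanced) bottleneck B*
    seps = probe(pref, P, B_star)
    seps[-1] = len(w)               # last processor absorbs the remainder
    return seps
```

On w = [4, 1, 3, 2, 5] with P = 2 this yields separators [0, 2, 5], i.e., loads 5 and 10; moving the internal separator left cannot decrease the bottleneck load, so the partition is leftist.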

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Pt = ⟨s_0, s_1, …, s_P⟩ constructed by PROBE(Bt) for an infeasible Bt. After detecting the infeasibility of this Bt value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > Bt will never be to the left of the respective separator indices of Pt. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current Bt value (i.e., L_P > Bt). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P}. Here, we call the value L_p + w_{s_p+1} the bid of processor P_p, which refers to the load of P_p if the first task t_{s_p+1} of the next processor is augmented to P_p. The bid of the last processor P_P is equal to the load of the remaining tasks. If the smallest bid B comes from processor P_b, probing with the new B is performed only for the remaining processors ⟨P_b, P_{b+1}, …, P_P⟩ in the suffix W_{s_{b−1}+1,N} of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while-loop implements a linear probing scheme, such that the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index s_p is set for processor P_p during linear probing, the repeat–until loop terminates if it is not possible to partition the remaining subchain T_{s_p+1,N} onto the remaining P − p processors without exceeding the current B value, i.e., rbid = L_r/(P − p) > B, where L_r denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value of the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P wmax + P²(wmax/wavg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial setting of the separators through binary search. The B value is increased at most Bopt − B* < wmax times, and each time the next B value can be computed in O(P) time, which induces the O(P wmax) cost. The total area scanned by the separators is at most O(P²(wmax/wavg)). For noninteger task weights, the complexity can reach O(P log N + P³(wmax/wavg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
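The core loop of the bidding idea can be sketched as follows. This is a deliberately simplified illustration: it re-probes from scratch with plain binary searches, omitting the linear probing, the BIDS prefix-minimum array, and the suffix restarts described above, but it follows the same bid rule for raising B.

```python
# Simplified sketch of the bidding algorithm: starting from B* = Wtot/P,
# repeatedly probe; on failure, raise B to the smallest bid
# min{ min_{p<P} (L_p + w_{s_p+1}), L_P }, so that the first feasible
# bottleneck value reached is optimal.
from itertools import accumulate
import bisect

def bidding(w, P):
    N = len(w)
    pref = [0] + list(accumulate(w))
    B = pref[-1] / P                               # ideal bottleneck value B*
    while True:
        seps = [0]
        for _ in range(P):                         # greedy probe with value B
            j = bisect.bisect_right(pref, pref[seps[-1]] + B) - 1
            seps.append(max(j, seps[-1]))
        if seps[P] == N:                           # feasible, hence optimal
            return B
        bids = [pref[seps[p]] - pref[seps[p - 1]] + w[seps[p]]  # L_p + w_{s_p+1}
                for p in range(1, P) if seps[p] < N]
        bids.append(pref[N] - pref[seps[P - 1]])   # last processor's bid
        B = min(bids)
```

On w = [4, 1, 3, 2, 5] with P = 2, the first probe with B* = 7.5 fails, B rises to the smallest bid 8, and the next probe succeeds, returning the optimal bottleneck value 8.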


4.4. Parametric search algorithms

In this work, we apply the theoretical findings given in Section 4.1 for an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with Bf = BRB) to restrict the search space for the s_p separator values during the binary searches on the W array. That is, BINSRCH(W, SL_p, SH_p, Bsum) in RPROBE searches W in the index range [SL_p, SH_p] to find the index SL_p ≤ s_p ≤ SH_p such that W[s_p] ≤ Bsum and W[s_p + 1] > Bsum, via binary search. This scheme, together with Corollaries 4.2 and 4.6, reduces the complexity of an individual probe to Σ_{p=1}^{P} Θ(log Δ_p) = O(P log P + P log(wmax/wavg)). Note that this complexity reduces to O(P log P) for sufficiently large P, where P = Ω(wmax/wavg). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.
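A hedged sketch of the restricted probe: it is identical to the plain greedy probe except that each separator's binary search is confined to its window [SL_p, SH_p]. The windows are taken as inputs here; in the text they come from Corollary 4.5.

```python
# Sketch of RPROBE. pref is the prefix-summed workload array; SL[p]/SH[p]
# bound separator s_p for 1 <= p <= P-1. Restricting each binary search to
# its window reduces a probe to O(sum_p log(SH_p - SL_p)) time.
import bisect

def rprobe(pref, P, B, SL, SH):
    N = len(pref) - 1
    s = 0
    for p in range(1, P):
        lo = max(SL[p], s)
        # largest index in [lo, SH[p]] whose load since s fits within B
        j = bisect.bisect_right(pref, pref[s] + B, lo, SH[p] + 1) - 1
        s = max(j, s)
    return pref[N] - pref[s] <= B   # last processor takes the remainder
```

For pref = [0, 4, 5, 8, 10, 15] (tasks 4, 1, 3, 2, 5), P = 2, and the single separator restricted to [1, 4], a probe with B = 8 succeeds while a probe with B = 7 fails.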

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, BRB], as opposed to [B*, Wtot]. In this algorithm, if PROBE(Bt) = TRUE, then the search space is restricted to the values B ≤ Bt, and if PROBE(Bt) = FALSE, then the search space is restricted to the values B > Bt. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator index-bounds depending on the success or failure of the probes. Let Pt = ⟨t_0, t_1, …, t_P⟩ be the partition constructed by PROBE(Bt). Any future PROBE(B) call with B ≤ Bt will set the s_p indices with s_p ≤ t_p; thus, the search space for s_p can be restricted to the indices ≤ t_p. Similarly, any future PROBE(B) call with B ≥ Bt will set the s_p indices ≥ t_p; thus, the search space for s_p can be restricted to the indices ≥ t_p.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for-loop over the SL or SH array, depending on the failure or success, respectively, of RPROBE(Bt). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← P or SH ← P, depending on the failure or success of RPROBE(Bt).

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.
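This scheme can be sketched as follows, in our illustrative Python (with ε = 1, and with the simple always-feasible bound B* + wmax standing in for BRB): a successful probe installs the constructed partition as the new SH array, a failed one installs it as the new SL array, mirroring the O(1) pointer swap described above.

```python
# Sketch of epsilon-bisection over [B*, B* + wmax] with dynamic separator-
# index bounding: the partition built by each probe becomes the new SH (on
# success) or SL (on failure) window array. Returns a bottleneck value within
# eps of optimal; eps = 1 makes the result exact for integer task weights.
from itertools import accumulate
import bisect

def eps_bisect(w, P, eps=1):
    N = len(w)
    pref = [0] + list(accumulate(w))

    def probe(B, SL, SH):
        seps = [0] * (P + 1)
        for p in range(1, P):
            lo = max(SL[p], seps[p - 1])
            j = bisect.bisect_right(pref, pref[seps[p - 1]] + B, lo, SH[p] + 1) - 1
            seps[p] = max(j, seps[p - 1])
        seps[P] = N
        return pref[N] - pref[seps[P - 1]] <= B, seps

    SL, SH = [0] * (P + 1), [N] * (P + 1)    # initial separator windows
    LB = max(pref[-1] / P, max(w))           # B*: two standard lower bounds
    UB = LB + max(w)                         # always feasible for a greedy probe
    while UB - LB > eps:
        Bt = (LB + UB) / 2
        ok, seps = probe(Bt, SL, SH)
        if ok:
            UB, SH = Bt, seps                # tighten from above; shrink SH
        else:
            LB, SL = Bt, seps                # tighten from below; shrink SL
    return UB
```

The window updates are valid because the greedy separator positions are monotone nondecreasing in the probed value B.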


Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, BRB < B* + wmax. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log wmax, and thus the overall complexity is O(N + P log N + log(wmax)(P log P + P log(wmax/wavg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value (the total weight of a subchain of W). This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps to find the optimal bottleneck value.

After a probe RPROBE(Bt), the current upper bound value UB is modified if RPROBE(Bt) succeeds. Note that RPROBE(Bt) not only determines the feasibility of Bt, but also constructs a partition Pt with cost(Pt) ≤ Bt. Instead of reducing the upper bound UB to Bt, we can further reduce UB to the bottleneck value B = cost(Pt) ≤ Bt of the partition Pt constructed by RPROBE(Bt). Similarly, the current lower bound LB is modified when RPROBE(Bt) fails. In this case, instead of increasing the bound LB to Bt, we can exploit the partition Pt constructed by RPROBE(Bt) to increase LB further, to the smallest realizable bottleneck value B greater than Bt. Our bidding algorithm already describes how to compute

B = min{min_{1≤p<P}{L_p + w_{s_p+1}}, L_P},

where L_p denotes the load of processor P_p in Pt. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(wmax/wavg)),
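An illustrative rendering of this exact bisection follows (our Python sketch; the paper's Fig. 9 uses RPROBE with dynamic index bounding, which is omitted here for brevity).

```python
# Sketch of the exact-bisection refinement: after a successful probe, UB
# drops to the realizable cost of the constructed partition; after a failure,
# LB rises to the smallest realizable bottleneck value above Bt (the "bid"
# formula of the bidding algorithm). The loop ends when LB = UB, which is
# then the optimal bottleneck value; no epsilon tolerance is needed.
from itertools import accumulate
import bisect

def exact_bisect(w, P):
    N = len(w)
    pref = [0] + list(accumulate(w))

    def probe(B):
        seps = [0]
        for _ in range(P - 1):
            j = bisect.bisect_right(pref, pref[seps[-1]] + B) - 1
            seps.append(max(j, seps[-1]))
        seps.append(N)
        return seps

    def cost(seps):
        return max(pref[seps[p + 1]] - pref[seps[p]] for p in range(P))

    def next_realizable(seps, B):
        # min{ min_{p<P} (L_p + w_{s_p+1}), L_P } over the failed partition
        bids = [pref[seps[p]] - pref[seps[p - 1]] + w[seps[p]]
                for p in range(1, P) if seps[p] < N]
        bids.append(pref[N] - pref[seps[P - 1]])
        return min(b for b in bids if b > B)

    LB, UB = max(pref[-1] / P, max(w)), float(pref[-1])
    while LB < UB:
        Bt = (LB + UB) / 2
        seps = probe(Bt)
        c = cost(seps)
        if c <= Bt:
            UB = c                      # realizable upper bound
        else:
            LB = next_realizable(seps, Bt)
    return LB
```

Both bound updates land on subchain weights of W, so the search runs over the finite candidate set described above.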

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = Wtot), the successes and failures of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Pt = ⟨0, t_1, t_2, …, t_{P−1}, N⟩ to be constructed by any PROBE(Bt) with LB_1 < Bt = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < Bt < W_{1,i_1}. Hence, probes with LB_1 < Bt < UB_1 for processor P_2 can be restricted to be performed on W_{i_1,N}, where W_{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array.

This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let CCP^{P−p}_{i} denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T_{i,N} = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P_{p,P} = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P, and let C(·) denote the optimal bottleneck value of such a subproblem. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value Bopt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T_{i_1,N}. That is, Bopt = min{B_1 = W_{1,i_1}, C(CCP^{P−1}_{i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, then Bopt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(CCP^{P−p}_{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i_{b−1}, i_b is found in the inner while-loop by conducting probes on W_{i_{b−1},N} to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways, according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1}+1,j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + wmax(P log P + P log(wmax/wavg))) for integer task weights, because the number of probes cannot exceed wmax, since there are at most wmax distinct bound values in the range [B*, BRB]. For noninteger task weights, the complexity can be given as O(N + P log N + wmax(P log P)² + wmax P² log P log(wmax/wavg)), since the algorithm makes O(wmax P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.
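The divide-and-conquer structure can be sketched as follows, in illustrative Python with plain probes and O(1) subchain weights from the prefix-sum array; the SL/SH restrictions on both the probes and the binary-search ranges are omitted for clarity.

```python
# Illustrative sketch of the divide-and-conquer idea behind Nicol's
# algorithm. For each processor b in turn, the smallest chain end whose
# prefix weight is a feasible bottleneck value for the remaining problem is
# found by binary search; the optimum is the minimum over these candidate
# prefix weights and the final suffix weight.
from itertools import accumulate
import bisect

def nicol(w, P):
    N = len(w)
    pref = [0] + list(accumulate(w))

    def feasible(B, s0, parts):
        """Greedy probe: can tasks s0+1..N be covered by `parts` parts <= B?"""
        s = s0
        for _ in range(parts):
            s = max(bisect.bisect_right(pref, pref[s] + B) - 1, s)
        return s >= N

    best = float("inf")
    start = 0                               # current suffix begins at task start+1
    for b in range(1, P):
        lo, hi = start, N                   # smallest j with W(start+1, j) feasible
        while lo < hi:
            mid = (lo + hi) // 2
            if feasible(pref[mid] - pref[start], start, P - b + 1):
                hi = mid
            else:
                lo = mid + 1
        best = min(best, pref[lo] - pref[start])
        start = max(lo - 1, start)          # remaining processors take the suffix
    return min(best, pref[N] - pref[start])  # last processor's chain weight
```

The suffix for the next processor starts one task before the found index, matching the recursion Bopt = min{B_b, C(suffix subproblem)} above.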

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix, and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A^r_p of A and is responsible for computing y_p = A^r_p x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A^c_p of A and is responsible for computing y^p = A^c_p x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme, to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row and column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements, due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = Wtot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{i,j} can be efficiently computed as W_{i,j} = ROW[j + 1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS with rows and columns interchanged; thus, W_{i,j} is computed as W_{i,j} = COL[j + 1] − COL[i].
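For instance, with 1-based row indices mapped onto a 0-based Python list for the row-pointer array, the subchain weight lookup is a single subtraction (the toy matrix below is a hypothetical example, not from the paper's datasets):

```python
# O(1) subchain weight from the CRS row-pointer array: with 1-based rows,
# W_{i,j} = ROW[j + 1] - ROW[i]; in this 0-based Python rendering the same
# lookup reads ROW[j] - ROW[i - 1]. No prefix sum over row lengths is formed.

def subchain_weight(ROW, i, j):
    """Number of nonzeros in rows i..j (1-based, inclusive)."""
    return ROW[j] - ROW[i - 1]

# Row-pointer array of a toy 4 x 4 matrix whose rows hold 2, 1, 3, 2 nonzeros.
ROW = [0, 2, 3, 6, 8]
```

The CCS case is identical with the column-pointer array in place of ROW.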

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this problem for a coarse-grain formulation [2,3,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{i,j} as W_{i,j} = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.
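Under the CS scheme, the extended weight lookup stays O(1); a sketch using the same toy column-pointer array as before (illustrative, not the paper's code):

```python
# Extended CS subchain weight: nonzeros of columns i..j plus a cost of 3 per
# column for the three DAXPY updates of each vector entry. With 1-based
# columns, W_{i,j} = COL[j + 1] - COL[i] + 3(j - i + 1); 0-based list below.

def cs_subchain_weight(COL, i, j):
    """Load of columns i..j (1-based): nonzeros plus 3 per column."""
    return COL[j] - COL[i - 1] + 3 * (j - i + 1)
```

For a toy column pointer [0, 2, 3, 6, 8], columns 1–2 cost 3 nonzeros plus 2 × 3 = 6 vector updates, i.e., a load of 9.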

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widelyused in rendering unstructured volumetric grids forvisualization and interpretation of computer simulationsperformed for investigating physical phenomena invarious fields of science and engineering A DVRapplication contains two interacting domains objectspace and image space Object space is a 3D domaincontaining the volume data to be visualized Imagespace (screen) is a 2D domain containing pixels fromwhich rays are shot into the 3D object domain todetermine the color values of the respective pixels Basedon these domains there are basically two approaches fordata parallel DVR image- and object-space parallelismwhich are also called as sort-first and sort-last paralle-lism according to the taxonomy based on the point ofdata redistribution in the rendering pipeline [25] Pixelsor pixel blocks constitute the atomic tasks in sort-first


parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed

into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number

of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedrons. Triangular faces of tetrahedrons constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256x256, 512x512, and 1024x1024 superimposed on a screen of resolution 1024x1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because of the zero-weight tasks, which are compressed to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP

algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N - P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2
Properties of the test set

Sparse-matrix dataset (Wtot = number of nonzeros; wavg, wmin, wmax per row/col (task)):

Name     No. of tasks N   Total Wtot   wavg    wmin   wmax   SpMxV ex. time (ms)
NL             7039         105,089    14.93     1     361      22.55
cre-d          8926         372,266    41.71     1     845      72.20
CQ9            9278         221,590    23.88     1     702      45.90
ken-11        14,694         82,454     5.61     2     243      19.65
mod2          34,774        604,910    17.40     1     941     124.05
world         34,506        582,064    16.87     1     972     119.45

Direct volume rendering (DVR) dataset (wavg, wmin, wmax per task):

Name        No. of tasks N   Total Wtot   wavg    wmin    wmax
blunt256        17,303         303 K      17.54   0.020   1590.97
blunt512        93,231         314 K       3.36   0.004    661.50
blunt1024      372,824         352 K       0.94   0.001    411.04
post256         19,653         495 K      25.19   0.077   3245.50
post512        134,950         569 K       4.22   0.015   1092.00
post1024       539,994         802 K       1.49   0.004   1546.78


Figs. 5(a) and (b), respectively. Abbreviations ending with ''+'' are used to represent our improved versions of these algorithms; that is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms for the SpM dataset, with ε = 1 for the integer-valued workload arrays in the SpM dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics and exact algorithms. In this table, percent load-imbalance values are computed as 100 x (B - B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of optimal partitions produced by exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e., sum_{p=1..P-1} ΔSp / N, where ΔSp = SHp - SLp + 1. For each CCP instance, (N - P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N - P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index-bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P <= 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms

depends on two factors: the number of probes and the cost of each probe. The dynamic index-bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. The column εBS+&BID refers to this scheme, and values after ''+'' denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using BRB instead of Wtot


Table 3
Percent load-imbalance values

Sparse-matrix dataset                     DVR dataset
Name    P     H1     H2     RB     OPT    Name       P     H1     H2     RB     OPT
NL      16     2.60   2.44   1.20   0.35  blunt256   16     4.60   1.20   0.49   0.34
        32     5.02   5.75   3.44   0.95             32     6.93   2.61   1.94   1.12
        64     8.95   9.01   5.60   2.37             64    14.52   9.44   9.44   2.31
        128   33.13  27.16  22.78   4.99             128   38.25  24.39  16.67   4.82
        256   69.55  69.55  60.78  14.25             256   96.72  37.03  34.21  34.21
cre-d   16     2.27   0.98   0.53   0.45  blunt512   16     0.95   0.98   0.98   0.16
        32     4.19   4.42   3.74   1.03             32     1.38   1.38   1.18   0.33
        64     7.12   4.92   4.34   1.73             64     2.87   2.69   1.66   0.53
        128   25.57  18.73  16.70   2.88             128    5.62   8.45   4.62   0.97
        256   37.54  26.81  35.20  10.95             256   14.18  14.33   9.34   2.28
CQ9     16     1.85   1.85   0.58   0.58  blunt1024  16     0.94   0.57   0.95   0.10
        32     5.65   2.88   2.24   0.90             32     1.89   1.21   0.97   0.14
        64    13.25  11.49   7.64   1.43             64     4.99   2.16   1.44   0.26
        128   33.96  32.22  22.34   3.51             128   10.06   4.25   2.47   0.57
        256   58.62  58.62  58.62  14.72             256   19.65   9.68   9.68   0.98
ken-11  16     3.74   2.01   0.98   0.21  post256    16     1.10   1.43   0.76   0.56
        32     3.74   4.67   3.74   1.18             32     3.23   3.98   3.23   1.11
        64    13.17  13.17  13.17   1.29             64    17.28  11.04  10.90   3.10
        128   13.17  16.89  13.17   6.80             128   45.35  29.09  29.09   8.29
        256   50.99  50.99  50.99   7.11             256   67.86  67.86  67.86  67.86
mod2    16     0.06   0.06   0.06   0.03  post512    16     0.49   1.25   0.33   0.33
        32     0.19   0.19   0.19   0.07             32     0.94   1.61   0.90   0.58
        64     7.42   2.72   2.18   0.18             64     4.85   5.33   4.85   0.94
        128   16.15   6.29   2.46   0.41             128   18.03  14.55  10.15   1.72
        256   19.47  19.47  18.92   1.23             256   30.03  25.29  25.29   3.73
world   16     0.27   0.18   0.09   0.04  post1024   16     0.70   0.70   0.53   0.20
        32     1.03   0.37   0.27   0.08             32     1.49   1.49   1.41   0.54
        64     4.73   4.73   4.73   0.28             64     2.85   2.85   1.49   0.91
        128    6.37   6.37   6.37   0.76             128   13.15  11.79   9.10   1.11
        256   27.99  27.41  27.41   1.11             256   40.50  13.37  14.19   2.54

Averages over P
        16     1.76   1.15   0.74   0.36             16     1.44   1.01   1.04   0.28
        32     3.99   3.39   2.88   0.76             32     2.64   2.05   1.61   0.64
        64     9.01   6.78   5.69   1.43             64     7.89   5.58   4.96   1.31
        128   19.97  17.66  14.09   3.36             128   21.74  15.42  12.02   2.91
        256   43.89  38.91  36.04   9.18             256   44.82  27.93  26.76  18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in msecs) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification so that improvements can be seen easily. In both tables, BID is listed separately, since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as


Table 4
Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset                              DVR dataset
Name    P    (N-P)P    L1      L3      C5          Name       P    (N-P)P    L1       L3      C5
NL      16    15.97    0.32    0.13    0.036       blunt256   16    15.99     0.38    0.07    0.003
        32    31.86    1.23    0.58    0.198                  32    31.94     1.22    0.30    0.144
        64    63.43    4.97    2.24    0.964                  64    63.77     5.10    3.81    0.791
        128  125.69   19.81   19.60    4.305                  128  127.06    21.66   12.88    2.784
        256  246.73   77.96   82.83   22.942                  256  252.23   101.68   50.48   10.098
cre-d   16    15.97    0.16    0.04    0.019       blunt512   16    16.00     0.09    0.17    0.006
        32    31.89    0.76    0.90    0.137                  32    31.99     0.35    0.26    0.043
        64    63.55    2.89    1.85    0.610                  64    63.96     1.64    0.69    0.144
        128  126.18   11.59   15.12    2.833                  128  127.83     6.65    4.59    0.571
        256  248.69   46.66   57.54   16.510                  256  255.30    28.05   16.71    2.262
CQ9     16    15.97    0.30    0.03    0.023       blunt1024  16    16.00     0.05    0.09    0.010
        32    31.89    1.21    0.34    0.187                  32    32.00     0.18    0.18    0.018
        64    63.57    4.84    2.96    0.737                  64    63.99     0.77    0.74    0.068
        128  126.25   19.20   18.66    3.121                  128  127.96     3.43    2.53    0.300
        256  248.96   74.84   86.80   23.432                  256  255.82    13.92   20.36    1.648
ken-11  16    15.98    0.28    0.11    0.024       post256    16    15.99     0.35    0.04    0.021
        32    31.93    1.10    1.13    0.262                  32    31.95     1.50    0.52    0.162
        64    63.73    4.42    7.34    0.655                  64    63.79     6.12    4.26    0.932
        128  126.89   17.60   14.56    4.865                  128  127.17    26.48   23.68    4.279
        256  251.56   69.18   85.44   13.076                  256  252.68   124.76   90.48   16.485
mod2    16    15.99    0.16    0.00    0.002       post512    16    16.00     0.13    0.02    0.003
        32    31.97    0.55    0.04    0.013                  32    31.99     0.56    0.14    0.063
        64    63.88    2.14    1.22    0.109                  64    63.97     2.33    2.41    0.413
        128  127.53    8.62    2.59    0.411                  128  127.88     9.33   10.15    1.714
        256  254.12   34.49   39.02    2.938                  256  255.52    37.08   46.28    6.515
world   16    15.99    0.15    0.01    0.004       post1024   16    16.00     0.17    0.07    0.020
        32    31.97    0.59    0.06    0.020                  32    32.00     0.54    0.27    0.087
        64    63.88    2.30    2.81    0.158                  64    63.99     2.19    0.61    0.210
        128  127.53    9.23    7.19    0.850                  128  127.97     8.77    9.48    0.918
        256  254.11   36.97   53.15    2.635                  256  255.88    34.97   27.28    4.479

Averages over P
        16    15.97    0.21    0.06    0.018                  16    16.00     0.20    0.08    0.011
        32    31.88    0.86    0.68    0.136                  32    31.98     0.73    0.28    0.096
        64    63.52    3.43    2.63    0.516                  64    63.91     3.03    2.09    0.426
        128  126.07   13.68   12.74    2.731                  128  127.65    12.72   10.55    1.428
        256  248.26   54.28   57.55   14.089                  256  254.57    56.74   41.93    6.915

L1, L3, and C5 denote separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5.


an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset                              DVR dataset
Name    P     NC-    NC    NC+   εBS   εBS+  EBS   Name       P     NC-    NC   NC+   εBS+&BID   EBS
NL      16     177    21     7    17     6    7    blunt256   16     202    18     6    16+1       9
        32     352    19     7    17     7    7               32     407    19     7    19+1      10
        64     720    27     5    16     7    6               64     835    18    10    21+1      12
        128   1450    40    14    17     8    8               128   1686    25    15    22+1      13
        256   2824    51    14    16     8    8               256   3556   203    62    23+1      11
cre-d   16     190    20     6    19     7    5    blunt512   16     245    20     8    17+1       9
        32     393    25    11    19     9    8               32     502    24    13    18+1      10
        64     790    24    10    19     8    6               64    1003    23    11    18+1      12
        128   1579    27    10    19     9    9               128   2023    26    12    20+1      13
        256   3183    37    13    18     9    9               256   4051    23    12    21+1      13
CQ9     16     182    19     3    18     6    5    blunt1024  16     271    21    10    16+1      13
        32     373    22     8    18     8    7               32     557    24    12    18+1      12
        64     740    29    12    18     8    8               64    1127    21    11    18+1      12
        128   1492    40    12    17     8    8               128   2276    28    13    19+1      13
        256   2971    50    14    18     9    9               256   4564    27    16    21+1      15
ken-11  16     185    21     6    17     6    6    post256    16     194    22     9    18+1       7
        32     364    20     6    17     6    6               32     409    19     8    20+1      12
        64     721    48     9    17     7    7               64     823    17    11    22+1      12
        128   1402    61     8    16     7    7               128   1653    19    13    22+1      14
        256   2783    96     7    17     8    8               256   3345   191   102    23+1      13
mod2    16     210    20     6    20     5    4    post512    16     235    24     9    16+1      10
        32     432    26     8    19     5    5               32     485    23    12    18+1      11
        64     867    24     6    19     8    8               64     975    22    13    20+1      14
        128   1727    38     7    20     7    7               128   1975    25    17    21+1      14
        256   3444    47     9    20     9    9               256   3947    29    20    22+1      15
world   16     211    22     6    19     5    4    post1024   16     261    27    10    17+1      13
        32     424    26     5    19     6    6               32     538    23    13    19+1      14
        64     865    30    12    19     9    9               64    1090    22    12    17+1      14
        128   1730    44    10    19     8    8               128   2201    25    15    21+1      16
        256   3441    44    11    19    10   10               256   4428    28    16    22+1      16

Averages over P
        16     193  20.5   5.7  18.5   5.8  5.2               16     235  22.0   8.7   16.7+1.0  10.2
        32     390  23.0   7.5  18.3   6.8  6.5               32     483  22.0  10.8   18.7+1.0  11.5
        64     784  30.3   9.0  18.0   7.8  7.3               64     976  20.5  11.3   19.3+1.0  12.7
        128   1564  41.7  10.2  18.0   7.8  7.8               128   1969  24.7  14.2   20.8+1.0  13.8
        256   3108  54.2  11.3  18.0   8.8  8.8               256   3982  83.5  38.0   22.0+1.0  13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing number of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6
Partitioning times for the sparse-matrix dataset, as percents of SpMxV times
(DP, MS, εBS, NC: existing exact algorithms; DP+, MS+, εBS+, NC+, BID: proposed)

Name    P    H1     H2     RB      DP     MS     εBS      NC       DP+      MS+     εBS+    NC+     BID
NL      16   0.09   0.09   0.09     93    119    0.93     1.20     0.40     0.44    0.40    0.40    0.22
        32   0.18   0.18   0.13    177    194    1.77     2.17     1.73     1.73    0.80    0.80    0.44
        64   0.35   0.35   0.31    367    307    3.15     5.85     5.81     4.12    1.86    1.77    1.82
        128  0.89   0.84   0.71    748    485    6.30    17.12    19.87    26.92    4.26    7.32    4.21
        256  1.51   1.55   1.37   1461    757   11.09    40.31    91.80    96.41    9.80   15.70   21.06
cre-d   16   0.03   0.01   0.01     33     36    0.33     0.37     0.12     0.06    0.10    0.11    0.03
        32   0.04   0.06   0.06     71     61    0.65     0.93     0.47     1.20    0.29    0.33    0.10
        64   0.12   0.12   0.08    147     98    1.22     1.69     1.66     1.83    0.61    0.71    0.21
        128  0.28   0.29   0.14    283    150    2.41     3.59     5.47    10.94    1.68    1.68    0.61
        256  0.53   0.54   0.25    571    221    4.20    10.22    27.05    26.02    3.30    4.46    1.20
CQ9     16   0.04   0.04   0.04     59     73    0.48     0.59     0.20     0.13    0.15    0.13    0.17
        32   0.09   0.09   0.07    127    120    0.94     1.31     0.94     0.72    0.41    0.41    0.28
        64   0.20   0.17   0.15    248    195    1.85     3.18     3.01     4.47    1.00    1.44    0.68
        128  0.44   0.44   0.31    485    303    3.18     8.52     9.65    18.82    2.44    3.05    2.51
        256  0.92   0.92   0.63    982    469    6.51    20.33    69.56    72.20    5.32    7.45   14.92
ken-11  16   0.10   0.10   0.10    238    344    1.27     1.88     0.56     0.61    0.46    0.41    0.25
        32   0.20   0.20   0.15    501    522    2.29     3.10     3.82     4.78    1.22    1.17    1.63
        64   0.46   0.46   0.36    993    778    4.48    13.08     9.41    24.78    3.10    3.46    2.29
        128  1.17   1.17   0.71   1876   1139    9.21    39.59    50.13    50.03    6.41    6.62   17.40
        256  2.14   2.14   1.42   3827   1706   17.76   119.13   129.01   241.48   14.91   14.50   29.11
mod2    16   0.02   0.02   0.01     92    119    0.27     0.34     0.07     0.05    0.06    0.06    0.03
        32   0.04   0.04   0.02    188    192    0.51     0.78     0.19     0.18    0.16    0.15    0.07
        64   0.11   0.10   0.06    378    307    1.04     1.91     0.86     2.18    0.51    0.55    0.23
        128  0.24   0.23   0.11    751    482    2.28     5.92     2.61     3.72    1.06    1.23    0.53
        256  0.48   0.48   0.23   1486    756    4.26    11.14    12.11    50.04    2.91    3.43    3.66
world   16   0.02   0.02   0.02     99    122    0.29     0.33     0.08     0.05    0.07    0.08    0.03
        32   0.04   0.04   0.04    198    196    0.59     0.88     0.26     0.21    0.18    0.19    0.08
        64   0.12   0.11   0.09    385    306    1.16     1.70     1.09     4.84    0.64    0.54    0.35
        128  0.24   0.23   0.19    769    496    2.19     5.32     4.48    10.15    1.31    1.23    1.77
        256  0.47   0.47   0.36   1513    764    4.06    11.83    11.54    69.99    3.46    3.50    2.48

Averages over P
        16   0.05   0.05   0.05    102    136    0.60     0.78     0.24     0.22    0.21    0.20    0.12
        32   0.10   0.10   0.08    210    214    1.12     1.53     1.23     1.47    0.51    0.51    0.43
        64   0.23   0.22   0.18    420    332    2.15     4.57     3.64     7.04    1.29    1.41    0.93
        128  0.54   0.53   0.36    819    509    4.26    13.34    15.37    20.10    2.86    3.52    4.51
        256  1.01   1.01   0.71   1640    779    7.98    35.49    56.84    92.69    6.62    8.17   12.07


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small-to-moderate P values (i.e., P <= 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P >= 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P <= 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in msecs) for the DVR dataset
(the prefix-sum time is independent of P and is listed once per CCP instance;
DP, MS, NC: existing exact algorithms; DP+, MS+, NC+, EBS, BID: proposed)

Name        P    Prefix  H1     H2     RB      DP      MS      NC      DP+      MS+      NC+     EBS     BID
blunt256    16           0.02   0.02   0.03      68      49    0.36     0.21     0.53     0.10    0.14    0.03
            32           0.05   0.05   0.05     141      77    0.77     0.76     0.76     0.21    0.27    0.24
            64   1.95    0.11   0.12   0.09     286     134    1.39     3.86     8.72     0.64    0.71    0.78
            128          0.24   0.25   0.20     581     206    3.66    12.37    29.95     1.78    1.84    2.33
            256          0.47   0.50   0.37    1139     296   52.53    42.89     0.78    13.96    3.11   56.69
blunt512    16           0.03   0.03   0.03     356     200    0.55     0.25     0.91     0.14    0.16    0.09
            32           0.07   0.07   0.07     792     353    1.31     1.23     2.13     0.44    0.43    0.22
            64   13.45   0.18   0.19   0.14    1688     593    2.65     3.68     9.28     0.94    1.10    0.45
            128          0.39   0.40   0.28    3469     979    5.84    15.01    46.72     2.29    2.63    1.65
            256          0.74   0.75   0.54    7040    1637    9.70    57.50   164.49     5.59    5.95   10.72
blunt1024   16           0.03   0.03   0.03    1455     780    0.75     0.93     4.77     0.25    0.28    0.19
            32           0.12   0.11   0.08    3251    1432    1.81     2.02     8.92     0.60    0.62    0.31
            64   59.12   0.27   0.27   0.19    6976    2353    3.50     7.35    22.28     1.27    1.37    1.39
            128          0.50   0.51   0.37   14,337   3911    9.14    33.01    69.38     3.36    3.56   16.21
            256          0.97   1.00   0.68   29,417   6567   16.59   180.09   538.10     8.48    8.98   16.29
post256     16           0.02   0.03   0.03      79      61    0.46     0.18     0.13     0.10    0.11    0.24
            32           0.05   0.05   0.05     157     100    0.77     1.05     1.74     0.26    0.31    0.76
            64   2.16    0.10   0.11   0.10     336     167    1.37     5.26     9.72     0.61    0.70    4.67
            128          0.25   0.26   0.18     672     278    2.90    22.94    55.10     1.65    1.77   23.96
            256          0.47   0.48   0.37    1371     338   45.16    84.78     0.77    13.04    3.62  299.32
post512     16           0.02   0.02   0.03     612     454    0.63     0.20     0.44     0.14    0.16    1.21
            32           0.07   0.07   0.07    1254     699    1.30     2.49     1.95     0.38    0.42    3.36
            64   20.55   0.18   0.18   0.13    2559    1139    2.58    16.93    38.73     0.97    1.07    8.53
            128          0.38   0.39   0.28    5221    1936    5.84    68.25   156.15     2.27    2.44   31.54
            256          0.70   0.73   0.53   10,683   3354   12.67   256.48   726.54     5.44    5.39  140.88
post1024    16           0.03   0.04   0.03    2446    1863    0.91     2.88     5.73     0.17    0.21    2.63
            32           0.09   0.08   0.08    5056    2917    1.61    13.57    17.37     0.51    0.58   12.73
            64   69.09   0.25   0.26   0.17   10,340   4626    3.56    35.05    33.25     1.35    1.47   32.53
            128          0.51   0.53   0.36   21,102   7838    7.90   156.34   546.92     2.95    3.28   76.96
            256          0.95   0.98   0.67   43,652  13,770  16.30   764.59  1451.87     7.55    7.02  300.70

Averages normalized with respect to RB times (prefix-sum time not included)
            16           0.83   0.94   1.00   27,863  18,919  20.33    25.83    69.50     5.00    5.89   24.39
            32           1.10   1.06   1.00   23,169  12,156  18.47    47.37    72.82     5.83    6.46   39.02
            64           1.30   1.35   1.00   22,636   9291   17.88    82.81   145.19     7.00    7.81   53.81
            128          1.35   1.39   1.00   22,506   7553   20.46   168.36   481.19     8.60    9.31   86.81
            256          1.35   1.39   1.00   24,731   6880   59.10   390.25   772.99    19.55   10.51  286.77

Averages normalized with respect to RB times (prefix-sum time included)
            16           1.00   1.00   1.00      32      23    1.08     1.04     1.09     1.01    1.02    1.03
            32           1.00   1.00   1.00      66      36    1.15     1.21     1.29     1.04    1.05    1.13
            64           1.00   1.00   1.00     135      58    1.27     1.97     2.90     1.11    1.12    1.55
            128          1.01   1.01   1.00     275      94    1.62     4.75    10.53     1.28    1.30    3.25
            256          1.02   1.02   1.00     528     142    7.98    14.64    13.71     2.95    1.55   26.73


for P <= 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a pre-processing step to find a ''good'' upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds into realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S Anily A Federgruen Structured partitioning problems Oper

Res 13 (1991) 130ndash149

[2] C Aykanat F Ozguner Large grain parallel conjugate gradient

algorithms on a hypercube multiprocessor in Proceedings of the

1987 International Conference on Parallel Processing Penn State

University University Park PA 1987 pp 641ndash645

[3] C Aykanat F Ozguner F Ercal P Sadayappan Iterative

algorithms for solution of large sparse systems of linear equations

on hypercubes IEEE Trans Comput 37 (12) (1988) 1554ndash1567

[4] SH Bokhari Partitioning problems in parallel pipelined and

distributed computing IEEE Trans Comput 37 (1) (1988) 48ndash57

[5] UV -Catalyurek C Aykanat Hypergraph-partitioning-based

decomposition for parallel sparse-matrix vector multiplication

IEEE Trans Parallel Distributed Systems 10 (7) (1999) 673ndash693

[6] H Choi B Narahari Algorithms for mapping and partitioning

chain structured parallel computations in Proceedings of the

1991 International Conference on Parallel Processing Austin

TX 1991 pp I-625ndashI-628

[7] GN Frederickson Optimal algorithms for partitioning trees and

locating p-centers in trees Technical Report CSD-TR-1029

Purdue University (1990)

[8] GN Frederickson Optimal algorithms for partitioning trees in

Proceedings of the Second ACM-SIAM Symposium Discrete

Algorithms Orlando FL 1991

[9] MR Garey DS Johnson Computers and Intractability A

Guide to the Theory of NP-Completeness WH Freeman San

Fransisco CA 1979

[10] MR Garey DS Johnson L Stockmeyer Some simplified np-

complete graph problems Theoret Comput Sci 1 (1976) 237ndash267

[11] DM Gay Electronic mail distribution of linear programming test

problems Mathematical Programming Society COAL Newsletter

[12] Y Han B Narahari H-A Choi Mapping a chain task to

chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined, and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State Univ., University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.
A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


Fig. 7. Bidding algorithm.


partition constraint, but it leads to very poor run-time performance. Here, we propose using the partition generated by PROBE(B*) as an initial partition. This partition is also a leftist partition, since moving any separator to the left cannot decrease the load of the bottleneck processor. This simple observation leads to a significant improvement in the run-time performance of the algorithm. Also, using a heap as a priority queue does not give better run-time performance than using a running maximum, despite its superior asymptotic complexity. In our implementation, we use a running maximum.
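For reference, the PROBE function invoked above admits a compact implementation over the prefix-summed workload array W (with W[0] = 0 and W[i] = w1 + … + wi). The sketch below is ours, not the paper's pseudocode; the array and function names are illustrative:

```c
#include <assert.h>

/* Greedy feasibility probe: returns 1 iff the task chain can be split
 * into P consecutive parts, each with load at most B.  Each processor
 * greedily takes the longest prefix that fits, located by binary search
 * on the prefix-summed array W. */
int probe(const double *W, int N, int P, double B)
{
    int lo = 0;                        /* last separator placed so far */
    for (int p = 1; p < P; p++) {
        int l = lo, h = N;
        while (l < h) {                /* largest s with W[s] - W[lo] <= B */
            int m = (l + h + 1) / 2;
            if (W[m] - W[lo] <= B) l = m; else h = m - 1;
        }
        lo = l;
        if (lo == N) return 1;         /* all tasks placed early */
    }
    return W[N] - W[lo] <= B;          /* last processor takes the rest */
}
```

For the weights (2, 1, 3, 2, 1) and P = 3, the ideal bottleneck value B* = 3 is feasible, while any smaller value is not.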

4.3.2. Bidding algorithm

This algorithm increases the bottleneck value gradually, starting from the ideal bottleneck value B*, until it finds a feasible partition, which is also optimal. Consider a partition Πt = ⟨s0, s1, …, sP⟩ constructed by PROBE(Bt) for an infeasible Bt. After detecting the infeasibility of this Bt value, the point is to determine the next larger bottleneck value B to be investigated. Clearly, the separator indices of the partitions to be constructed by future PROBE(B) calls with B > Bt will never be to the left of the respective separator indices of Πt. Moreover, at least one of the separators should move right for feasibility, since the load of the last processor determines the infeasibility of the current Bt value (i.e., LP > Bt). To avoid missing the smallest feasible bottleneck value, the next larger B value is selected as the minimum of the processor loads that would be obtained by moving the end-index of every processor to the right by one position. That is, the next larger B value is equal to min{min_{1≤p<P} {Lp + w_{sp+1}}, LP}. Here, we call the value Lp + w_{sp+1} the bid of processor Pp, which refers to the load of Pp if the first task t_{sp+1} of the next processor is augmented to Pp. The bid of the last processor PP is equal to the load of the remaining tasks. If the smallest bid B comes from processor Pb, probing with the new B is performed only for the remaining processors ⟨Pb, Pb+1, …, PP⟩ in the suffix W[s_{b−1}+1 … N] of the W array.

The bidding algorithm is presented in Fig. 7. The innermost while-loop implements a linear probing scheme, in which the new positions of the separators are determined by moving them to the right one by one. This linear probing scheme is selected because the new positions of the separators are likely to be in a close neighborhood of the previous ones. Note that binary search is used only for setting the separator indices for the first time. After the separator index sp is set for processor Pp during linear probing, the repeat-until loop terminates if it is not possible to partition the remaining subchain T[sp+1 … N] into P − p processors without exceeding the current B value, i.e., rbid = Lr/(P − p) > B, where Lr denotes the weight of the remaining subchain. In this case, the next larger B value is determined by considering the best bid among the first p processors and rbid.

As seen in Fig. 7, we maintain a prefix-minimum array BIDS for computing the next larger B value. Here, BIDS is an array of records of length P, where BIDS[p].B and BIDS[p].b store the best bid value of the first p processors and the index of the defining processor, respectively. BIDS[0] helps the correctness of the running prefix-minimum operation.

The complexity of the bidding algorithm for integer task weights under model M is O(N + P log N + P·wmax + P²(wmax/wavg)). Here, the O(N) cost comes from the initial prefix-sum operation on the W array, and the O(P log N) cost comes from the initial settings of the separators through binary search. The B value is increased at most Bopt − B* < wmax times, and each time the next B value can be computed in O(P) time, which induces the O(P·wmax) cost. The total area scanned by the separators is at most O(P²(wmax/wavg)). For noninteger task weights, the complexity can reach O(P log N + P³(wmax/wavg)) in the worst case, which occurs when only one separator index moves to the right by one position at each B value. We should note here that using a min-heap for finding the next B value enables terminating a repeat-loop iteration as soon as a separator index does not move. The trade-off in this scheme is the O(log P) cost incurred at each separator-index move, due to the respective key-update operations on the heap. We implemented this scheme as well, but observed increased execution times.
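The bid computation at the heart of the algorithm can be sketched as follows. This is a simplified fragment with our own names, not the paper's Fig. 7, which additionally maintains the prefix-minimum BIDS array and the linear probing loop:

```c
#include <assert.h>

/* Next candidate bottleneck value after an infeasible Bt:
 *   min( min_{1<=p<P} { L_p + w_{s_p + 1} },  L_P ).
 * W is the prefix-summed workload array (W[0] = 0); s[0..P] are the
 * separator indices of the partition built by the failed probe, with
 * s[0] = 0 and s[P] = N.  Since the probe failed, s[p] < N for p < P,
 * so W[s[p] + 1] is always a valid access. */
double next_bottleneck(const double *W, const int *s, int P, int N)
{
    double best = W[N] - W[s[P - 1]];        /* bid of the last processor */
    for (int p = 1; p < P; p++) {
        /* L_p + w_{s_p+1} telescopes to W[s[p] + 1] - W[s[p-1]] */
        double bid = W[s[p] + 1] - W[s[p - 1]];
        if (bid < best) best = bid;
    }
    return best;
}
```

For the weights (2, 1, 3, 2, 1) with P = 3, the infeasible value Bt = 2.9 yields the partition ⟨0, 1, 2, 5⟩, and the smallest bid is 3, which is exactly the optimal bottleneck value.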


4.4. Parametric search algorithms

In this work, we apply the theoretical findings given in Section 4.1 for an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with Bf = BRB) to restrict the search space for the sp separator values during the binary searches on the W array. That is, BINSRCH(W, SLp, SHp, Bsum) in RPROBE searches W in the index range [SLp, SHp] to find the index SLp ≤ sp ≤ SHp such that W[sp] ≤ Bsum and W[sp + 1] > Bsum, via binary search. This scheme, together with Corollaries 4.2 and 4.6, reduces the complexity of an individual probe to Σ_{p=1}^{P} Θ(log Δp) = O(P log P + P log(wmax/wavg)), where Δp = SHp − SLp + 1 denotes the size of the index range for sp. Note that this complexity reduces to O(P log P) for sufficiently large P, i.e., when wmax/wavg = O(P). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.
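Under the stated assumptions (prefix-summed W, and valid precomputed bound arrays SL and SH with SL[p] ≤ sp ≤ SH[p]), the restricted probe can be sketched as follows; this is our code, not the paper's pseudocode:

```c
#include <assert.h>

/* Restricted probe (RPROBE) sketch: like PROBE, but the binary search
 * for each separator sp is confined to [SL[p], SH[p]], shrinking each
 * search from O(log N) to O(log(SH[p] - SL[p] + 1)).
 * W is prefix-summed (W[0] = 0); returns 1 iff bottleneck value B is
 * feasible.  SL[p] is additionally clipped to the previous separator
 * so that the separators stay nondecreasing. */
int rprobe(const double *W, int N, int P,
           const int *SL, const int *SH, double B)
{
    int prev = 0;
    double Bsum = B;                   /* budget: W[sp] must stay <= Bsum */
    for (int p = 1; p < P; p++) {
        int l = SL[p] > prev ? SL[p] : prev;
        int h = SH[p];
        while (l < h) {                /* largest sp in range, W[sp] <= Bsum */
            int m = (l + h + 1) / 2;
            if (W[m] <= Bsum) l = m; else h = m - 1;
        }
        prev = l;
        Bsum = W[prev] + B;            /* budget for the next separator */
    }
    return W[N] - W[prev] <= B;        /* last processor takes the rest */
}
```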

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, BRB], as opposed to [B*, Wtot]. In this algorithm, if PROBE(Bt) = TRUE, then the search space is restricted to values B ≤ Bt, and if PROBE(Bt) = FALSE, then the search space is restricted to values B > Bt. In this work, we exploit this simple observation to develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator-index bounds depending on the success or failure of the probes. Let Πt = ⟨t0, t1, …, tP⟩ be the partition constructed by PROBE(Bt). Any future PROBE(B) call with B ≤ Bt will set the sp indices with sp ≤ tp; thus, the search space for sp can be restricted to the indices ≤ tp. Similarly, any future PROBE(B) call with B ≥ Bt will set the sp indices ≥ tp; thus, the search space for sp can be restricted to the indices ≥ tp.

As illustrated in Fig. 8, the dynamic update of the separator-index bounds can be performed in Θ(P) time by a for-loop over the SL or SH array, depending on the failure or success, respectively, of RPROBE(Bt). In our implementation, however, this update is achieved in O(1) time through the pointer assignment SL ← Π or SH ← Π, depending on the failure or success of RPROBE(Bt).


Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, BRB < B* + wmax. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log wmax, and thus the overall complexity is O(N + P log N + log(wmax)(P log P + P log(wmax/wavg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to become an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value, i.e., the total weight of a subchain of W. This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps, finding the optimal bottleneck value.

After a probe RPROBE(Bt), the current upper bound value UB is modified if RPROBE(Bt) succeeds. Note that RPROBE(Bt) not only determines the feasibility of Bt, but also constructs a partition Πt with cost(Πt) ≤ Bt. Instead of reducing the upper bound UB to Bt, we can further reduce UB to the bottleneck value B = cost(Πt) ≤ Bt of the partition Πt constructed by RPROBE(Bt). Similarly, the current lower bound LB is modified when RPROBE(Bt) fails. In this case, instead of increasing the bound LB to Bt, we can exploit the partition Πt constructed by RPROBE(Bt) to increase LB further, to the smallest realizable bottleneck value B greater than Bt. Our bidding algorithm already describes how to compute

B = min{ min_{1≤p<P} {Lp + w_{sp+1}}, LP },

where Lp denotes the load of processor Pp in Πt. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of the algorithm is

T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(wmax/wavg)),

which has the solution O(P log P log N + P log N log(wmax/wavg)). Here, O(P log P + P log(wmax/wavg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(wmax/wavg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
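As a concrete illustration of the realizable-bound idea (not the paper's Fig. 9 pseudocode: the names are ours, and a linear scan replaces the restricted binary search for brevity), the bisection loop can be sketched as:

```c
#include <assert.h>

/* Greedy probe over the prefix-summed workload array W (W[0] = 0):
 * fills the separator array s[0..P] and returns 1 iff every part,
 * including the last, has load at most B. */
static int greedy_probe(const double *W, int N, int P, double B, int *s)
{
    s[0] = 0;
    for (int p = 1; p < P; p++) {
        int j = s[p - 1];
        while (j < N && W[j + 1] - W[s[p - 1]] <= B)
            j++;                           /* extend part p while it fits */
        s[p] = j;
    }
    s[P] = N;
    return W[N] - W[s[P - 1]] <= B;        /* last processor takes the rest */
}

/* Exact bisection: both bounds always move to realizable subchain
 * weights, so the loop terminates at the optimal bottleneck value. */
double exact_bisection(const double *W, int N, int P)
{
    int s[P + 1];                          /* C99 variable-length array */
    double LB = W[N] / P;                  /* ideal bottleneck value B* */
    double UB = W[N];                      /* trivially feasible        */
    while (LB < UB) {
        double Bt = (LB + UB) / 2.0;
        if (greedy_probe(W, N, P, Bt, s)) {
            double cost = 0.0;             /* UB <- cost of the partition */
            for (int p = 1; p <= P; p++)
                if (W[s[p]] - W[s[p - 1]] > cost)
                    cost = W[s[p]] - W[s[p - 1]];
            UB = cost;
        } else {                           /* LB <- smallest bid > Bt     */
            double bid = W[N] - W[s[P - 1]];
            for (int p = 1; p < P; p++)    /* probe failed, so s[p] < N   */
                if (W[s[p] + 1] - W[s[p - 1]] < bid)
                    bid = W[s[p] + 1] - W[s[p - 1]];
            LB = bid;
        }
    }
    return UB;
}
```

On the weights (2, 1, 3, 2, 1), the loop returns the optimal bottleneck values 3 for P = 3 and 6 for P = 2 after only a couple of probes each.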

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P1 to find the smallest index j = i1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB0 = 0, UB0 = Wtot), the success and failure of these probes can narrow this range to (LB1, UB1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB1 = W_{1,i1−1}, UB1 = W_{1,i1}) at the end of this search for processor P1. Now consider the sequence of probes of the form PROBE(W_{i1,j}) performed for processor P2 to find the smallest index j = i2 such that PROBE(W_{i1,j}) = TRUE. Our key observation is that the partition Πt = ⟨0, t1, t2, …, tP−1, N⟩ to be constructed by any PROBE(Bt) with LB1 < Bt = W_{i1,j} < UB1 will satisfy t1 = i1 − 1, since W_{1,i1−1} < Bt < W_{1,i1}. Hence, probes with LB1 < Bt < UB1 for processor P2 can be restricted to be performed in W[i1..N], where W[i1..N] denotes the (N − i1 + 1)th suffix of the prefix-summed W array.

This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T_i^{P−p} denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T[i..N] = ⟨ti, ti+1, …, tN⟩ of the task chain T onto the (P − p)th suffix P[p..P] = ⟨Pp+1, Pp+2, …, PP⟩ of the processor chain P. Once the index i1 for processor P1 is computed, the optimal bottleneck value Bopt can be defined by either W_{1,i1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T[i1..N]. That is, Bopt = min{B1 = W_{1,i1}, C(T_{i1}^{P−1})}. Proceeding in this way, once the indices ⟨i1, i2, …, ip⟩ for the first p processors ⟨P1, P2, …, Pp⟩ are determined, then Bopt = min{min_{1≤b≤p} {Bb = W_{i_{b−1},i_b}}, C(T_{i_p}^{P−p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i_{b−1}, i_b is found in the inner while-loop by conducting probes on W[i_{b−1}..N] to compute Bb = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways, according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j}, for i_{b−1} < j ≤ N, is restricted to the W_{i_{b−1}+1,j} values with SLb ≤ j ≤ SHb.

Under model M, the complexity of this algorithm is O(N + P log N + wmax(P log P + P log(wmax/wavg))) for integer task weights, because the number of probes cannot exceed wmax, since there are at most wmax distinct bound values in the range [B*, BRB]. For noninteger task weights, the complexity can be given as O(N + P log N + wmax(P log P)² + wmax P² log P log(wmax/wavg)), since the algorithm makes O(wmax P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider the parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix and y and x are N-vectors. In RS, processor Pp owns the pth row stripe A_p^r of A and is responsible for computing y_p = A_p^r x, where y_p is the pth stripe of vector y. In CS, processor Pp owns the pth column stripe A_p^c of A and is responsible for computing y^p = A_p^c x_p, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme, to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task ti ∈ T corresponds to the


atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task ti ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + xi a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load wi of task ti is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements, due to the irregular mapping table, and to an extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so

that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = Wtot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight Wij can be efficiently computed as Wij = ROW[j + 1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS, with rows and columns interchanged; thus, Wij is computed as Wij = COL[j + 1] − COL[i].
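With 0-based rows in C (a hypothetical layout consistent with the description above: ROW[i] holds the start of row i, ROW[0] = 0, ROW[N] = NZ), the constant-time subchain-weight query is a single subtraction:

```c
#include <assert.h>

/* Number of nonzeros in the consecutive row block i..j (inclusive,
 * 0-based), read directly off the CRS row-pointer array.  No prefix-sum
 * preprocessing is needed: the row pointers already are a prefix sum
 * of the row nonzero counts. */
long subchain_weight(const long *ROW, int i, int j)
{
    return ROW[j + 1] - ROW[i];
}
```

The same query against the CCS column-pointer array gives subchain weights for columnwise striping.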

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this problem for a coarse-grain formulation [2,3,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas the inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight wi of the atomic task ti representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight Wij as Wij = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.
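The augmented subchain weight for the CS scheme can be sketched the same way (our names; COLPTR stands for the CCS column-pointer array described above):

```c
#include <assert.h>

/* Subchain weight for columnwise striping with the three DAXPYs folded
 * in: nonzeros of columns i..j (inclusive, 0-based) plus a cost of 3
 * per column for the vector operations.  COLPTR has length N+1 with
 * COLPTR[0] = 0 and COLPTR[N] = NZ. */
long subchain_weight_cs(const long *COLPTR, int i, int j)
{
    return COLPTR[j + 1] - COLPTR[i] + 3L * (j - i + 1);
}
```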

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (the screen) is a 2D domain containing the pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, the screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids, due to the uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those incurred by counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array, because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.
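The coarse-mesh traversal can use the standard bit-manipulation construction for converting cell coordinates to positions on the Hilbert curve. The sketch below is our code (for a 2^k × 2^k mesh, one common curve orientation), not the paper's implementation:

```c
#include <assert.h>

/* Rotate/flip a quadrant so that lower levels are traversed in the
 * correct orientation (standard Hilbert-index construction). */
static void rot(int n, int *x, int *y, int rx, int ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x; *x = *y; *y = t;   /* swap x and y */
    }
}

/* Map cell (x, y) of an n x n mesh (n a power of two) to its position
 * d along the Hilbert curve, 0 <= d < n*n.  Consecutive d values are
 * adjacent cells on screen, which is what keeps region perimeters
 * small after 1D chain partitioning. */
int xy2d(int n, int x, int y)
{
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0;
        int ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        rot(n, &x, &y, rx, ry);
    }
    return d;
}
```

Once the cells are linearized this way, their real-valued inverse-area weights form the 1D workload array on which any of the CCP algorithms above can run.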

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. The triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed, for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 × 256, 512 × 512, and 1024 × 1024, superimposed on a screen of resolution 1024 × 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution, because the zero-weight tasks can be compressed to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC− and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2
Properties of the test set.

Sparse-matrix (SpM) dataset; workload is the number of nonzeros, with per-task weights per row/column:

Name     No. of tasks N   Total Wtot   wavg    wmin   wmax   SpMxV time (ms)
NL       7039             105 089      14.93   1      361    22.55
cre-d    8926             372 266      41.71   1      845    72.20
CQ9      9278             221 590      23.88   1      702    45.90
ken-11   14 694           82 454       5.61    2      243    19.65
mod2     34 774           604 910      17.40   1      941    124.05
world    34 506           582 064      16.87   1      972    119.45

Direct volume rendering (DVR) dataset; per-task weights are real-valued:

Name       No. of tasks N   Total Wtot   wavg    wmin    wmax
blunt256   17 303           303 K        17.54   0.020   159.097
blunt512   93 231           314 K        3.36    0.004   66.150
blunt1024  372 824          352 K        0.94    0.001   41.104
post256    19 653           495 K        25.19   0.077   324.550
post512    134 950          569 K        4.22    0.015   109.200
post1024   539 994          802 K        1.49    0.004   154.678

Figs. 5(a) and (b), respectively. Abbreviations ending with "+" are used to represent our improved versions of these algorithms. That is, DP+, eBS+ and NC+ refer to our algorithms given in Figs. 6, 8 and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the eBS and eBS+ algorithms are effectively used as exact algorithms for the SpM dataset, with ε = 1 for the integer-valued workload arrays in that dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics and exact algorithms. In this table, percent load-imbalance values are computed as 100(B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to load-imbalance values of optimal partitions produced by exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics increases with increasing P. The only exceptions to these observations are 256-way partitioning of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e.,

Σ_{p=1}^{P−1} ΔSp / N, where ΔSp = SHp − SLp + 1.

For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3 and C5 display total range sizes obtained according to Lemma 4.1, Lemma 4.3 and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and eBS+ on the DVR dataset, we forced the eBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. Column eBS+&BID refers to this scheme, and values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the eBS and eBS+ columns in the SpM dataset shows that using BRB instead of Wtot


Table 3
Percent load-imbalance values

             Sparse-matrix dataset                DVR dataset
CCP instance   Heuristics            OPT    CCP instance    Heuristics            OPT
Name     P     H1     H2     RB             Name        P   H1     H2     RB

NL       16    2.60   2.44   1.20   0.35    blunt256   16    4.60   1.20   0.49   0.34
         32    5.02   5.75   3.44   0.95               32    6.93   2.61   1.94   1.12
         64    8.95   9.01   5.60   2.37               64   14.52   9.44   9.44   2.31
        128   33.13  27.16  22.78   4.99              128   38.25  24.39  16.67   4.82
        256   69.55  69.55  60.78  14.25              256   96.72  37.03  34.21  34.21
cre-d    16    2.27   0.98   0.53   0.45    blunt512   16    0.95   0.98   0.98   0.16
         32    4.19   4.42   3.74   1.03               32    1.38   1.38   1.18   0.33
         64    7.12   4.92   4.34   1.73               64    2.87   2.69   1.66   0.53
        128   25.57  18.73  16.70   2.88              128    5.62   8.45   4.62   0.97
        256   37.54  26.81  35.20  10.95              256   14.18  14.33   9.34   2.28
CQ9      16    1.85   1.85   0.58   0.58    blunt1024  16    0.94   0.57   0.95   0.10
         32    5.65   2.88   2.24   0.90               32    1.89   1.21   0.97   0.14
         64   13.25  11.49   7.64   1.43               64    4.99   2.16   1.44   0.26
        128   33.96  32.22  22.34   3.51              128   10.06   4.25   2.47   0.57
        256   58.62  58.62  58.62  14.72              256   19.65   9.68   9.68   0.98
ken-11   16    3.74   2.01   0.98   0.21    post256    16    1.10   1.43   0.76   0.56
         32    3.74   4.67   3.74   1.18               32    3.23   3.98   3.23   1.11
         64   13.17  13.17  13.17   1.29               64   17.28  11.04  10.90   3.10
        128   13.17  16.89  13.17   6.80              128   45.35  29.09  29.09   8.29
        256   50.99  50.99  50.99   7.11              256   67.86  67.86  67.86  67.86
mod2     16    0.06   0.06   0.06   0.03    post512    16    0.49   1.25   0.33   0.33
         32    0.19   0.19   0.19   0.07               32    0.94   1.61   0.90   0.58
         64    7.42   2.72   2.18   0.18               64    4.85   5.33   4.85   0.94
        128   16.15   6.29   2.46   0.41              128   18.03  14.55  10.15   1.72
        256   19.47  19.47  18.92   1.23              256   30.03  25.29  25.29   3.73
world    16    0.27   0.18   0.09   0.04    post1024   16    0.70   0.70   0.53   0.20
         32    1.03   0.37   0.27   0.08               32    1.49   1.49   1.41   0.54
         64    4.73   4.73   4.73   0.28               64    2.85   2.85   1.49   0.91
        128    6.37   6.37   6.37   0.76              128   13.15  11.79   9.10   1.11
        256   27.99  27.41  27.41   1.11              256   40.50  13.37  14.19   2.54

Averages over P
         16    1.76   1.15   0.74   0.36               16    1.44   1.01   1.04   0.28
         32    3.99   3.39   2.88   0.76               32    2.64   2.05   1.61   0.64
         64    9.01   6.78   5.69   1.43               64    7.89   5.58   4.96   1.31
        128   19.97  17.66  14.09   3.36              128   21.74  15.42  12.02   2.91
        256   43.89  38.91  36.04   9.18              256   44.82  27.93  26.76  18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the eBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in msecs) are split into prefix-sum times and partitioning times. In Tables 6 and 7, execution times of existing algorithms and their improved versions are listed in the same order under each respective classification so that improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both eBS and eBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither eBS nor eBS+ can be used as an exact algorithm on the DVR dataset, EBS as


Table 4
Sizes of separator-index ranges normalized with respect to N

             Sparse-matrix dataset                    DVR dataset
CCP instance  (N−P)P   L1     L3     C5      CCP instance  (N−P)P    L1      L3     C5
Name     P                                   Name       P

NL       16    15.97   0.32   0.13   0.036   blunt256   16    15.99    0.38   0.07   0.003
         32    31.86   1.23   0.58   0.198              32    31.94    1.22   0.30   0.144
         64    63.43   4.97   2.24   0.964              64    63.77    5.10   3.81   0.791
        128   125.69  19.81  19.60   4.305             128   127.06   21.66  12.88   2.784
        256   246.73  77.96  82.83  22.942             256   252.23  101.68  50.48  10.098
cre-d    16    15.97   0.16   0.04   0.019   blunt512   16    16.00    0.09   0.17   0.006
         32    31.89   0.76   0.90   0.137              32    31.99    0.35   0.26   0.043
         64    63.55   2.89   1.85   0.610              64    63.96    1.64   0.69   0.144
        128   126.18  11.59  15.12   2.833             128   127.83    6.65   4.59   0.571
        256   248.69  46.66  57.54  16.510             256   255.30   28.05  16.71   2.262
CQ9      16    15.97   0.30   0.03   0.023   blunt1024  16    16.00    0.05   0.09   0.010
         32    31.89   1.21   0.34   0.187              32    32.00    0.18   0.18   0.018
         64    63.57   4.84   2.96   0.737              64    63.99    0.77   0.74   0.068
        128   126.25  19.20  18.66   3.121             128   127.96    3.43   2.53   0.300
        256   248.96  74.84  86.80  23.432             256   255.82   13.92  20.36   1.648
ken-11   16    15.98   0.28   0.11   0.024   post256    16    15.99    0.35   0.04   0.021
         32    31.93   1.10   1.13   0.262              32    31.95    1.50   0.52   0.162
         64    63.73   4.42   7.34   0.655              64    63.79    6.12   4.26   0.932
        128   126.89  17.60  14.56   4.865             128   127.17   26.48  23.68   4.279
        256   251.56  69.18  85.44  13.076             256   252.68  124.76  90.48  16.485
mod2     16    15.99   0.16   0.00   0.002   post512    16    16.00    0.13   0.02   0.003
         32    31.97   0.55   0.04   0.013              32    31.99    0.56   0.14   0.063
         64    63.88   2.14   1.22   0.109              64    63.97    2.33   2.41   0.413
        128   127.53   8.62   2.59   0.411             128   127.88    9.33  10.15   1.714
        256   254.12  34.49  39.02   2.938             256   255.52   37.08  46.28   6.515
world    16    15.99   0.15   0.01   0.004   post1024   16    16.00    0.17   0.07   0.020
         32    31.97   0.59   0.06   0.020              32    32.00    0.54   0.27   0.087
         64    63.88   2.30   2.81   0.158              64    63.99    2.19   0.61   0.210
        128   127.53   9.23   7.19   0.850             128   127.97    8.77   9.48   0.918
        256   254.11  36.97  53.15   2.635             256   255.88   34.97  27.28   4.479

Averages over P
         16    15.97   0.21   0.06   0.018              16    16.00    0.20   0.08   0.011
         32    31.88   0.86   0.68   0.136              32    31.98    0.73   0.28   0.096
         64    63.52   3.43   2.63   0.516              64    63.91    3.03   2.09   0.426
        128   126.07  13.68  12.74   2.731             128   127.65   12.72  10.55   1.428
        256   248.26  54.28  57.55  14.089             256   254.57   56.74  41.93   6.915

L1, L3 and C5 denote separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5.


an exact algorithm for general workload arrays is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and eBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5
Number of probes performed by parametric-search algorithms

             Sparse-matrix dataset                DVR dataset
CCP instance  Parametric-search algorithms        CCP instance  Parametric-search algorithms
Name     P    NC−    NC    NC+   eBS  eBS+  EBS   Name       P    NC−    NC    NC+   eBS+&BID  EBS

NL       16    177    21     7    17    6    7    blunt256   16    202    18     6    16+1      9
         32    352    19     7    17    7    7               32    407    19     7    19+1     10
         64    720    27     5    16    7    6               64    835    18    10    21+1     12
        128   1450    40    14    17    8    8              128   1686    25    15    22+1     13
        256   2824    51    14    16    8    8              256   3556   203    62    23+1     11
cre-d    16    190    20     6    19    7    5    blunt512   16    245    20     8    17+1      9
         32    393    25    11    19    9    8               32    502    24    13    18+1     10
         64    790    24    10    19    8    6               64   1003    23    11    18+1     12
        128   1579    27    10    19    9    9              128   2023    26    12    20+1     13
        256   3183    37    13    18    9    9              256   4051    23    12    21+1     13
CQ9      16    182    19     3    18    6    5    blunt1024  16    271    21    10    16+1     13
         32    373    22     8    18    8    7               32    557    24    12    18+1     12
         64    740    29    12    18    8    8               64   1127    21    11    18+1     12
        128   1492    40    12    17    8    8              128   2276    28    13    19+1     13
        256   2971    50    14    18    9    9              256   4564    27    16    21+1     15
ken-11   16    185    21     6    17    6    6    post256    16    194    22     9    18+1      7
         32    364    20     6    17    6    6               32    409    19     8    20+1     12
         64    721    48     9    17    7    7               64    823    17    11    22+1     12
        128   1402    61     8    16    7    7              128   1653    19    13    22+1     14
        256   2783    96     7    17    8    8              256   3345   191   102    23+1     13
mod2     16    210    20     6    20    5    4    post512    16    235    24     9    16+1     10
         32    432    26     8    19    5    5               32    485    23    12    18+1     11
         64    867    24     6    19    8    8               64    975    22    13    20+1     14
        128   1727    38     7    20    7    7              128   1975    25    17    21+1     14
        256   3444    47     9    20    9    9              256   3947    29    20    22+1     15
world    16    211    22     6    19    5    4    post1024   16    261    27    10    17+1     13
         32    424    26     5    19    6    6               32    538    23    13    19+1     14
         64    865    30    12    19    9    9               64   1090    22    12    17+1     14
        128   1730    44    10    19    8    8              128   2201    25    15    21+1     16
        256   3441    44    11    19   10   10              256   4428    28    16    22+1     16

Averages over P
         16    193   20.5   5.7  18.5  5.8  5.2              16    235   22.0   8.7  16.7+1   10.2
         32    390   23.0   7.5  18.3  6.8  6.5              32    483   22.0  10.8  18.7+1   11.5
         64    784   30.3   9.0  18.0  7.8  7.3              64    976   20.5  11.3  19.3+1   12.7
        128   1564   41.7  10.2  18.0  7.8  7.8             128   1969   24.7  14.2  20.8+1   13.8
        256   3108   54.2  11.3  18.0  8.8  8.8             256   3982   83.5  38.0  22.0+1   13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106 and 56 with increasing number of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 1277, 578, 332, 159 and 71 times faster than DP for P = 16, 32, 64, 128 and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of separator-index bounding seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7 and 3.7 times faster than NC for P = 16, 32, 64, 128 and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5 and 2.7 times faster than NC for P = 16, 32, 64, 128 and 256, respectively. For the


Table 6
Partitioning times for the sparse-matrix dataset as percents of SpMxV times

CCP instance   Heuristics           Exact algorithms
Name     P     H1    H2    RB       Existing                        Proposed
                                    DP     MS    eBS     NC         DP+     MS+     eBS+   NC+    BID

NL       16    0.09  0.09  0.09      93    119   0.93    1.20       0.40    0.44    0.40   0.40   0.22
         32    0.18  0.18  0.13     177    194   1.77    2.17       1.73    1.73    0.80   0.80   0.44
         64    0.35  0.35  0.31     367    307   3.15    5.85       5.81    4.12    1.86   1.77   1.82
        128    0.89  0.84  0.71     748    485   6.30   17.12      19.87   26.92    4.26   7.32   4.21
        256    1.51  1.55  1.37    1461    757  11.09   40.31      91.80   96.41    9.80  15.70  21.06
cre-d    16    0.03  0.01  0.01      33     36   0.33    0.37       0.12    0.06    0.10   0.11   0.03
         32    0.04  0.06  0.06      71     61   0.65    0.93       0.47    1.20    0.29   0.33   0.10
         64    0.12  0.12  0.08     147     98   1.22    1.69       1.66    1.83    0.61   0.71   0.21
        128    0.28  0.29  0.14     283    150   2.41    3.59       5.47   10.94    1.68   1.68   0.61
        256    0.53  0.54  0.25     571    221   4.20   10.22      27.05   26.02    3.30   4.46   1.20
CQ9      16    0.04  0.04  0.04      59     73   0.48    0.59       0.20    0.13    0.15   0.13   0.17
         32    0.09  0.09  0.07     127    120   0.94    1.31       0.94    0.72    0.41   0.41   0.28
         64    0.20  0.17  0.15     248    195   1.85    3.18       3.01    4.47    1.00   1.44   0.68
        128    0.44  0.44  0.31     485    303   3.18    8.52       9.65   18.82    2.44   3.05   2.51
        256    0.92  0.92  0.63     982    469   6.51   20.33      69.56   72.20    5.32   7.45  14.92
ken-11   16    0.10  0.10  0.10     238    344   1.27    1.88       0.56    0.61    0.46   0.41   0.25
         32    0.20  0.20  0.15     501    522   2.29    3.10       3.82    4.78    1.22   1.17   1.63
         64    0.46  0.46  0.36     993    778   4.48   13.08       9.41   24.78    3.10   3.46   2.29
        128    1.17  1.17  0.71    1876   1139   9.21   39.59      50.13   50.03    6.41   6.62  17.40
        256    2.14  2.14  1.42    3827   1706  17.76  119.13     129.01  241.48   14.91  14.50  29.11
mod2     16    0.02  0.02  0.01      92    119   0.27    0.34       0.07    0.05    0.06   0.06   0.03
         32    0.04  0.04  0.02     188    192   0.51    0.78       0.19    0.18    0.16   0.15   0.07
         64    0.11  0.10  0.06     378    307   1.04    1.91       0.86    2.18    0.51   0.55   0.23
        128    0.24  0.23  0.11     751    482   2.28    5.92       2.61    3.72    1.06   1.23   0.53
        256    0.48  0.48  0.23    1486    756   4.26   11.14      12.11   50.04    2.91   3.43   3.66
world    16    0.02  0.02  0.02      99    122   0.29    0.33       0.08    0.05    0.07   0.08   0.03
         32    0.04  0.04  0.04     198    196   0.59    0.88       0.26    0.21    0.18   0.19   0.08
         64    0.12  0.11  0.09     385    306   1.16    1.70       1.09    4.84    0.64   0.54   0.35
        128    0.24  0.23  0.19     769    496   2.19    5.32       4.48   10.15    1.31   1.23   1.77
        256    0.47  0.47  0.36    1513    764   4.06   11.83      11.54   69.99    3.46   3.50   2.48

Averages over P
         16    0.05  0.05  0.05     102    136   0.60    0.78       0.24    0.22    0.21   0.20   0.12
         32    0.10  0.10  0.08     210    214   1.12    1.53       1.23    1.47    0.51   0.51   0.43
         64    0.23  0.22  0.18     420    332   2.15    4.57       3.64    7.04    1.29   1.41   0.93
        128    0.54  0.53  0.36     819    509   4.26   13.34      15.37   20.10    2.86   3.52   4.51
        256    1.01  1.01  0.71    1640    779   7.98   35.49      56.84   92.69    6.62   8.17  12.07


SpM dataset, eBS+ is 3.4, 2.5, 1.8, 1.6 and 1.3 times faster than eBS for P = 16, 32, 64, 128 and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and eBS+. Furthermore, the costs of initial probes with very large bottleneck values are very cheap in eBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small to moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both eBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where eBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in msecs) for the DVR dataset

CCP instance   Prefix-  Heuristics           Exact algorithms
Name      P    sum      H1    H2    RB       Existing                 Proposed
                                             DP     MS      NC        DP+     MS+      NC+    EBS    BID

blunt256  16            0.02  0.02  0.03       68     49    0.36      0.21    0.53     0.10   0.14    0.03
          32            0.05  0.05  0.05      141     77    0.77      0.76    0.76     0.21   0.27    0.24
          64    1.95    0.11  0.12  0.09      286    134    1.39      3.86    8.72     0.64   0.71    0.78
         128            0.24  0.25  0.20      581    206    3.66     12.37   29.95     1.78   1.84    2.33
         256            0.47  0.50  0.37     1139    296   52.53     42.89    0.78    13.96   3.11   56.69
blunt512  16            0.03  0.03  0.03      356    200    0.55      0.25    0.91     0.14   0.16    0.09
          32            0.07  0.07  0.07      792    353    1.31      1.23    2.13     0.44   0.43    0.22
          64   13.45    0.18  0.19  0.14     1688    593    2.65      3.68    9.28     0.94   1.10    0.45
         128            0.39  0.40  0.28     3469    979    5.84     15.01   46.72     2.29   2.63    1.65
         256            0.74  0.75  0.54     7040   1637    9.70     57.50  164.49     5.59   5.95   10.72
blunt1024 16            0.03  0.03  0.03     1455    780    0.75      0.93    4.77     0.25   0.28    0.19
          32            0.12  0.11  0.08     3251   1432    1.81      2.02    8.92     0.60   0.62    0.31
          64   59.12    0.27  0.27  0.19     6976   2353    3.50      7.35   22.28     1.27   1.37    1.39
         128            0.50  0.51  0.37    14337   3911    9.14     33.01   69.38     3.36   3.56   16.21
         256            0.97  1.00  0.68    29417   6567   16.59    180.09  538.10     8.48   8.98   16.29
post256   16            0.02  0.03  0.03       79     61    0.46      0.18    0.13     0.10   0.11    0.24
          32            0.05  0.05  0.05      157    100    0.77      1.05    1.74     0.26   0.31    0.76
          64    2.16    0.10  0.11  0.10      336    167    1.37      5.26    9.72     0.61   0.70    4.67
         128            0.25  0.26  0.18      672    278    2.90     22.94   55.10     1.65   1.77   23.96
         256            0.47  0.48  0.37     1371    338   45.16     84.78    0.77    13.04   3.62  299.32
post512   16            0.02  0.02  0.03      612    454    0.63      0.20    0.44     0.14   0.16    1.21
          32            0.07  0.07  0.07     1254    699    1.30      2.49    1.95     0.38   0.42    3.36
          64   20.55    0.18  0.18  0.13     2559   1139    2.58     16.93   38.73     0.97   1.07    8.53
         128            0.38  0.39  0.28     5221   1936    5.84     68.25  156.15     2.27   2.44   31.54
         256            0.70  0.73  0.53    10683   3354   12.67    256.48  726.54     5.44   5.39  140.88
post1024  16            0.03  0.04  0.03     2446   1863    0.91      2.88    5.73     0.17   0.21    2.63
          32            0.09  0.08  0.08     5056   2917    1.61     13.57   17.37     0.51   0.58   12.73
          64   69.09    0.25  0.26  0.17    10340   4626    3.56     35.05   33.25     1.35   1.47   32.53
         128            0.51  0.53  0.36    21102   7838    7.90    156.34  546.92     2.95   3.28   76.96
         256            0.95  0.98  0.67    43652  13770   16.30    764.59 1451.87     7.55   7.02  300.70

Averages normalized with respect to RB times (prefix-sum time not included)
          16            0.83  0.94  1.00    27863  18919   20.33     25.83   69.50     5.00   5.89   24.39
          32            1.10  1.06  1.00    23169  12156   18.47     47.37   72.82     5.83   6.46   39.02
          64            1.30  1.35  1.00    22636   9291   17.88     82.81  145.19     7.00   7.81   53.81
         128            1.35  1.39  1.00    22506   7553   20.46    168.36  481.19     8.60   9.31   86.81
         256            1.35  1.39  1.00    24731   6880   59.10    390.25  772.99    19.55  10.51  286.77

Averages normalized with respect to RB times (prefix-sum time included)
          16            1.00  1.00  1.00       32     23    1.08      1.04    1.09     1.01   1.02    1.03
          32            1.00  1.00  1.00       66     36    1.15      1.21    1.29     1.04   1.05    1.13
          64            1.00  1.00  1.00      135     58    1.27      1.97    2.90     1.11   1.12    1.55
         128            1.01  1.01  1.00      275     94    1.62      4.75   10.53     1.28   1.30    3.25
         256            1.02  1.02  1.00      528    142    7.98     14.64   13.71     2.95   1.55   26.73


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a pre-processing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] U.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftpcolbizuiowaedupubtestproblpgondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1d rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


4.4. Parametric search algorithms

In this work, we apply the theoretical findings given in Section 4.1 for an improved probe algorithm. The improved algorithm, which we call the restricted probe (RPROBE), exploits the bounds computed according to Corollary 4.5 (with Bf = BRB) to restrict the search space for the sp separator values during binary searches on the W array. That is, BINSRCH(W, SLp, SHp, Bsum) in RPROBE searches W in the index range [SLp, SHp] to find the index SLp ≤ sp ≤ SHp such that W[sp] ≤ Bsum and W[sp + 1] > Bsum, via binary search. This scheme and Corollaries 4.2 and 4.6 reduce the complexity of an individual probe to

Σ_{p=1}^{P} Θ(log Δp) = O(P log P + P log(wmax/wavg)).

Note that this complexity reduces to O(P log P) for sufficiently large P, where P = Ω(wmax/wavg). Figs. 8–10 illustrate the RPROBE algorithms tailored for the respective parametric-search algorithms.

4.4.1. Approximate bisection algorithm with dynamic separator-index bounding

The proposed bisection algorithm, illustrated in Fig. 8, searches the space of bottleneck values in the range [B*, BRB],

Fig. 8. Bisection as an ε-approximation algorithm with dynamic separator-index bounding.

Fig. 9. Exact bisection algorithm with dynamic separator-index bounding.

as opposed to [B*, Wtot]. In this algorithm, if PROBE(Bt) = TRUE, then the search space is restricted to values B ≤ Bt, and if PROBE(Bt) = FALSE, then the search space is restricted to values B > Bt. In this work, we exploit this simple observation to propose and develop a dynamic probing scheme that increases the efficiency of successive PROBE calls by modifying the separator-index bounds depending on the success or failure of the probes. Let Πt = ⟨t0, t1, ..., tP⟩ be the partition constructed by PROBE(Bt). Any future PROBE(B) call with B ≤ Bt will set the sp indices with sp ≤ tp; thus, the search space for sp can be restricted to those indices ≤ tp. Similarly, any future PROBE(B) call with B ≥ Bt will set the sp indices ≥ tp; thus, the search space for sp can be restricted to those indices ≥ tp.

As illustrated in Fig. 8, dynamic update of the separator-index bounds can be performed in Θ(P) time by a for-loop over the SL or SH arrays, depending on the failure or success, respectively, of RPROBE(Bt). In our implementation, however, this update is efficiently achieved in O(1) time through the pointer assignment SL ← Π or SH ← Π, depending on the failure or success of RPROBE(Bt).



Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is also an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, BRB < B* + wmax. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log wmax, and thus the overall complexity is O(N + P log N + log(wmax)(P log P + P log(wmax/wavg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by cleverly updating the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value, i.e., the total weight of a subchain of W. This reduces the search space to a finite set of realizable bottleneck values, as opposed to the infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps, finding the optimal bottleneck value.

After a probe RPROBE(Bt), the current upper bound value UB is modified if RPROBE(Bt) succeeds. Note that RPROBE(Bt) not only determines the feasibility of Bt, but also constructs a partition Πt with cost(Πt) ≤ Bt. Instead of reducing the upper bound UB to Bt, we can further reduce UB to the bottleneck value B = cost(Πt) ≤ Bt of the partition Πt constructed by RPROBE(Bt). Similarly, the current lower bound LB is modified when RPROBE(Bt) fails. In this case, instead of increasing the bound LB to Bt, we can exploit the partition Πt constructed by RPROBE(Bt) to increase LB further, to the smallest realizable bottleneck value B greater than Bt. Our bidding algorithm already describes how to compute

B frac14 min min1ppoP

fLp thorn wspthorn1gLP

where Lp denotes the load of processor Pp in Pt Fig 9presents the pseudocode of our algorithmEach bisection step divides the set of candidate

realizable bottleneck values into two sets and eliminatesone of them The initial set can have a size between 1and N2 Assuming the size of the eliminated set can beanything between 1 and N2 the expected complexity of


the algorithm is

    T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(wmax/wavg)),

which has the solution O(P log P log N + P log N log(wmax/wavg)). Here, O(P log P + P log(wmax/wavg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(wmax/wavg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

Theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = Wtot), success and failure of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Πt = ⟨0, t_1, t_2, …, t_{P−1}, N⟩ to be constructed by any PROBE(Bt) with LB_1 < Bt = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < Bt < W_{1,i_1}. Hence, probes with LB_1 < Bt < UB_1 for processor P_2 can be restricted to be performed in W_{i_1,N}, where W_{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array.

This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T_i^{P_p} denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T_{i,N} = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P_{p,P} = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value Bopt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T_{i_1,N}. That is, Bopt = min{B_1 = W_{1,i_1}, C(T_{i_1}^{P_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, then Bopt = min{ min_{1≤b≤p}{ B_b = W_{i_{b−1}+1, i_b} }, C(T_{i_p}^{P_p}) }.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i_{b−1}, i_b is found in the inner while-loop by conducting probes on W_{i_{b−1},N} to compute B_b = W_{i_{b−1}+1, i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1, j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1}+1, j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + wmax(P log P + P log(wmax/wavg))) for integer task weights, because the number of probes cannot exceed wmax, since there are at most wmax distinct bound values in the range [B*, B^RB]. For noninteger task weights, the complexity can be given as O(N + P log N + wmax(P log P)² + wmax P² log P log(wmax/wavg)), since the algorithm makes O(wmax P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and computing the separator-index bounds SL and SH according to Corollary 4.5.

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A_p^r of A and is responsible for computing y_p = A_p^r x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A_p^c of A and is responsible for computing y^p = A_p^c x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-Hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-Hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = Wtot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{i,j} can be efficiently computed as W_{i,j} = ROW[j + 1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS with rows and columns interchanged; thus, W_{i,j} is computed as W_{i,j} = COL[j + 1] − COL[i].

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{i,j} as W_{i,j} = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for the visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels, from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR: image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage of processors generating complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. The triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 × 256, 512 × 512, and 1024 × 1024 superimposed on a screen of resolution 1024 × 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because of the zero-weight tasks, which are compressed out to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms: H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in

Table 2
Properties of the test set

Sparse-matrix dataset (task weights are per-row/column nonzero counts):

Name     No. of tasks N   Total Wtot   wavg    wmin   wmax   SpMxV time (ms)
NL             7039         105 089    14.93     1     361       22.55
cre-d          8926         372 266    41.71     1     845       72.20
CQ9            9278         221 590    23.88     1     702       45.90
ken-11       14 694          82 454     5.61     2     243       19.65
mod2         34 774         604 910    17.40     1     941      124.05
world        34 506         582 064    16.87     1     972      119.45

Direct volume rendering (DVR) dataset:

Name        No. of tasks N   Total Wtot   wavg    wmin    wmax
blunt256        17 303         303 K      17.54   0.020   1590.97
blunt512        93 231         314 K       3.36   0.004    661.50
blunt1024      372 824         352 K       0.94   0.001    411.04
post256         19 653         495 K      25.19   0.077   3245.50
post512        134 950         569 K       4.22   0.015   1092.00
post1024       539 994         802 K       1.49   0.004   1546.78


Figs. 5(a) and (b), respectively. Abbreviations ending with "+" are used to represent our improved versions of these algorithms. That is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms on the SpM dataset, with ε = 1 for its integer-valued workload arrays. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics and exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of optimal partitions produced by exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges, normalized with respect to N, i.e., Σ_{p=1}^{P−1} ΔS_p / N, where ΔS_p = SH_p − SL_p + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of the decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms

depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. The column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B^RB instead of Wtot

Table 3
Percent load-imbalance values

Sparse-matrix dataset:

Name    P    H1     H2     RB     OPT
NL      16   2.60   2.44   1.20   0.35
        32   5.02   5.75   3.44   0.95
        64   8.95   9.01   5.60   2.37
        128  33.13  27.16  22.78  4.99
        256  69.55  69.55  60.78  14.25
cre-d   16   2.27   0.98   0.53   0.45
        32   4.19   4.42   3.74   1.03
        64   7.12   4.92   4.34   1.73
        128  25.57  18.73  16.70  2.88
        256  37.54  26.81  35.20  10.95
CQ9     16   1.85   1.85   0.58   0.58
        32   5.65   2.88   2.24   0.90
        64   13.25  11.49  7.64   1.43
        128  33.96  32.22  22.34  3.51
        256  58.62  58.62  58.62  14.72
ken-11  16   3.74   2.01   0.98   0.21
        32   3.74   4.67   3.74   1.18
        64   13.17  13.17  13.17  1.29
        128  13.17  16.89  13.17  6.80
        256  50.99  50.99  50.99  7.11
mod2    16   0.06   0.06   0.06   0.03
        32   0.19   0.19   0.19   0.07
        64   7.42   2.72   2.18   0.18
        128  16.15  6.29   2.46   0.41
        256  19.47  19.47  18.92  1.23
world   16   0.27   0.18   0.09   0.04
        32   1.03   0.37   0.27   0.08
        64   4.73   4.73   4.73   0.28
        128  6.37   6.37   6.37   0.76
        256  27.99  27.41  27.41  1.11

Averages over P:
        16   1.76   1.15   0.74   0.36
        32   3.99   3.39   2.88   0.76
        64   9.01   6.78   5.69   1.43
        128  19.97  17.66  14.09  3.36
        256  43.89  38.91  36.04  9.18

DVR dataset:

Name       P    H1     H2     RB     OPT
blunt256   16   4.60   1.20   0.49   0.34
           32   6.93   2.61   1.94   1.12
           64   14.52  9.44   9.44   2.31
           128  38.25  24.39  16.67  4.82
           256  96.72  37.03  34.21  34.21
blunt512   16   0.95   0.98   0.98   0.16
           32   1.38   1.38   1.18   0.33
           64   2.87   2.69   1.66   0.53
           128  5.62   8.45   4.62   0.97
           256  14.18  14.33  9.34   2.28
blunt1024  16   0.94   0.57   0.95   0.10
           32   1.89   1.21   0.97   0.14
           64   4.99   2.16   1.44   0.26
           128  10.06  4.25   2.47   0.57
           256  19.65  9.68   9.68   0.98
post256    16   1.10   1.43   0.76   0.56
           32   3.23   3.98   3.23   1.11
           64   17.28  11.04  10.90  3.10
           128  45.35  29.09  29.09  8.29
           256  67.86  67.86  67.86  67.86
post512    16   0.49   1.25   0.33   0.33
           32   0.94   1.61   0.90   0.58
           64   4.85   5.33   4.85   0.94
           128  18.03  14.55  10.15  1.72
           256  30.03  25.29  25.29  3.73
post1024   16   0.70   0.70   0.53   0.20
           32   1.49   1.49   1.41   0.54
           64   2.85   2.85   1.49   0.91
           128  13.15  11.79  9.10   1.11
           256  40.50  13.37  14.19  2.54

Averages over P:
           16   1.44   1.01   1.04   0.28
           32   2.64   2.05   1.61   0.64
           64   7.89   5.58   4.96   1.31
           128  21.74  15.42  12.02  2.91
           256  44.82  27.93  26.76  18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in ms) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification, so that improvements can be seen easily. In both tables, BID is listed separately, since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as

Table 4
Sizes of separator-index ranges, normalized with respect to N

Sparse-matrix dataset:

Name    P    (N−P)P   L1      L3      C5
NL      16   15.97    0.32    0.13    0.036
        32   31.86    1.23    0.58    0.198
        64   63.43    4.97    2.24    0.964
        128  125.69   19.81   19.60   4.305
        256  246.73   77.96   82.83   22.942
cre-d   16   15.97    0.16    0.04    0.019
        32   31.89    0.76    0.90    0.137
        64   63.55    2.89    1.85    0.610
        128  126.18   11.59   15.12   2.833
        256  248.69   46.66   57.54   16.510
CQ9     16   15.97    0.30    0.03    0.023
        32   31.89    1.21    0.34    0.187
        64   63.57    4.84    2.96    0.737
        128  126.25   19.20   18.66   3.121
        256  248.96   74.84   86.80   23.432
ken-11  16   15.98    0.28    0.11    0.024
        32   31.93    1.10    1.13    0.262
        64   63.73    4.42    7.34    0.655
        128  126.89   17.60   14.56   4.865
        256  251.56   69.18   85.44   13.076
mod2    16   15.99    0.16    0.00    0.002
        32   31.97    0.55    0.04    0.013
        64   63.88    2.14    1.22    0.109
        128  127.53   8.62    2.59    0.411
        256  254.12   34.49   39.02   2.938
world   16   15.99    0.15    0.01    0.004
        32   31.97    0.59    0.06    0.020
        64   63.88    2.30    2.81    0.158
        128  127.53   9.23    7.19    0.850
        256  254.11   36.97   53.15   2.635

Averages over P:
        16   15.97    0.21    0.06    0.018
        32   31.88    0.86    0.68    0.136
        64   63.52    3.43    2.63    0.516
        128  126.07   13.68   12.74   2.731
        256  248.26   54.28   57.55   14.089

DVR dataset:

Name       P    (N−P)P   L1      L3     C5
blunt256   16   15.99    0.38    0.07   0.003
           32   31.94    1.22    0.30   0.144
           64   63.77    5.10    3.81   0.791
           128  127.06   21.66   12.88  2.784
           256  252.23   101.68  50.48  10.098
blunt512   16   16.00    0.09    0.17   0.006
           32   31.99    0.35    0.26   0.043
           64   63.96    1.64    0.69   0.144
           128  127.83   6.65    4.59   0.571
           256  255.30   28.05   16.71  2.262
blunt1024  16   16.00    0.05    0.09   0.010
           32   32.00    0.18    0.18   0.018
           64   63.99    0.77    0.74   0.068
           128  127.96   3.43    2.53   0.300
           256  255.82   13.92   20.36  1.648
post256    16   15.99    0.35    0.04   0.021
           32   31.95    1.50    0.52   0.162
           64   63.79    6.12    4.26   0.932
           128  127.17   26.48   23.68  4.279
           256  252.68   124.76  90.48  16.485
post512    16   16.00    0.13    0.02   0.003
           32   31.99    0.56    0.14   0.063
           64   63.97    2.33    2.41   0.413
           128  127.88   9.33    10.15  1.714
           256  255.52   37.08   46.28  6.515
post1024   16   16.00    0.17    0.07   0.020
           32   32.00    0.54    0.27   0.087
           64   63.99    2.19    0.61   0.210
           128  127.97   8.77    9.48   0.918
           256  255.88   34.97   27.28  4.479

Averages over P:
           16   16.00    0.20    0.08   0.011
           32   31.98    0.73    0.28   0.096
           64   63.91    3.03    2.09   0.426
           128  127.65   12.72   10.55  1.428
           256  254.57   56.74   41.93  6.915

L1, L3, and C5 denote the separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5, respectively.


an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, a relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that

Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset:

Name    P    NC-   NC    NC+   εBS   εBS+  EBS
NL      16   177   21    7     17    6     7
        32   352   19    7     17    7     7
        64   720   27    5     16    7     6
        128  1450  40    14    17    8     8
        256  2824  51    14    16    8     8
cre-d   16   190   20    6     19    7     5
        32   393   25    11    19    9     8
        64   790   24    10    19    8     6
        128  1579  27    10    19    9     9
        256  3183  37    13    18    9     9
CQ9     16   182   19    3     18    6     5
        32   373   22    8     18    8     7
        64   740   29    12    18    8     8
        128  1492  40    12    17    8     8
        256  2971  50    14    18    9     9
ken-11  16   185   21    6     17    6     6
        32   364   20    6     17    6     6
        64   721   48    9     17    7     7
        128  1402  61    8     16    7     7
        256  2783  96    7     17    8     8
mod2    16   210   20    6     20    5     4
        32   432   26    8     19    5     5
        64   867   24    6     19    8     8
        128  1727  38    7     20    7     7
        256  3444  47    9     20    9     9
world   16   211   22    6     19    5     4
        32   424   26    5     19    6     6
        64   865   30    12    19    9     9
        128  1730  44    10    19    8     8
        256  3441  44    11    19    10    10

Averages over P:
        16   193   20.5  5.7   18.5  5.8   5.2
        32   390   23.0  7.5   18.3  6.8   6.5
        64   784   30.3  9.0   18.0  7.8   7.3
        128  1564  41.7  10.2  18.0  7.8   7.8
        256  3108  54.2  11.3  18.0  8.8   8.8

DVR dataset:

Name       P    NC-   NC   NC+  εBS+&BID  EBS
blunt256   16   202   18   6    16 + 1    9
           32   407   19   7    19 + 1    10
           64   835   18   10   21 + 1    12
           128  1686  25   15   22 + 1    13
           256  3556  203  62   23 + 1    11
blunt512   16   245   20   8    17 + 1    9
           32   502   24   13   18 + 1    10
           64   1003  23   11   18 + 1    12
           128  2023  26   12   20 + 1    13
           256  4051  23   12   21 + 1    13
blunt1024  16   271   21   10   16 + 1    13
           32   557   24   12   18 + 1    12
           64   1127  21   11   18 + 1    12
           128  2276  28   13   19 + 1    13
           256  4564  27   16   21 + 1    15
post256    16   194   22   9    18 + 1    7
           32   409   19   8    20 + 1    12
           64   823   17   11   22 + 1    12
           128  1653  19   13   22 + 1    14
           256  3345  191  102  23 + 1    13
post512    16   235   24   9    16 + 1    10
           32   485   23   12   18 + 1    11
           64   975   22   13   20 + 1    14
           128  1975  25   17   21 + 1    14
           256  3947  29   20   22 + 1    15
post1024   16   261   27   10   17 + 1    13
           32   538   23   13   19 + 1    14
           64   1090  22   12   17 + 1    14
           128  2201  25   15   21 + 1    16
           256  4428  28   16   22 + 1    16

Averages over P:
           16   235   22.0  8.7   16.7 + 1  10.2
           32   483   22.0  10.8  18.7 + 1  11.5
           64   976   20.5  11.3  19.3 + 1  12.7
           128  1969  24.7  14.2  20.8 + 1  13.8
           256  3982  83.5  38.0  22.0 + 1  13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 63.0 times faster than DP on average in 16-way partitioning, and this ratio decreases to 37.8, 18.9, 10.6, and 5.6 with increasing numbers of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 127.7, 57.8, 33.2, 15.9, and 7.1 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the

Table 6
Partitioning times for the sparse-matrix dataset, as percents of SpMxV times
(H1, H2, RB: heuristics; DP, MS, εBS, NC: existing exact algorithms; DP+, MS+, εBS+, NC+, BID: proposed.)

Name    P    H1    H2    RB    DP     MS     εBS    NC      DP+     MS+     εBS+   NC+    BID
NL      16   0.09  0.09  0.09  9.3    11.9   0.93   1.20    0.40    0.44    0.40   0.40   0.22
        32   0.18  0.18  0.13  17.7   19.4   1.77   2.17    1.73    1.73    0.80   0.80   0.44
        64   0.35  0.35  0.31  36.7   30.7   3.15   5.85    5.81    4.12    1.86   1.77   1.82
        128  0.89  0.84  0.71  74.8   48.5   6.30   17.12   19.87   26.92   4.26   7.32   4.21
        256  1.51  1.55  1.37  146.1  75.7   11.09  40.31   91.80   96.41   9.80   15.70  21.06
cre-d   16   0.03  0.01  0.01  3.3    3.6    0.33   0.37    0.12    0.06    0.10   0.11   0.03
        32   0.04  0.06  0.06  7.1    6.1    0.65   0.93    0.47    1.20    0.29   0.33   0.10
        64   0.12  0.12  0.08  14.7   9.8    1.22   1.69    1.66    1.83    0.61   0.71   0.21
        128  0.28  0.29  0.14  28.3   15.0   2.41   3.59    5.47    10.94   1.68   1.68   0.61
        256  0.53  0.54  0.25  57.1   22.1   4.20   10.22   27.05   26.02   3.30   4.46   1.20
CQ9     16   0.04  0.04  0.04  5.9    7.3    0.48   0.59    0.20    0.13    0.15   0.13   0.17
        32   0.09  0.09  0.07  12.7   12.0   0.94   1.31    0.94    0.72    0.41   0.41   0.28
        64   0.20  0.17  0.15  24.8   19.5   1.85   3.18    3.01    4.47    1.00   1.44   0.68
        128  0.44  0.44  0.31  48.5   30.3   3.18   8.52    9.65    18.82   2.44   3.05   2.51
        256  0.92  0.92  0.63  98.2   46.9   6.51   20.33   69.56   72.2    5.32   7.45   14.92
ken-11  16   0.10  0.10  0.10  23.8   34.4   1.27   1.88    0.56    0.61    0.46   0.41   0.25
        32   0.20  0.20  0.15  50.1   52.2   2.29   3.10    3.82    4.78    1.22   1.17   1.63
        64   0.46  0.46  0.36  99.3   77.8   4.48   13.08   9.41    24.78   3.10   3.46   2.29
        128  1.17  1.17  0.71  187.6  113.9  9.21   39.59   50.13   50.03   6.41   6.62   17.4
        256  2.14  2.14  1.42  382.7  170.6  17.76  119.13  129.01  241.48  14.91  14.50  29.11
mod2    16   0.02  0.02  0.01  9.2    11.9   0.27   0.34    0.07    0.05    0.06   0.06   0.03
        32   0.04  0.04  0.02  18.8   19.2   0.51   0.78    0.19    0.18    0.16   0.15   0.07
        64   0.11  0.10  0.06  37.8   30.7   1.04   1.91    0.86    2.18    0.51   0.55   0.23
        128  0.24  0.23  0.11  75.1   48.2   2.28   5.92    2.61    3.72    1.06   1.23   0.53
        256  0.48  0.48  0.23  148.6  75.6   4.26   11.14   12.11   50.04   2.91   3.43   3.66
world   16   0.02  0.02  0.02  9.9    12.2   0.29   0.33    0.08    0.05    0.07   0.08   0.03
        32   0.04  0.04  0.04  19.8   19.6   0.59   0.88    0.26    0.21    0.18   0.19   0.08
        64   0.12  0.11  0.09  38.5   30.6   1.16   1.70    1.09    4.84    0.64   0.54   0.35
        128  0.24  0.23  0.19  76.9   49.6   2.19   5.32    4.48    10.15   1.31   1.23   1.77
        256  0.47  0.47  0.36  151.3  76.4   4.06   11.83   11.54   69.99   3.46   3.50   2.48

Averages over P:
        16   0.05  0.05  0.05  10.2   13.6   0.60   0.78    0.24    0.22    0.21   0.20   0.12
        32   0.10  0.10  0.08  21.0   21.4   1.12   1.53    1.23    1.47    0.51   0.51   0.43
        64   0.23  0.22  0.18  42.0   33.2   2.15   4.57    3.64    7.04    1.29   1.41   0.93
        128  0.54  0.53  0.36  81.9   50.9   4.26   13.34   15.37   20.1    2.86   3.52   4.51
        256  1.01  1.01  0.71  164.0  77.9   7.98   35.49   56.84   92.69   6.62   8.17   12.07


SpM dataset, eBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than eBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and eBS+. Furthermore, the costs of initial probes with very large bottleneck values are very cheap in eBS.

In Table 6, a relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small to moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both eBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where eBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in msecs) for the DVR dataset

CCP instance    Heuristics          Existing               Proposed
Name       P    H1    H2    RB      DP      MS      NC     DP+     MS+      NC+    EBS    BID

blunt256 (prefix sum: 1.95)
          16    0.02  0.02  0.03    6.8     4.9     0.36   0.21    0.53     0.10   0.14   0.03
          32    0.05  0.05  0.05    14.1    7.7     0.77   0.76    0.76     0.21   0.27   0.24
          64    0.11  0.12  0.09    28.6    13.4    1.39   3.86    8.72     0.64   0.71   0.78
         128    0.24  0.25  0.20    58.1    20.6    3.66   12.37   29.95    1.78   1.84   2.33
         256    0.47  0.50  0.37    113.9   29.6    52.53  42.89   0.78     13.96  3.11   56.69
blunt512 (prefix sum: 13.45)
          16    0.03  0.03  0.03    35.6    20.0    0.55   0.25    0.91     0.14   0.16   0.09
          32    0.07  0.07  0.07    79.2    35.3    1.31   1.23    2.13     0.44   0.43   0.22
          64    0.18  0.19  0.14    168.8   59.3    2.65   3.68    9.28     0.94   1.10   0.45
         128    0.39  0.40  0.28    346.9   97.9    5.84   15.01   46.72    2.29   2.63   1.65
         256    0.74  0.75  0.54    704.0   163.7   9.70   57.50   164.49   5.59   5.95   10.72
blunt1024 (prefix sum: 59.12)
          16    0.03  0.03  0.03    145.5   78.0    0.75   0.93    4.77     0.25   0.28   0.19
          32    0.12  0.11  0.08    325.1   143.2   1.81   2.02    8.92     0.60   0.62   0.31
          64    0.27  0.27  0.19    697.6   235.3   3.50   7.35    22.28    1.27   1.37   1.39
         128    0.50  0.51  0.37    1433.7  391.1   9.14   33.01   69.38    3.36   3.56   16.21
         256    0.97  1.00  0.68    2941.7  656.7   16.59  180.09  538.10   8.48   8.98   16.29
post256 (prefix sum: 2.16)
          16    0.02  0.03  0.03    7.9     6.1     0.46   0.18    0.13     0.10   0.11   0.24
          32    0.05  0.05  0.05    15.7    10.0    0.77   1.05    1.74     0.26   0.31   0.76
          64    0.10  0.11  0.10    33.6    16.7    1.37   5.26    9.72     0.61   0.70   4.67
         128    0.25  0.26  0.18    67.2    27.8    2.90   22.94   55.10    1.65   1.77   23.96
         256    0.47  0.48  0.37    137.1   33.8    45.16  84.78   0.77     13.04  3.62   299.32
post512 (prefix sum: 20.55)
          16    0.02  0.02  0.03    61.2    45.4    0.63   0.20    0.44     0.14   0.16   1.21
          32    0.07  0.07  0.07    125.4   69.9    1.30   2.49    1.95     0.38   0.42   3.36
          64    0.18  0.18  0.13    255.9   113.9   2.58   16.93   38.73    0.97   1.07   8.53
         128    0.38  0.39  0.28    522.1   193.6   5.84   68.25   156.15   2.27   2.44   31.54
         256    0.70  0.73  0.53    1068.3  335.4   12.67  256.48  726.54   5.44   5.39   140.88
post1024 (prefix sum: 69.09)
          16    0.03  0.04  0.03    244.6   186.3   0.91   2.88    5.73     0.17   0.21   2.63
          32    0.09  0.08  0.08    505.6   291.7   1.61   13.57   17.37    0.51   0.58   12.73
          64    0.25  0.26  0.17    1034.0  462.6   3.56   35.05   33.25    1.35   1.47   32.53
         128    0.51  0.53  0.36    2110.2  783.8   7.90   156.34  546.92   2.95   3.28   76.96
         256    0.95  0.98  0.67    4365.2  1377.0  16.30  764.59  1451.87  7.55   7.02   300.70

Averages normalized with respect to RB times (prefix-sum time not included)
          16    0.83  0.94  1.00    2786.3  1891.9  20.33  25.83   69.50    5.00   5.89   24.39
          32    1.10  1.06  1.00    2316.9  1215.6  18.47  47.37   72.82    5.83   6.46   39.02
          64    1.30  1.35  1.00    2263.6  929.1   17.88  82.81   145.19   7.00   7.81   53.81
         128    1.35  1.39  1.00    2250.6  755.3   20.46  168.36  481.19   8.60   9.31   86.81
         256    1.35  1.39  1.00    2473.1  688.0   59.10  390.25  772.99   19.55  10.51  286.77

Averages normalized with respect to RB times (prefix-sum time included)
          16    1.00  1.00  1.00    3.2     2.3     1.08   1.04    1.09     1.01   1.02   1.03
          32    1.00  1.00  1.00    6.6     3.6     1.15   1.21    1.29     1.04   1.05   1.13
          64    1.00  1.00  1.00    13.5    5.8     1.27   1.97    2.90     1.11   1.12   1.55
         128    1.01  1.01  1.00    27.5    9.4     1.62   4.75    10.53    1.28   1.30   3.25
         256    1.02  1.02  1.00    52.8    14.2    7.98   14.64   13.71    2.95   1.55   26.73


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix–vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse matrix–vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


Fig. 10. Nicol's algorithm with dynamic separator-index bounding.


Similar to the ε-BISECT algorithm, the proposed ε-BISECT+ algorithm is an ε-approximation algorithm for general workload arrays. However, both the ε-BISECT and ε-BISECT+ algorithms become exact algorithms for integer-valued workload arrays by setting ε = 1. As shown in Lemma 3.1, B_RB < B* + w_max. Hence, for integer-valued workload arrays, the maximum number of probe calls in the ε-BISECT+ algorithm is log w_max, and thus the overall complexity is O(N + P log N + log(w_max)(P log P + P log(w_max/w_avg))) under model M. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.
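For reference, the conventional probe that underlies these bounds can be sketched as follows. This is our generic O(P log N) illustration on the prefix-summed workload array, not the paper's RPROBE, which additionally restricts each binary search to the bounded separator ranges:

```c
/* Feasibility probe: can tasks 1..N be split into at most P consecutive
 * parts, each of load <= B?  W is the prefix-summed workload array with
 * W[0] = 0 and W[i] = w_1 + ... + w_i, so the load of tasks i+1..j is
 * W[j] - W[i].  Runs in O(P log N) time via binary search. */
int probe(const double *W, int N, int P, double B)
{
    int i = 0;                       /* last task assigned so far */
    for (int p = 0; p < P && i < N; p++) {
        /* largest j with W[j] - W[i] <= B (upper-bound binary search) */
        int lo = i, hi = N;
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2;
            if (W[mid] - W[i] <= B) lo = mid; else hi = mid - 1;
        }
        if (lo == i) return 0;       /* a single task exceeds B: infeasible */
        i = lo;
    }
    return i == N;                   /* feasible iff all tasks were assigned */
}
```

For example, with task weights 2, 3, 1, 4 and P = 2, bottleneck value 5 is feasible ({2,3 | 1,4}) while 4 is not.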

4.4.2. Bisection as an exact algorithm

In this section, we enhance the bisection algorithm to be an exact algorithm for general workload arrays by clever updating of the lower and upper bounds after each probe. The idea is, after each probe, to move the upper and lower bounds on the value of an optimal solution to a realizable bottleneck value, i.e., the total weight of some subchain of W. This reduces the search space to a finite set of realizable bottleneck values, as opposed to an infinite space of bottleneck values defined by a range [LB, UB]. Each bisection step is designed to eliminate at least one candidate value, and thus the algorithm terminates in a finite number of steps, finding the optimal bottleneck value.

After a probe RPROBE(B_t), the current upper bound value UB is modified if RPROBE(B_t) succeeds. Note that RPROBE(B_t) not only determines the feasibility of B_t but also constructs a partition Π_t with cost(Π_t) ≤ B_t. Instead of reducing the upper bound UB to B_t, we can further reduce UB to the bottleneck value B = cost(Π_t) ≤ B_t of the partition Π_t constructed by RPROBE(B_t). Similarly, the current lower bound LB is modified when RPROBE(B_t) fails. In this case, instead of increasing the bound LB to B_t, we can exploit the partition Π_t constructed by RPROBE(B_t) to increase LB further, to the smallest realizable bottleneck value B greater than B_t. Our bidding algorithm already describes how to compute

    B = min{ min_{1 ≤ p < P} { L_p + w_{s_p+1} },  L_P },

where L_p denotes the load of processor P_p in Π_t. Fig. 9 presents the pseudocode of our algorithm.

Each bisection step divides the set of candidate realizable bottleneck values into two sets and eliminates one of them. The initial set can have a size between 1 and N². Assuming the size of the eliminated set can be anything between 1 and N², the expected complexity of


the algorithm is

    T(N) = (1/N²) Σ_{i=1}^{N²} T(i) + O(P log P + P log(w_max/w_avg)),

which has the solution O(P log P log N + P log N log(w_max/w_avg)). Here, O(P log P + P log(w_max/w_avg)) is the cost of a probe operation, and log N is the expected number of probes. Thus the overall complexity becomes O(N + P log P log N + P log N log(w_max/w_avg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
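The scheme can be sketched compactly in C. This is our illustration, not the paper's Fig. 9 pseudocode: a naive O(N) greedy pass serves both as the probe and as the source of the realizable bound updates, and the function names (`greedy_cost`, `exact_bisect`) are ours:

```c
/* Greedy pass for bottleneck value B: builds the greedy partition,
 * returns its actual cost, and reports in *bid the value
 *     min( min_{1<=p<P} { L_p + w_{s_p+1} },  L_P ),
 * i.e., the smallest realizable bottleneck value greater than B when
 * the probe fails.  W is the prefix-summed workload array, W[0] = 0. */
static double greedy_cost(const double *W, int N, int P, double B, double *bid)
{
    double cost = 0.0, best_bid = W[N];
    int i = 0;
    for (int p = 0; p < P - 1 && i < N; p++) {
        int j = i;
        while (j < N && W[j + 1] - W[i] <= B) j++;   /* longest prefix <= B */
        double Lp = W[j] - W[i];
        if (Lp > cost) cost = Lp;
        if (j < N && Lp + (W[j + 1] - W[j]) < best_bid)
            best_bid = Lp + (W[j + 1] - W[j]);       /* L_p plus next task */
        i = j;
    }
    double Llast = W[N] - W[i];                      /* last part: remainder */
    if (Llast > cost) cost = Llast;
    if (Llast < best_bid) best_bid = Llast;
    *bid = best_bid;
    return cost;
}

/* Exact bisection: both bounds move to realizable values after every
 * probe, so the search runs over a finite candidate set and stops at
 * the exact optimum even for real-valued task weights. */
double exact_bisect(const double *W, int N, int P)
{
    double LB = W[N] / P;            /* ideal-balance lower bound */
    double UB = W[N];                /* W_tot is always realizable */
    while (LB < UB) {
        double bid, Bt = (LB + UB) / 2.0;
        double cost = greedy_cost(W, N, P, Bt, &bid);
        if (cost <= Bt) UB = cost;   /* success: drop UB to realized cost */
        else LB = bid;               /* failure: raise LB to next realizable */
    }
    return UB;
}
```

Each iteration strictly tightens one bound to a realizable value drawn from a finite set, which is exactly why the loop terminates with LB = UB = B_opt.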

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

The theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables the use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB⁰ = 0, UB⁰ = W_tot), the successes and failures of these probes narrow this range to (LB¹, UB¹). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB¹ = W_{1,i_1−1}, UB¹ = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π_t = ⟨t_0 = 0, t_1, t_2, …, t_{P−1}, t_P = N⟩ to be constructed by any PROBE(B_t) with LB¹ < B_t = W_{i_1,j} < UB¹ will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < B_t < W_{1,i_1}. Hence, probes with LB¹ < B_t < UB¹ for processor P_2 can be restricted to W^{i_1,N}, where W^{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T^{P_p}_i denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T^{i,N} = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P^{p,P} = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T^{i_1,N}; that is, B_opt = min{B_1 = W_{1,i_1}, C(T^{P_1}_{i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(T^{P_p}_{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for-loop, given i_{b−1}, i_b is found in the inner while-loop by conducting probes on W^{i_{b−1},N} to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted; that is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1},j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1},j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + w_max(P log P + P log(w_max/w_avg))) for integer task weights, because the number of probes cannot exceed w_max, since there are at most w_max distinct bound values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + w_max(P log P)² + w_max P² log P log(w_max/w_avg)), since the algorithm makes O(w_max P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.
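For concreteness, the candidate-based search that Fig. 10 refines can be sketched in C as follows. This is our compact rendering without RPROBE or the separator-index restrictions; the conventional probe is repeated here so the sketch is self-contained:

```c
/* Feasibility probe on the prefix-summed workload array W (W[0] = 0):
 * can the chain be split into at most P parts of load <= B each? */
static int probe(const double *W, int N, int P, double B)
{
    int i = 0;
    for (int p = 0; p < P && i < N; p++) {
        int lo = i, hi = N;
        while (lo < hi) {                       /* largest j: W[j]-W[i] <= B */
            int mid = lo + (hi - lo + 1) / 2;
            if (W[mid] - W[i] <= B) lo = mid; else hi = mid - 1;
        }
        if (lo == i) return 0;                  /* one task alone exceeds B */
        i = lo;
    }
    return i == N;
}

/* Nicol-style search: for each processor in turn, binary-search the
 * smallest feasible prefix weight of the remaining suffix; the optimal
 * bottleneck is the minimum over all candidate values. */
double nicol_opt(const double *W, int N, int P)
{
    int s = 0;                        /* current part starts at task s+1 */
    double best = W[N];               /* W_tot is always feasible */
    for (int b = 0; b < P - 1 && s < N; b++) {
        int lo = s + 1, hi = N;       /* smallest j with feasible W[j]-W[s] */
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (probe(W, N, P, W[mid] - W[s])) hi = mid; else lo = mid + 1;
        }
        if (W[lo] - W[s] < best) best = W[lo] - W[s];
        s = lo - 1;                   /* the separator settles one task earlier */
    }
    if (W[N] - W[s] < best) best = W[N] - W[s];   /* last processor's load */
    return best;
}
```

For task weights 1, 1, 10, 1, 1 and P = 3, the search settles on bottleneck value 10, realized by the partition {1,1 | 10 | 1,1}.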

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A^r_p of A and is responsible for computing y_p = A^r_p x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A^c_p of A and is responsible for computing y^p = A^c_p x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the


atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load-balancing problem for rowwise or columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load-balancing problem can be described as the number partitioning problem, which is NP-hard [9]. By allowing both row and column reordering, the problem of minimizing the communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements, due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = W_tot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_ij can be efficiently computed as W_ij = ROW[j + 1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS with rows and columns interchanged; thus, W_ij is computed as W_ij = COL[j + 1] − COL[i].
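The O(1) subchain-weight computation on the CRS row-pointer array can be illustrated as follows (0-based row indices; the function name is ours):

```c
/* Number of nonzeros in rows i..j (inclusive, 0-based) of a CRS matrix.
 * ROW has N+1 entries with ROW[0] = 0 and ROW[N] = NZ, so the row-pointer
 * array already acts as a prefix-summed workload array: no extra
 * prefix-sum pass is needed before partitioning. */
int subchain_weight(const int *ROW, int i, int j)
{
    return ROW[j + 1] - ROW[i];   /* = W_ij */
}
```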

5.1.1. A better load-balancing model for iterative solvers

The load-balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we investigate this problem for a coarse-grain formulation [2,3,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute the pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_ij as W_ij = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, the screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to the uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workloads, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. With this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those incurred by counting shared primitives multiple times while adding mesh-cell weights.
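The tallying step can be sketched as follows; the array layout (per-primitive lists of intersected cell indices in CRS-like form) and the function name are our assumptions, not the paper's data structures:

```c
/* Inverse-area heuristic (sketch): each primitive adds 1/k to every
 * coarse-mesh cell it intersects, where k is the number of cells the
 * primitive intersects, so summing cell weights over a region
 * approximates its primitive count without multiple counting.
 * cells[start[p] .. start[p+1]-1] lists the cells intersected by
 * primitive p. */
void inverse_area_weights(double *w, int ncells,
                          const int *cells, const int *start, int nprims)
{
    for (int c = 0; c < ncells; c++) w[c] = 0.0;
    for (int p = 0; p < nprims; p++) {
        int k = start[p + 1] - start[p];        /* cells intersected by p */
        for (int q = start[p]; q < start[p + 1]; q++)
            w[cells[q]] += 1.0 / k;             /* inverse-area share */
    }
}
```

Note that the resulting cell weights are real-valued, which is why the DVR workload arrays call for CCP algorithms that handle noninteger task weights.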

the decomposition is expected to minimize the commu-nication overhead due to the shared primitives 1Ddecomposition ie horizontal or vertical striping of thescreen suffers from unscalability A Hilbert space-fillingcurve [31] is widely used for 2D decomposition of 2Dnonuniform workloads In this scheme [20] the 2Dcoarse mesh superimposed on the screen is traversedaccording to the Hilbert curve to map the 2D coarsemesh to a 1D chain of mesh cells The load-balancingproblem in this decomposition scheme then reduces tothe CCP problem Using a Hilbert curve as the space-

filling curve is an implicit effort towards reducing thetotal perimeter since the Hilbert curve avoids jumpsduring the traversal of the 2D coarse mesh Note thatthe 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic usedfor computing weights of the coarse-mesh cells
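To make this construction concrete, the following sketch builds the 1D workload array for a coarse mesh: each primitive's bounding box adds 1/c to every cell it covers, where c is the number of covered cells, and the cells are then linearized in Hilbert order. This is a minimal illustration under our own naming (the function names are ours, and the paper's implementation is in C, not Python); the Hilbert index-to-coordinate conversion is the standard bit-manipulation routine.

```python
def hilbert_d2xy(n, d):
    """Map index d along the Hilbert curve to (x, y) in an n x n grid (n a power of 2)."""
    x = y = 0
    s = 1
    t = d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant so sub-curves connect
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def workload_chain(n, primitives):
    """primitives: bounding boxes (x0, y0, x1, y1) as inclusive cell ranges.
    Returns the 1D task chain: cell weights in Hilbert order (inverse-area heuristic)."""
    w = [[0.0] * n for _ in range(n)]
    for (x0, y0, x1, y1) in primitives:
        c = (x1 - x0 + 1) * (y1 - y0 + 1)  # number of cells the primitive intersects
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                w[x][y] += 1.0 / c  # each cell gets 1/c, so one primitive contributes 1 in total
    return [w[x][y] for x, y in (hilbert_d2xy(n, d) for d in range(n * n))]
```

Note that the chain's total weight equals the number of primitives regardless of sharing, and consecutive chain entries are always grid neighbors, which is the "no jumps" property exploited above.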

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC processor and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our

experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for the sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA

Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, which represent the results of computational fluid dynamics simulations and are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedrons. Triangular faces of tetrahedrons constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256×256, 512×512, and 1024×1024 superimposed on a screen of resolution 1024×1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because the zero-weight tasks are compressed so that only nonzero-weight tasks are retained.

The following abbreviations are used for the CCP

algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2

Properties of the test set

Sparse-matrix dataset Direct volume rendering (DVR) dataset

Name   N (tasks)   Wtot   wavg   wmin   wmax   SpMxV time (ms)   |   Name   N (tasks)   Wtot   wavg   wmin   wmax

(wavg, wmin, wmax: per-row/column task weights for the SpM dataset; per-task weights for the DVR dataset)

NL 7039 105 089 1493 1 361 2255 blunt256 17 303 303 K 1754 0020 159097

cre-d 8926 372 266 4171 1 845 7220 blunt512 93 231 314 K 336 0004 66150

CQ9 9278 221 590 2388 1 702 4590 blunt1024 372 824 352 K 094 0001 41104

ken-11 14 694 82 454 561 2 243 1965 post256 19 653 495 K 2519 0077 324550

mod2 34 774 604 910 1740 1 941 12405 post512 134 950 569 K 422 0015 109200

world 34 506 582 064 1687 1 972 11945 post1024 539 994 802 K 149 0004 154678

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

Figs. 5(a) and (b), respectively. Abbreviations ending with "+" are used to represent our improved versions of these algorithms. That is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms for the SpM dataset, with ε = 1 for the integer-valued workload arrays in the SpM dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics

and exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics widens with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges, normalized with respect to N, i.e., Σ_{p=1}^{P−1} ΔS_p / N, where ΔS_p = SH_p − SL_p + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices, as well as the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms

depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. The column εBS+ & BID refers to this scheme, and values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns

shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B_RB instead of Wtot


Table 3

Percent load imbalance values

Sparse-matrix dataset DVR dataset

Name   P    H1    H2    RB    OPT   |   Name        P    H1    H2    RB    OPT

NL 16 260 244 120 035 blunt256 16 460 120 049 034

32 502 575 344 095 32 693 261 194 112

64 895 901 560 237 64 1452 944 944 231

128 3313 2716 2278 499 128 3825 2439 1667 482

256 6955 6955 6078 1425 256 9672 3703 3421 3421

cre-d 16 227 098 053 045 blunt512 16 095 098 098 016

32 419 442 374 103 32 138 138 118 033

64 712 492 434 173 64 287 269 166 053

128 2557 1873 1670 288 128 562 845 462 097

256 3754 2681 3520 1095 256 1418 1433 934 228

CQ9 16 185 185 058 058 blunt1024 16 094 057 095 010

32 565 288 224 090 32 189 121 097 014

64 1325 1149 764 143 64 499 216 144 026

128 3396 3222 2234 351 128 1006 425 247 057

256 5862 5862 5862 1472 256 1965 968 968 098

ken-11 16 374 201 098 021 post256 16 110 143 076 056

32 374 467 374 118 32 323 398 323 111

64 1317 1317 1317 129 64 1728 1104 1090 310

128 1317 1689 1317 680 128 4535 2909 2909 829

256 5099 5099 5099 711 256 6786 6786 6786 6786

mod2 16 006 006 006 003 post512 16 049 125 033 033

32 019 019 019 007 32 094 161 090 058

64 742 272 218 018 64 485 533 485 094

128 1615 629 246 041 128 1803 1455 1015 172

256 1947 1947 1892 123 256 3003 2529 2529 373

world 16 027 018 009 004 post1024 16 070 070 053 020

32 103 037 027 008 32 149 149 141 054

64 473 473 473 028 64 285 285 149 091

128 637 637 637 076 128 1315 1179 910 111

256 2799 2741 2741 111 256 4050 1337 1419 254

Averages over P

16 176 115 074 036 16 144 101 104 028

32 399 339 288 076 32 264 205 161 064

64 901 678 569 143 64 789 558 496 131

128 1997 1766 1409 336 128 2174 1542 1202 291

256 4389 3891 3604 918 256 4482 2793 2676 1860


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only a minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP

algorithms on the SpM and DVR datasets, respectively.

In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in ms) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification so that improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as


Table 4

Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset DVR dataset

Name   P    (N − P)P   L1    L3    C5   |   Name        P    (N − P)P   L1    L3    C5

NL 16 1597 032 013 0036 blunt256 16 1599 038 007 0003

32 3186 123 058 0198 32 3194 122 030 0144

64 6343 497 224 0964 64 6377 510 381 0791

128 12569 1981 1960 4305 128 12706 2166 1288 2784

256 24673 7796 8283 22942 256 25223 10168 5048 10098

cre-d 16 1597 016 004 0019 blunt512 16 1600 009 017 0006

32 3189 076 090 0137 32 3199 035 026 0043

64 6355 289 185 0610 64 6396 164 069 0144

128 12618 1159 1512 2833 128 12783 665 459 0571

256 24869 4666 5754 16510 256 25530 2805 1671 2262

CQ9 16 1597 030 003 0023 blunt1024 16 1600 005 009 0010

32 3189 121 034 0187 32 3200 018 018 0018

64 6357 484 296 0737 64 6399 077 074 0068

128 12625 1920 1866 3121 128 12796 343 253 0300

256 24896 7484 8680 23432 256 25582 1392 2036 1648

ken-11 16 1598 028 011 0024 post256 16 1599 035 004 0021

32 3193 110 113 0262 32 3195 150 052 0162

64 6373 442 734 0655 64 6379 612 426 0932

128 12689 1760 1456 4865 128 12717 2648 2368 4279

256 25156 6918 8544 13076 256 25268 12476 9048 16485

mod2 16 1599 016 000 0002 post512 16 1600 013 002 0003

32 3197 055 004 0013 32 3199 056 014 0063

64 6388 214 122 0109 64 6397 233 241 0413

128 12753 862 259 0411 128 12788 933 1015 1714

256 25412 3449 3902 2938 256 25552 3708 4628 6515

world 16 1599 015 001 0004 post1024 16 1600 017 007 0020

32 3197 059 006 0020 32 3200 054 027 0087

64 6388 230 281 0158 64 6399 219 061 0210

128 12753 923 719 0850 128 12797 877 948 0918

256 25411 3697 5315 2635 256 25588 3497 2728 4479

Averages over P

16 1597 021 006 0018 16 1600 020 008 0011

32 3188 086 068 0136 32 3198 073 028 0096

64 6352 343 263 0516 64 6391 303 209 0426

128 12607 1368 1274 2731 128 12765 1272 1055 1428

256 24826 5428 5755 14089 256 25457 5674 4193 6915

L1, L3, and C5 denote the separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5.


an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster

than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the

existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for

both the SpM and DVR datasets, and that εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are

significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5

Number of probes performed by parametric search algorithms

Sparse-matrix dataset DVR dataset

CCP instance Parametric search algorithms CCP instance Parametric search algorithms

Name   P    NC-   NC    NC+   εBS   εBS+   EBS   |   Name        P    NC-   NC    NC+   εBS+ & BID   EBS

NL 16 177 21 7 17 6 7 blunt256 16 202 18 6 16 + 1 9

32 352 19 7 17 7 7 32 407 19 7 19 + 1 10

64 720 27 5 16 7 6 64 835 18 10 21 + 1 12

128 1450 40 14 17 8 8 128 1686 25 15 22 + 1 13

256 2824 51 14 16 8 8 256 3556 203 62 23 + 1 11

cre-d 16 190 20 6 19 7 5 blunt512 16 245 20 8 17 + 1 9

32 393 25 11 19 9 8 32 502 24 13 18 + 1 10

64 790 24 10 19 8 6 64 1003 23 11 18 + 1 12

128 1579 27 10 19 9 9 128 2023 26 12 20 + 1 13

256 3183 37 13 18 9 9 256 4051 23 12 21 + 1 13

CQ9 16 182 19 3 18 6 5 blunt1024 16 271 21 10 16 + 1 13

32 373 22 8 18 8 7 32 557 24 12 18 + 1 12

64 740 29 12 18 8 8 64 1127 21 11 18 + 1 12

128 1492 40 12 17 8 8 128 2276 28 13 19 + 1 13

256 2971 50 14 18 9 9 256 4564 27 16 21 + 1 15

ken-11 16 185 21 6 17 6 6 post256 16 194 22 9 18 + 1 7

32 364 20 6 17 6 6 32 409 19 8 20 + 1 12

64 721 48 9 17 7 7 64 823 17 11 22 + 1 12

128 1402 61 8 16 7 7 128 1653 19 13 22 + 1 14

256 2783 96 7 17 8 8 256 3345 191 102 23 + 1 13

mod2 16 210 20 6 20 5 4 post512 16 235 24 9 16 + 1 10

32 432 26 8 19 5 5 32 485 23 12 18 + 1 11

64 867 24 6 19 8 8 64 975 22 13 20 + 1 14

128 1727 38 7 20 7 7 128 1975 25 17 21 + 1 14

256 3444 47 9 20 9 9 256 3947 29 20 22 + 1 15

world 16 211 22 6 19 5 4 post1024 16 261 27 10 17 + 1 13

32 424 26 5 19 6 6 32 538 23 13 19 + 1 14

64 865 30 12 19 9 9 64 1090 22 12 17 + 1 14

128 1730 44 10 19 8 8 128 2201 25 15 21 + 1 16

256 3441 44 11 19 10 10 256 4428 28 16 22 + 1 16

Averages over P

16 193 205 57 185 58 52 16 235 220 87 167+1 102

32 390 230 75 183 68 65 32 483 220 108 187+1 115

64 784 303 90 180 78 73 64 976 205 113 193+1 127

128 1564 417 102 180 78 78 128 1969 247 142 208+1 138

256 3108 542 113 180 88 88 256 3982 835 380 220+1 138


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values

seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement

ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6

Partitioning times for sparse-matrix dataset as percents of SpMxV times

CCP instance Heuristics Exact algorithms

Name P H1 H2 RB Existing Proposed

DP    MS    εBS    NC    DP+    MS+    εBS+    NC+    BID

NL 16 009 009 009 93 119 093 120 040 044 040 040 022

32 018 018 013 177 194 177 217 173 173 080 080 044

64 035 035 031 367 307 315 585 581 412 186 177 182

128 089 084 071 748 485 630 1712 1987 2692 426 732 421

256 151 155 137 1461 757 1109 4031 9180 9641 980 1570 2106

cre-d 16 003 001 001 33 36 033 037 012 006 010 011 003

32 004 006 006 71 61 065 093 047 120 029 033 010

64 012 012 008 147 98 122 169 166 183 061 071 021

128 028 029 014 283 150 241 359 547 1094 168 168 061

256 053 054 025 571 221 420 1022 2705 2602 330 446 120

CQ9 16 004 004 004 59 73 048 059 020 013 015 013 017

32 009 009 007 127 120 094 131 094 072 041 041 028

64 020 017 015 248 195 185 318 301 447 100 144 068

128 044 044 031 485 303 318 852 965 1882 244 305 251

256 092 092 063 982 469 651 2033 6956 722 532 745 1492

ken-11 16 010 010 010 238 344 127 188 056 061 046 041 025

32 020 020 015 501 522 229 310 382 478 122 117 163

64 046 046 036 993 778 448 1308 941 2478 310 346 229

128 117 117 071 1876 1139 921 3959 5013 5003 641 662 174

256 214 214 142 3827 1706 1776 11913 12901 24148 1491 1450 2911

mod2 16 002 002 001 92 119 027 034 007 005 006 006 003

32 004 004 002 188 192 051 078 019 018 016 015 007

64 011 010 006 378 307 104 191 086 218 051 055 023

128 024 023 011 751 482 228 592 261 372 106 123 053

256 048 048 023 1486 756 426 1114 1211 5004 291 343 366

world 16 002 002 002 99 122 029 033 008 005 007 008 003

32 004 004 004 198 196 059 088 026 021 018 019 008

64 012 011 009 385 306 116 170 109 484 064 054 035

128 024 023 019 769 496 219 532 448 1015 131 123 177

256 047 047 036 1513 764 406 1183 1154 6999 346 350 248

Averages over P

16 005 005 005 102 136 060 078 024 022 021 020 012

32 010 010 008 210 214 112 153 123 147 051 051 043

64 023 022 018 420 332 215 457 364 704 129 141 093

128 054 053 036 819 509 426 1334 1537 201 286 352 451

256 101 101 071 1640 779 798 3549 5684 9269 662 817 1207


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of

the proposed exact CCP algorithms shows that BID is

the clear winner for small-to-moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7

Partitioning times (in msecs) for DVR dataset

CCP instance        Heuristics        Exact algorithms

Name   P    Prefix-sum   H1    H2    RB     Existing: DP   MS   NC     Proposed: DP+   MS+   NC+   EBS   BID

blunt256 16 002 002 003 68 49 036 021 053 010 014 003

32 005 005 005 141 77 077 076 076 021 027 024

64 195 011 012 009 286 134 139 386 872 064 071 078

128 024 025 020 581 206 366 1237 2995 178 184 233

256 047 050 037 1139 296 5253 4289 078 1396 311 5669

blunt512 16 003 003 003 356 200 055 025 091 014 016 009

32 007 007 007 792 353 131 123 213 044 043 022

64 1345 018 019 014 1688 593 265 368 928 094 110 045

128 039 040 028 3469 979 584 1501 4672 229 263 165

256 074 075 054 7040 1637 970 5750 16449 559 595 1072

blunt1024 16 003 003 003 1455 780 075 093 477 025 028 019

32 012 011 008 3251 1432 181 202 892 060 062 031

64 5912 027 027 019 6976 2353 350 735 2228 127 137 139

128 050 051 037 14 337 3911 914 3301 6938 336 356 1621

256 097 100 068 29 417 6567 1659 18009 53810 848 898 1629

post256 16 002 003 003 79 61 046 018 013 010 011 024

32 005 005 005 157 100 077 105 174 026 031 076

64 216 010 011 010 336 167 137 526 972 061 070 467

128 025 026 018 672 278 290 2294 5510 165 177 2396

256 047 048 037 1371 338 4516 8478 077 1304 362 29932

post512 16 002 002 003 612 454 063 020 044 014 016 121

32 007 007 007 1254 699 130 249 195 038 042 336

64 2055 018 018 013 2559 1139 258 1693 3873 097 107 853

128 038 039 028 5221 1936 584 6825 15615 227 244 3154

256 070 073 053 10 683 3354 1267 25648 72654 544 539 14088

post1024 16 003 004 003 2446 1863 091 288 573 017 021 263

32 009 008 008 5056 2917 161 1357 1737 051 058 1273

64 6909 025 026 017 10 340 4626 356 3505 3325 135 147 3253

128 051 053 036 21 102 7838 790 15634 54692 295 328 7696

256 095 098 067 43 652 13770 1630 76459 145187 755 702 30070

Averages normalized with respect to RB times (prefix-sum time not included)

16 083 094 100 27 863 18919 2033 2583 6950 500 589 2439

32 110 106 100 23 169 12156 1847 4737 7282 583 646 3902

64 130 135 100 22 636 9291 1788 8281 14519 700 781 5381

128 135 139 100 22 506 7553 2046 16836 48119 860 931 8681

256 135 139 100 24 731 6880 5910 39025 77299 1955 1051 28677

Averages normalized with respect to RB times (prefix-sum time included)

16 100 100 100 32 23 108 104 109 101 102 103

32 100 100 100 66 36 115 121 129 104 105 113

64 100 100 100 135 58 127 197 290 111 112 155

128 101 101 100 275 94 162 475 1053 128 130 325

256 102 102 100 528 142 798 1464 1371 295 155 2673


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower

than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.
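The probe machinery behind these results is small. The sketch below is our own illustration (Python here, whereas the paper's implementation is in C, and this is plain bisection rather than the paper's tuned εBS+/EBS variants): a probe checks a candidate bottleneck value B by greedily placing separators via binary search on the prefix-summed workload array, and bisection over integer bottleneck values is exact for integer weights (the ε = 1 case used for the SpM dataset). It also computes the paper's quality metric 100(B − B*)/B*.

```python
from bisect import bisect_right

def probe(prefix, P, B):
    """TRUE iff the chain splits into at most P consecutive parts, each of weight <= B.
    Each separator is found by binary search on prefix sums, so a probe costs O(P log N)."""
    n = len(prefix) - 1
    pos = 0
    for _ in range(P):
        # furthest index j with subchain weight prefix[j] - prefix[pos] <= B
        pos = bisect_right(prefix, prefix[pos] + B) - 1
        if pos == n:
            return True
    return False

def optimal_bottleneck(weights, P):
    """Bisection on the bottleneck value; exact for integer weights (unit resolution).
    Assumes len(weights) >= P and positive weights."""
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    wtot = prefix[-1]
    lo = max(max(weights), -(-wtot // P)) - 1  # largest known-infeasible value
    hi = wtot                                  # trivially feasible value
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(prefix, P, mid):
            hi = mid
        else:
            lo = mid
    return hi

def percent_imbalance(B, wtot, P):
    """Percent load imbalance: 100 * (B - B*) / B* with B* = Wtot / P."""
    b_star = wtot / P
    return 100.0 * (B - b_star) / b_star
```

For real-valued workloads such as the DVR arrays, this plain bisection only approximates the optimum; making it exact requires restricting the bounds to realizable subchain weights after each probe, as the EBS algorithm of the paper does.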


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that the exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.

[2] C. Aykanat, F. Özgüner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.

[3] C. Aykanat, F. Özgüner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.

[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.

[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.

[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.

[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.

[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.

[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.

[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.

[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.

[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.

[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.

[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.

[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.

[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.

[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.

[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.

[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.

[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.

[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.

[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.

[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.

[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.

[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.

[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.

[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.

[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.

[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.

[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.

[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.

[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.

[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.

[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.

[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high-performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high-performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

the algorithm is

    T(N) = (1 / (N/2)) * Σ_{i=1}^{N/2} T(i) + O(P log P + P log(wmax/wavg)),

which has the solution O(P log P log N + P log N log(wmax/wavg)). Here, O(P log P + P log(wmax/wavg)) is the cost of a probe operation, and log N is the expected number of probes. Thus, the overall complexity becomes O(N + P log P log N + P log N log(wmax/wavg)), where the O(N) cost comes from the initial prefix-sum operation on the W array.
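For reference, the conventional probe operation discussed above can be sketched as follows. This is a minimal version that performs one binary search per processor on the prefix-summed workload array; the paper's restricted probe (RPROBE) additionally confines each search to the dynamic separator-index bounds, which this sketch omits. All names here are illustrative, not the paper's code.

```c
#include <stddef.h>

/* PROBE(B): can the chain be split into at most P consecutive parts,
 * each of weight <= B?  Wp is the prefix-summed workload array with
 * Wp[0] = 0 and Wp[N] = Wtot.  Runs in O(P log N). */
int probe(const double *Wp, size_t N, int P, double B)
{
    size_t lo = 0;                      /* position of the last separator */
    for (int p = 0; p < P && lo < N; p++) {
        double target = Wp[lo] + B;
        size_t l = lo, h = N;
        while (l < h) {                 /* largest j with Wp[j] <= target */
            size_t m = (l + h + 1) / 2;
            if (Wp[m] <= target) l = m; else h = m - 1;
        }
        if (l == lo) return 0;          /* a single task alone exceeds B */
        lo = l;
    }
    return lo == N;                     /* all tasks assigned within P parts */
}
```

PROBE(B) is monotone in B, which is what makes both the bisection and the parametric-search algorithms possible: once a value succeeds, every larger value succeeds.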

4.4.3. Improving Nicol's algorithm as a divide-and-conquer algorithm

Theoretical findings of the previous sections can be exploited at different levels to improve the performance of Nicol's algorithm. A trivial improvement is to use the proposed restricted probe function instead of the conventional one. The careful implementation scheme given in Fig. 5(b) enables use of dynamic separator-index bounding. Here, we exploit the bisection idea to design an efficient divide-and-conquer approach based on Nicol's algorithm.

Consider the sequence of probes of the form PROBE(W_{1,j}) performed by Nicol's algorithm for processor P_1 to find the smallest index j = i_1 such that PROBE(W_{1,j}) = TRUE. Starting from a naive bottleneck-value range (LB_0 = 0, UB_0 = Wtot), success and failure of these probes can narrow this range to (LB_1, UB_1). That is, each PROBE(W_{1,j}) = TRUE decreases the upper bound to W_{1,j}, and each PROBE(W_{1,j}) = FALSE increases the lower bound to W_{1,j}. Clearly, we will have (LB_1 = W_{1,i_1−1}, UB_1 = W_{1,i_1}) at the end of this search for processor P_1. Now consider the sequence of probes of the form PROBE(W_{i_1,j}) performed for processor P_2 to find the smallest index j = i_2 such that PROBE(W_{i_1,j}) = TRUE. Our key observation is that the partition Π = ⟨0, t_1, t_2, …, t_{P−1}, N⟩ to be constructed by any PROBE(B_t) with LB_1 < B_t = W_{i_1,j} < UB_1 will satisfy t_1 = i_1 − 1, since W_{1,i_1−1} < B_t < W_{1,i_1}. Hence, probes with LB_1 < B_t < UB_1 for processor P_2 can be restricted to be performed in W^{i_1,N}, where W^{i_1,N} denotes the (N − i_1 + 1)th suffix of the prefix-summed W array. This simple yet effective scheme leads to an efficient divide-and-conquer algorithm as follows. Let T^{P−p}_i denote the CCP subproblem of (P − p)-way partitioning of the (N − i + 1)th suffix T^{i,N} = ⟨t_i, t_{i+1}, …, t_N⟩ of the task chain T onto the (P − p)th suffix P^{p,P} = ⟨P_{p+1}, P_{p+2}, …, P_P⟩ of the processor chain P. Once the index i_1 for processor P_1 is computed, the optimal bottleneck value B_opt can be defined by either W_{1,i_1} or the bottleneck value of an optimal (P − 1)-way partitioning of the suffix subchain T^{i_1,N}. That is, B_opt = min{B_1 = W_{1,i_1}, C(T^{P−1}_{i_1})}. Proceeding in this way, once the indices ⟨i_1, i_2, …, i_p⟩ for the first p processors ⟨P_1, P_2, …, P_p⟩ are determined, B_opt = min{min_{1≤b≤p}{B_b = W_{i_{b−1},i_b}}, C(T^{P−p}_{i_p})}.

This approach is presented in Fig. 10. At the bth iteration of the outer for loop, given i_{b−1}, i_b is found in the inner while loop by conducting probes on W^{i_{b−1},N} to compute B_b = W_{i_{b−1},i_b}. The dynamic bounds on the separator indices are exploited in two distinct ways according to Theorem 4.7. First, the restricted probe function RPROBE is used for probing. Second, the search spaces for the bottleneck values of the processors are restricted. That is, given i_{b−1}, the binary search for i_b over all subchain weights of the form W_{i_{b−1}+1,j} for i_{b−1} < j ≤ N is restricted to the W_{i_{b−1}+1,j} values for SL_b ≤ j ≤ SH_b.

Under model M, the complexity of this algorithm is O(N + P log N + wmax(P log P + P log(wmax/wavg))) for integer task weights, because the number of probes cannot exceed wmax, since there are at most wmax distinct bound values in the range [B*, B_RB]. For noninteger task weights, the complexity can be given as O(N + P log N + wmax(P log P)^2 + wmax P^2 log P log(wmax/wavg)), since the algorithm makes O(wmax P log P) probes. Here, the O(N) cost comes from the initial prefix-sum operation on W, and the O(P log N) cost comes from the running time of the RB heuristic and from computing the separator-index bounds SL and SH according to Corollary 4.5.
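The divide-and-conquer recursion described above can be sketched as follows. This is a compact recursive version built on a plain suffix probe; the algorithm of Fig. 10 additionally applies RPROBE, the SL/SH index bounds, and dynamic bottleneck-value bounding, all of which are omitted here. Names are illustrative.

```c
#include <stddef.h>

/* Probe restricted to the suffix of tasks after position i: can tasks
 * (i..N] be split into at most P parts of weight <= B?  Wp is the
 * prefix-summed workload array (Wp[0] = 0, Wp[N] = Wtot). */
static int probe_suffix(const double *Wp, size_t N, size_t i, int P, double B)
{
    size_t lo = i;
    for (int p = 0; p < P && lo < N; p++) {
        double target = Wp[lo] + B;
        size_t l = lo, h = N;
        while (l < h) {                     /* largest j with Wp[j] <= target */
            size_t m = (l + h + 1) / 2;
            if (Wp[m] <= target) l = m; else h = m - 1;
        }
        if (l == lo) return 0;              /* one task alone exceeds B */
        lo = l;
    }
    return lo == N;
}

/* Divide and conquer on Nicol's search: find the smallest l such that the
 * probe with candidate bottleneck Wp[l] - Wp[i] succeeds; the first part
 * then ends either at task l (giving B_1) or at task l - 1, in which case
 * we recurse on the remaining suffix with one less processor:
 *     B_opt = min{ B_1, C(suffix) }. */
double nicol_dc(const double *Wp, size_t N, size_t i, int P)
{
    if (P == 1 || N - i <= 1)
        return Wp[N] - Wp[i];               /* one part (or one task) left */
    size_t l = i + 1, h = N;
    while (l < h) {                         /* smallest l with a TRUE probe */
        size_t m = (l + h) / 2;
        if (probe_suffix(Wp, N, i, P, Wp[m] - Wp[i])) h = m; else l = m + 1;
    }
    double B1 = Wp[l] - Wp[i];
    double Bs = nicol_dc(Wp, N, l - 1, P - 1);
    return B1 < Bs ? B1 : Bs;
}
```

For the toy chain with task weights ⟨2, 1, 3, 2⟩ and P = 2, the search settles on the partition ⟨2, 1⟩ | ⟨3, 2⟩ with bottleneck 5; note how each recursive call probes only the remaining suffix, which is exactly the restriction argued for in the text.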

5. Load balancing applications

5.1. Parallel sparse matrix–vector multiplication

Sparse matrix–vector multiplication (SpMxV) is one of the most important kernels in scientific computing. Parallelization of repeated SpMxV computations requires partitioning and distribution of the matrix. Two possible 1D sparse-matrix partitioning schemes are rowwise striping (RS) and columnwise striping (CS). Consider parallelization of SpMxV operations of the form y = Ax in an iterative solver, where A is an N × N sparse matrix and y and x are N-vectors. In RS, processor P_p owns the pth row stripe A^r_p of A and is responsible for computing y_p = A^r_p x, where y_p is the pth stripe of vector y. In CS, processor P_p owns the pth column stripe A^c_p of A and is responsible for computing y^p = A^c_p x, where y = Σ_{p=1}^{P} y^p. All vectors used in the solver are divided conformably with the row (column) partitioning in the RS (CS) scheme to avoid unnecessary communication during linear vector operations. The RS and CS schemes require communication before or after the local SpMxV computations; thus, they can also be considered as pre- and post-communication schemes, respectively. In RS, each task t_i ∈ T corresponds to the atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task t_i ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + x_i a_{*i}, where a_{*i} is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load w_i of task t_i is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load balancing problem can be described as the number partitioning problem, which is NP-hard [9]. By allowing both row and column reordering, the problem of minimizing communication overhead while maintaining load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements due to the irregular mapping table and an extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage (CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = Wtot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N + 1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight W_{i,j} can be efficiently computed as W_{i,j} = ROW[j + 1] − ROW[i] in O(1) time without any preprocessing overhead. CCS is similar to CRS with rows and columns interchanged; thus, W_{i,j} is computed as W_{i,j} = COL[j + 1] − COL[i].
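On a hypothetical 4-row matrix in CRS form (0-based row pointers), the O(1) subchain-weight computation above is just a pointer difference:

```c
/* Toy 4-row matrix in CRS form: ROW[i] is the index in DATA/COL where
 * row i starts (0-based), so row i holds ROW[i+1] - ROW[i] nonzeros.
 * The four rows here have 2, 1, 3, and 2 nonzeros, respectively. */
static const int ROW[5] = {0, 2, 3, 6, 8};

/* Weight of the row subchain i..j (number of nonzeros in rows i..j),
 * computed in O(1) with no prefix-sum preprocessing:
 *     W(i, j) = ROW[j + 1] - ROW[i] */
static int subchain_weight(const int *row_ptr, int i, int j)
{
    return row_ptr[j + 1] - row_ptr[i];
}
```

The same one-liner with the COL pointer array gives the CCS version for columnwise striping.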

5.1.1. A better load balancing model for iterative solvers

The load balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, the linear vector operations (i.e., DAXPY and inner-product computations) involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating the vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this problem for a coarse-grain formulation [23,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight w_i of the atomic task t_i representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight W_{i,j} as W_{i,j} = COL[j + 1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.
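Under the same assumptions as before (0-based pointers, hypothetical toy arrays), the augmented CS weight simply adds the DAXPY term:

```c
/* CCS column pointers for a toy 4-column matrix with 1, 2, 2, and 3
 * nonzeros per column, respectively. */
static const int COLPTR[5] = {0, 1, 3, 5, 8};

/* Subchain weight for the CS scheme of a CG solver: nonzeros in columns
 * i..j plus a cost of 3 per column for the three DAXPY operations,
 *     W(i, j) = COL[j + 1] - COL[i] + 3(j - i + 1),
 * still computed in O(1) time. */
static int cg_subchain_weight(const int *col_ptr, int i, int j)
{
    return col_ptr[j + 1] - col_ptr[i] + 3 * (j - i + 1);
}
```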

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels, from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage of processors generating complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first parallel DVR can be classified as static and adaptive [20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those of counting shared primitives multiple times while adding mesh-cell weights.
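A minimal sketch of the inverse-area tally follows; cell indexing and the per-primitive intersection lists are assumed to be computed elsewhere, and all names are illustrative.

```c
/* Inverse-area heuristic: a primitive whose bounding box overlaps k
 * coarse-mesh cells adds 1/k to each of those k cell weights, so the
 * total weight contributed by each primitive is exactly 1. */
static void tally_primitive(double *cell_weight, const int *cells, int k)
{
    for (int n = 0; n < k; n++)
        cell_weight[cells[n]] += 1.0 / (double)k;
}
```

Summing the resulting cell weights over a region therefore recovers the region's primitive count exactly when no primitive is shared between regions, which is the property the text relies on.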

the decomposition is expected to minimize the commu-nication overhead due to the shared primitives 1Ddecomposition ie horizontal or vertical striping of thescreen suffers from unscalability A Hilbert space-fillingcurve [31] is widely used for 2D decomposition of 2Dnonuniform workloads In this scheme [20] the 2Dcoarse mesh superimposed on the screen is traversedaccording to the Hilbert curve to map the 2D coarsemesh to a 1D chain of mesh cells The load-balancingproblem in this decomposition scheme then reduces tothe CCP problem Using a Hilbert curve as the space-

filling curve is an implicit effort towards reducing thetotal perimeter since the Hilbert curve avoids jumpsduring the traversal of the 2D coarse mesh Note thatthe 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic usedfor computing weights of the coarse-mesh cells
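One common bit-manipulation formulation of the cell-to-curve mapping is sketched below; the particular orientation of the curve produced is immaterial here, since only the traversal order of the mesh cells matters for forming the 1D task chain.

```c
/* Map cell (x, y) of an n-by-n coarse mesh (n a power of two) to its
 * position along a Hilbert curve.  Sorting the cells by this index
 * yields the 1D chain of mesh cells fed to the CCP algorithms. */
unsigned long hilbert_index(unsigned long n, unsigned long x, unsigned long y)
{
    unsigned long d = 0;
    for (unsigned long s = n / 2; s > 0; s /= 2) {
        unsigned long rx = (x & s) ? 1UL : 0UL;
        unsigned long ry = (y & s) ? 1UL : 0UL;
        d += s * s * ((3UL * rx) ^ ry);    /* quadrant's offset on the curve */
        x &= s - 1;                        /* drop the bit just consumed */
        y &= s - 1;
        if (ry == 0) {                     /* rotate/reflect the subsquare */
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            unsigned long t = x; x = y; y = t;
        }
    }
    return d;
}
```

On a 4 × 4 mesh this visits the cells in a jump-free order (consecutive indices are always grid neighbors), which is the locality property the text exploits to keep region perimeters small.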

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. The triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 × 256, 512 × 512, and 1024 × 1024 superimposed on a screen of resolution 1024 × 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because of the zero-weight tasks, which can be compressed to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP algorithms. H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC− and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in

Table 2
Properties of the test set

Sparse-matrix dataset
Name     No. of tasks N   Workload (no. of nonzeros)           SpMxV ex. time (ms)
                          Total Wtot   Per row/col (task)
                                       wavg    wmin   wmax
NL             7,039        105,089    14.93     1     361           22.55
cre-d          8,926        372,266    41.71     1     845           72.20
CQ9            9,278        221,590    23.88     1     702           45.90
ken-11        14,694         82,454     5.61     2     243           19.65
mod2          34,774        604,910    17.40     1     941          124.05
world         34,506        582,064    16.87     1     972          119.45

DVR dataset
Name       No. of tasks N   Workload
                            Total Wtot   Per task
                                         wavg    wmin    wmax
blunt256        17,303        303 K      17.54   0.020   1590.97
blunt512        93,231        314 K       3.36   0.004    661.50
blunt1024      372,824        352 K       0.94   0.001    411.04
post256         19,653        495 K      25.19   0.077   3245.50
post512        134,950        569 K       4.22   0.015   1092.00
post1024       539,994        802 K       1.49   0.004   1546.78

Figs. 5(a) and (b), respectively. Abbreviations ending with "+" are used to represent our improved versions of these algorithms. That is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms with ε = 1 for the integer-valued workload arrays in the SpM dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of the heuristics and the exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between the exact algorithms and the heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e., Σ_{p=1}^{P−1} ΔS_p / N, where ΔS_p = SH_p − SL_p + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for the separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. Column εBS+&BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B_RB instead of Wtot

Table 3
Percent load imbalance values

Sparse-matrix dataset                          DVR dataset
Name    P     H1     H2     RB     OPT         Name       P     H1     H2     RB     OPT
NL      16    2.60   2.44   1.20   0.35        blunt256   16    4.60   1.20   0.49   0.34
        32    5.02   5.75   3.44   0.95                   32    6.93   2.61   1.94   1.12
        64    8.95   9.01   5.60   2.37                   64   14.52   9.44   9.44   2.31
       128   33.13  27.16  22.78   4.99                  128   38.25  24.39  16.67   4.82
       256   69.55  69.55  60.78  14.25                  256   96.72  37.03  34.21  34.21
cre-d   16    2.27   0.98   0.53   0.45        blunt512   16    0.95   0.98   0.98   0.16
        32    4.19   4.42   3.74   1.03                   32    1.38   1.38   1.18   0.33
        64    7.12   4.92   4.34   1.73                   64    2.87   2.69   1.66   0.53
       128   25.57  18.73  16.70   2.88                  128    5.62   8.45   4.62   0.97
       256   37.54  26.81  35.20  10.95                  256   14.18  14.33   9.34   2.28
CQ9     16    1.85   1.85   0.58   0.58        blunt1024  16    0.94   0.57   0.95   0.10
        32    5.65   2.88   2.24   0.90                   32    1.89   1.21   0.97   0.14
        64   13.25  11.49   7.64   1.43                   64    4.99   2.16   1.44   0.26
       128   33.96  32.22  22.34   3.51                  128   10.06   4.25   2.47   0.57
       256   58.62  58.62  58.62  14.72                  256   19.65   9.68   9.68   0.98
ken-11  16    3.74   2.01   0.98   0.21        post256    16    1.10   1.43   0.76   0.56
        32    3.74   4.67   3.74   1.18                   32    3.23   3.98   3.23   1.11
        64   13.17  13.17  13.17   1.29                   64   17.28  11.04  10.90   3.10
       128   13.17  16.89  13.17   6.80                  128   45.35  29.09  29.09   8.29
       256   50.99  50.99  50.99   7.11                  256   67.86  67.86  67.86  67.86
mod2    16    0.06   0.06   0.06   0.03        post512    16    0.49   1.25   0.33   0.33
        32    0.19   0.19   0.19   0.07                   32    0.94   1.61   0.90   0.58
        64    7.42   2.72   2.18   0.18                   64    4.85   5.33   4.85   0.94
       128   16.15   6.29   2.46   0.41                  128   18.03  14.55  10.15   1.72
       256   19.47  19.47  18.92   1.23                  256   30.03  25.29  25.29   3.73
world   16    0.27   0.18   0.09   0.04        post1024   16    0.70   0.70   0.53   0.20
        32    1.03   0.37   0.27   0.08                   32    1.49   1.49   1.41   0.54
        64    4.73   4.73   4.73   0.28                   64    2.85   2.85   1.49   0.91
       128    6.37   6.37   6.37   0.76                  128   13.15  11.79   9.10   1.11
       256   27.99  27.41  27.41   1.11                  256   40.50  13.37  14.19   2.54

Averages over P
        16    1.76   1.15   0.74   0.36                   16    1.44   1.01   1.04   0.28
        32    3.99   3.39   2.88   0.76                   32    2.64   2.05   1.61   0.64
        64    9.01   6.78   5.69   1.43                   64    7.89   5.58   4.96   1.31
       128   19.97  17.66  14.09   3.36                  128   21.74  15.42  12.02   2.91
       256   43.89  38.91  36.04   9.18                  256   44.82  27.93  26.76  18.60

for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only a minor improvement on the SpM dataset, because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in milliseconds) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification, so that the improvements can be seen easily. In both tables, BID is listed separately, since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as

Table 4
Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset                                 DVR dataset
Name    P    (N−P)P     L1     L3      C5             Name       P    (N−P)P      L1      L3      C5
NL      16    15.97    0.32   0.13    0.036           blunt256   16    15.99    0.38    0.07    0.003
        32    31.86    1.23   0.58    0.198                      32    31.94    1.22    0.30    0.144
        64    63.43    4.97   2.24    0.964                      64    63.77    5.10    3.81    0.791
       128   125.69   19.81  19.60    4.305                     128   127.06   21.66   12.88    2.784
       256   246.73   77.96  82.83   22.942                     256   252.23  101.68   50.48   10.098
cre-d   16    15.97    0.16   0.04    0.019           blunt512   16    16.00    0.09    0.17    0.006
        32    31.89    0.76   0.90    0.137                      32    31.99    0.35    0.26    0.043
        64    63.55    2.89   1.85    0.610                      64    63.96    1.64    0.69    0.144
       128   126.18   11.59  15.12    2.833                     128   127.83    6.65    4.59    0.571
       256   248.69   46.66  57.54   16.510                     256   255.30   28.05   16.71    2.262
CQ9     16    15.97    0.30   0.03    0.023           blunt1024  16    16.00    0.05    0.09    0.010
        32    31.89    1.21   0.34    0.187                      32    32.00    0.18    0.18    0.018
        64    63.57    4.84   2.96    0.737                      64    63.99    0.77    0.74    0.068
       128   126.25   19.20  18.66    3.121                     128   127.96    3.43    2.53    0.300
       256   248.96   74.84  86.80   23.432                     256   255.82   13.92   20.36    1.648
ken-11  16    15.98    0.28   0.11    0.024           post256    16    15.99    0.35    0.04    0.021
        32    31.93    1.10   1.13    0.262                      32    31.95    1.50    0.52    0.162
        64    63.73    4.42   7.34    0.655                      64    63.79    6.12    4.26    0.932
       128   126.89   17.60  14.56    4.865                     128   127.17   26.48   23.68    4.279
       256   251.56   69.18  85.44   13.076                     256   252.68  124.76   90.48   16.485
mod2    16    15.99    0.16   0.00    0.002           post512    16    16.00    0.13    0.02    0.003
        32    31.97    0.55   0.04    0.013                      32    31.99    0.56    0.14    0.063
        64    63.88    2.14   1.22    0.109                      64    63.97    2.33    2.41    0.413
       128   127.53    8.62   2.59    0.411                     128   127.88    9.33   10.15    1.714
       256   254.12   34.49  39.02    2.938                     256   255.52   37.08   46.28    6.515
world   16    15.99    0.15   0.01    0.004           post1024   16    16.00    0.17    0.07    0.020
        32    31.97    0.59   0.06    0.020                      32    32.00    0.54    0.27    0.087
        64    63.88    2.30   2.81    0.158                      64    63.99    2.19    0.61    0.210
       128   127.53    9.23   7.19    0.850                     128   127.97    8.77    9.48    0.918
       256   254.11   36.97  53.15    2.635                     256   255.88   34.97   27.28    4.479

Averages over P
        16    15.97    0.21   0.06    0.018                      16    16.00    0.20    0.08    0.011
        32    31.88    0.86   0.68    0.136                      32    31.98    0.73    0.28    0.096
        64    63.52    3.43   2.63    0.516                      64    63.91    3.03    2.09    0.426
       128   126.07   13.68  12.74    2.731                     128   127.65   12.72   10.55    1.428
       256   248.26   54.28  57.55   14.089                     256   254.57   56.74   41.93    6.915

L1, L3, and C5 denote the separator-index ranges obtained according to Lemmas 4.1 and 4.3 and Corollary 4.5, respectively.

an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that

Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset                                  DVR dataset
Name    P     NC−    NC    NC+   εBS   εBS+   EBS     Name       P     NC−    NC    NC+   εBS+&BID   EBS
NL      16    177    21     7    17     6      7      blunt256   16    202    18     6    16 + 1      9
        32    352    19     7    17     7      7                 32    407    19     7    19 + 1     10
        64    720    27     5    16     7      6                 64    835    18    10    21 + 1     12
       128   1450    40    14    17     8      8                128   1686    25    15    22 + 1     13
       256   2824    51    14    16     8      8                256   3556   203    62    23 + 1     11
cre-d   16    190    20     6    19     7      5      blunt512   16    245    20     8    17 + 1      9
        32    393    25    11    19     9      8                 32    502    24    13    18 + 1     10
        64    790    24    10    19     8      6                 64   1003    23    11    18 + 1     12
       128   1579    27    10    19     9      9                128   2023    26    12    20 + 1     13
       256   3183    37    13    18     9      9                256   4051    23    12    21 + 1     13
CQ9     16    182    19     3    18     6      5      blunt1024  16    271    21    10    16 + 1     13
        32    373    22     8    18     8      7                 32    557    24    12    18 + 1     12
        64    740    29    12    18     8      8                 64   1127    21    11    18 + 1     12
       128   1492    40    12    17     8      8                128   2276    28    13    19 + 1     13
       256   2971    50    14    18     9      9                256   4564    27    16    21 + 1     15
ken-11  16    185    21     6    17     6      6      post256    16    194    22     9    18 + 1      7
        32    364    20     6    17     6      6                 32    409    19     8    20 + 1     12
        64    721    48     9    17     7      7                 64    823    17    11    22 + 1     12
       128   1402    61     8    16     7      7                128   1653    19    13    22 + 1     14
       256   2783    96     7    17     8      8                256   3345   191   102    23 + 1     13
mod2    16    210    20     6    20     5      4      post512    16    235    24     9    16 + 1     10
        32    432    26     8    19     5      5                 32    485    23    12    18 + 1     11
        64    867    24     6    19     8      8                 64    975    22    13    20 + 1     14
       128   1727    38     7    20     7      7                128   1975    25    17    21 + 1     14
       256   3444    47     9    20     9      9                256   3947    29    20    22 + 1     15
world   16    211    22     6    19     5      4      post1024   16    261    27    10    17 + 1     13
        32    424    26     5    19     6      6                 32    538    23    13    19 + 1     14
        64    865    30    12    19     9      9                 64   1090    22    12    17 + 1     14
       128   1730    44    10    19     8      8                128   2201    25    15    21 + 1     16
       256   3441    44    11    19    10     10                256   4428    28    16    22 + 1     16

Averages over P
        16    193   20.5   5.7  18.5   5.8    5.2                16    235   22.0   8.7  16.7 + 1   10.2
        32    390   23.0   7.5  18.3   6.8    6.5                32    483   22.0  10.8  18.7 + 1   11.5
        64    784   30.3   9.0  18.0   7.8    7.3                64    976   20.5  11.3  19.3 + 1   12.7
       128   1564   41.7  10.2  18.0   7.8    7.8               128   1969   24.7  14.2  20.8 + 1   13.8
       256   3108   54.2  11.3  18.0   8.8    8.8               256   3982   83.5  38.0  22.0 + 1   13.8

DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the

ARTICLE IN PRESS

Table 6

Partitioning times for sparse-matrix dataset as percents of SpMxV times

CCP instance Heuristics Exact algorithms

Name P H1 H2 RB Existing Proposed

DP MS eBS NC DP+ MS+ eBS+ NC+ BID

NL 16 009 009 009 93 119 093 120 040 044 040 040 022

32 018 018 013 177 194 177 217 173 173 080 080 044

64 035 035 031 367 307 315 585 581 412 186 177 182

128 089 084 071 748 485 630 1712 1987 2692 426 732 421

256 151 155 137 1461 757 1109 4031 9180 9641 980 1570 2106

cre-d 16 003 001 001 33 36 033 037 012 006 010 011 003

32 004 006 006 71 61 065 093 047 120 029 033 010

64 012 012 008 147 98 122 169 166 183 061 071 021

128 028 029 014 283 150 241 359 547 1094 168 168 061

256 053 054 025 571 221 420 1022 2705 2602 330 446 120

CQ9 16 004 004 004 59 73 048 059 020 013 015 013 017

32 009 009 007 127 120 094 131 094 072 041 041 028

64 020 017 015 248 195 185 318 301 447 100 144 068

128 044 044 031 485 303 318 852 965 1882 244 305 251

256 092 092 063 982 469 651 2033 6956 722 532 745 1492

ken-11 16 010 010 010 238 344 127 188 056 061 046 041 025

32 020 020 015 501 522 229 310 382 478 122 117 163

64 046 046 036 993 778 448 1308 941 2478 310 346 229

128 117 117 071 1876 1139 921 3959 5013 5003 641 662 174

256 214 214 142 3827 1706 1776 11913 12901 24148 1491 1450 2911

mod2 16 002 002 001 92 119 027 034 007 005 006 006 003

32 004 004 002 188 192 051 078 019 018 016 015 007

64 011 010 006 378 307 104 191 086 218 051 055 023

128 024 023 011 751 482 228 592 261 372 106 123 053

256 048 048 023 1486 756 426 1114 1211 5004 291 343 366

world 16 002 002 002 99 122 029 033 008 005 007 008 003

32 004 004 004 198 196 059 088 026 021 018 019 008

64 012 011 009 385 306 116 170 109 484 064 054 035

128 024 023 019 769 496 219 532 448 1015 131 123 177

256 047 047 036 1513 764 406 1183 1154 6999 346 350 248

Averages over P

16 005 005 005 102 136 060 078 024 022 021 020 012

32 010 010 008 210 214 112 153 123 147 051 051 043

64 023 022 018 420 332 215 457 364 704 129 141 093

128 054 053 036 819 509 426 1334 1537 201 286 352 451

256 101 101 071 1640 779 798 3549 5684 9269 662 817 1207


SpM dataset, eBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than eBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and eBS+. Furthermore, the costs of initial probes with very large bottleneck values are very cheap in eBS.

In Table 6, relative performance comparison of

the proposed exact CCP algorithms shows that BID is

the clear winner for small-to-moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both eBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where eBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for the 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7

Partitioning times (in msecs) for DVR dataset

CCP instance Prefix Heuristics Exact algorithms

Name P sum H1 H2 RB Existing Proposed

DP MS NC DP+ MS+ NC+ EBS BID

blunt256 16 002 002 003 68 49 036 021 053 010 014 003

32 005 005 005 141 77 077 076 076 021 027 024

64 195 011 012 009 286 134 139 386 872 064 071 078

128 024 025 020 581 206 366 1237 2995 178 184 233

256 047 050 037 1139 296 5253 4289 078 1396 311 5669

blunt512 16 003 003 003 356 200 055 025 091 014 016 009

32 007 007 007 792 353 131 123 213 044 043 022

64 1345 018 019 014 1688 593 265 368 928 094 110 045

128 039 040 028 3469 979 584 1501 4672 229 263 165

256 074 075 054 7040 1637 970 5750 16449 559 595 1072

blunt1024 16 003 003 003 1455 780 075 093 477 025 028 019

32 012 011 008 3251 1432 181 202 892 060 062 031

64 5912 027 027 019 6976 2353 350 735 2228 127 137 139

128 050 051 037 14 337 3911 914 3301 6938 336 356 1621

256 097 100 068 29 417 6567 1659 18009 53810 848 898 1629

post256 16 002 003 003 79 61 046 018 013 010 011 024

32 005 005 005 157 100 077 105 174 026 031 076

64 216 010 011 010 336 167 137 526 972 061 070 467

128 025 026 018 672 278 290 2294 5510 165 177 2396

256 047 048 037 1371 338 4516 8478 077 1304 362 29932

post512 16 002 002 003 612 454 063 020 044 014 016 121

32 007 007 007 1254 699 130 249 195 038 042 336

64 2055 018 018 013 2559 1139 258 1693 3873 097 107 853

128 038 039 028 5221 1936 584 6825 15615 227 244 3154

256 070 073 053 10 683 3354 1267 25648 72654 544 539 14088

post1024 16 003 004 003 2446 1863 091 288 573 017 021 263

32 009 008 008 5056 2917 161 1357 1737 051 058 1273

64 6909 025 026 017 10 340 4626 356 3505 3325 135 147 3253

128 051 053 036 21 102 7838 790 15634 54692 295 328 7696

256 095 098 067 43 652 13770 1630 76459 145187 755 702 30070

Averages normalized w.r.t. RB times (prefix-sum time not included)

16 083 094 100 27 863 18919 2033 2583 6950 500 589 2439

32 110 106 100 23 169 12156 1847 4737 7282 583 646 3902

64 130 135 100 22 636 9291 1788 8281 14519 700 781 5381

128 135 139 100 22 506 7553 2046 16836 48119 860 931 8681

256 135 139 100 24 731 6880 5910 39025 77299 1955 1051 28677

Averages normalized w.r.t. RB times (prefix-sum time included)

16 100 100 100 32 23 108 104 109 101 102 103

32 100 100 100 66 36 115 121 129 104 105 113

64 100 100 100 135 58 127 197 290 111 112 155

128 101 101 100 275 94 162 475 1053 128 130 325

256 102 102 100 528 142 798 1464 1371 295 155 2673


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower

than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds into realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] U.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed a member of IFIP Working Group 10.3 (Concurrent Systems).


atomic task of computing the inner product of row i of matrix A with the column vector x. In CS, each task ti ∈ T corresponds to the atomic task of computing the sparse DAXPY operation y = y + xi a*i, where a*i is the ith column of A. Each nonzero entry in a row or column of A incurs a multiply-and-add operation; thus, the computational load wi of task ti is the number of nonzero entries in row i (column i) in the RS (CS) scheme. This defines how the load-balancing problem for rowwise and columnwise partitioning of a sparse matrix with a given ordering can be modeled as a CCP problem.

In RS (CS), by allowing only row (column) reordering, the load-balancing problem can be described as the number-partitioning problem, which is NP-hard [9]. By allowing both row and column reordering, the problem of minimizing the communication overhead while maintaining the load balance can be described as graph and hypergraph partitioning problems [5,14], which are NP-hard [10,21] as well. However, the possibly high preprocessing overhead involved in these models may not be justified in some applications. If the partitioner is to be used as part of a run-time library for a parallelizing compiler for a data-parallel programming language [33,34], row and column reordering lead to high memory requirements due to the irregular mapping table and the extra level of indirection in locating distributed data during each multiply-and-add operation [35]. Furthermore, in some applications, the natural row and column ordering of the sparse matrix may already be likely to induce small communication overhead (e.g., banded matrices).

The proposed CCP algorithms are surprisingly fast, so

that the initial prefix-sum operation dominates their execution times in sparse-matrix partitioning. In this work, we exploit the standard compressed row storage

(CRS) and compressed column storage (CCS) data structures for sparse matrices to avoid the prefix-sum operation. In CRS, an array DATA of length NZ stores the nonzeros of matrix A in row-major order, where NZ = Wtot denotes the total number of nonzeros in A. An index array COL of length NZ stores the column indices of the respective nonzeros in array DATA. Another index array ROW of length N+1 stores the starting indices of the respective rows in the other two arrays. Hence, any subchain weight Wij can be efficiently computed as Wij = ROW[j+1] − ROW[i] in O(1) time, without any preprocessing overhead. CCS is similar to CRS with rows and columns interchanged; thus, Wij is computed as Wij = COL[j+1] − COL[i].
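As a concrete illustration (a minimal Python sketch; the ROW array below is a hypothetical instance of the CRS row-pointer array described above), the O(1) subchain weight is just a difference of two pointer entries:

```python
# Sketch: O(1) subchain weights from a CRS row-pointer array.
# ROW[i] holds the index in DATA where row i starts, so the number of
# nonzeros in rows i..j (the subchain weight W_ij) is ROW[j+1] - ROW[i].

def subchain_weight(ROW, i, j):
    """Total nonzeros (computational load) of the task subchain i..j."""
    return ROW[j + 1] - ROW[i]

# A 4x4 sparse matrix with 2, 1, 3, 2 nonzeros per row:
ROW = [0, 2, 3, 6, 8]

print(subchain_weight(ROW, 0, 3))  # whole chain: 8 nonzeros
print(subchain_weight(ROW, 1, 2))  # rows 1..2: 1 + 3 = 4
```

The same one-liner works for CCS by substituting the column-pointer array.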

5.1.1. A better load-balancing model for iterative solvers

The load-balancing problem for parallel iterative solvers has usually been stated considering only the SpMxV computations. However, linear vector operations (i.e., DAXPY and inner-product computations)

involved in iterative solvers may have a considerable effect on parallel performance with increasing sparsity of the coefficient matrix. Here, we consider incorporating vector operations into the load-balancing model as much as possible.

For the sake of discussion, we will investigate this

problem for a coarse-grain formulation [2,3,32] of the conjugate-gradient (CG) algorithm. Each iteration of CG involves, in order of computational dependency, one SpMxV, two inner-product, and three DAXPY computations. DAXPY computations do not involve communication, whereas inner-product computations necessitate a post global reduction operation [19] on the results of the two local inner products. In rowwise striping, the pre-communication operations needed for SpMxV and the global reduction operation constitute pre- and post-synchronization points, respectively, for the aggregate of one local SpMxV and two local inner-product computations. In columnwise striping, the global reduction operation in the current iteration and the post-communication operations needed for SpMxV in the next iteration constitute the pre- and post-synchronization points, respectively, for an aggregate of three local DAXPY and one local SpMxV computations. Thus, columnwise striping may be favored for a wider coverage of load balancing. Each vector entry incurs a multiply-and-add operation during each linear vector operation. Hence, DAXPY computations can easily be incorporated into the load-balancing model for the CS scheme by adding a cost of three to the computational weight wi of the atomic task ti representing column i of matrix A. The initial prefix-sum operation can still be avoided by computing a subchain weight Wij as Wij = COL[j+1] − COL[i] + 3(j − i + 1) in constant time. Note that the two local inner-product computations still remain uncovered in this balancing model.
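A sketch of the extended weight computation (using the same hypothetical CCS column-pointer array COL as above; the function name is illustrative), folding the DAXPY cost of three into each column of the subchain:

```python
# Sketch: subchain weight for the CS scheme with three DAXPY operations
# folded in, as in the text: W_ij = COL[j+1] - COL[i] + 3*(j - i + 1).

def cs_subchain_weight(COL, i, j):
    """Nonzero count of columns i..j plus a DAXPY cost of 3 per column."""
    return COL[j + 1] - COL[i] + 3 * (j - i + 1)

COL = [0, 2, 3, 6, 8]                 # 4 columns with 2, 1, 3, 2 nonzeros
print(cs_subchain_weight(COL, 0, 3))  # 8 nonzeros + 3*4 columns = 20
```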

5.2. Sort-first parallel direct volume rendering

Direct volume rendering (DVR) methods are widely used in rendering unstructured volumetric grids for visualization and interpretation of computer simulations performed for investigating physical phenomena in various fields of science and engineering. A DVR application contains two interacting domains: object space and image space. Object space is a 3D domain containing the volume data to be visualized. Image space (screen) is a 2D domain containing pixels from which rays are shot into the 3D object domain to determine the color values of the respective pixels. Based on these domains, there are basically two approaches for data-parallel DVR, image- and object-space parallelism, which are also called sort-first and sort-last parallelism according to the taxonomy based on the point of data redistribution in the rendering pipeline [25]. Pixels or pixel blocks constitute the atomic tasks in sort-first


parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed

into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first

parallel DVR can be classified as static and adaptive

[20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workload, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number

of primitives, with a bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied onto a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much smaller than those incurred by counting such primitives multiple times while adding up mesh-cell weights.

Minimizing the perimeter of the resulting regions in

the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array, because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.
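The two steps above, inverse-area weighting followed by Hilbert-order linearization, can be sketched as follows (Python; the cell lists and function names are illustrative assumptions, and the Hilbert conversion is the classic bit-twiddling formulation rather than code from the paper):

```python
# Sketch: build the real-valued 1D CCP workload array for adaptive
# image-space decomposition. Each primitive spreads a unit weight over the
# coarse-mesh cells it intersects (inverse-area heuristic), and the cells
# are then ordered along a Hilbert curve to form the 1D task chain.

def hilbert_index(n, x, y):
    """Position of cell (x, y) on the Hilbert curve of an n-by-n mesh
    (n a power of two), via the classic bit-twiddling conversion."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                        # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def workload_chain(primitive_cells, n):
    """primitive_cells: one list of intersected (x, y) cells per primitive."""
    weight = {}
    for cells in primitive_cells:
        share = 1.0 / len(cells)           # inverse of intersected-cell count
        for cell in cells:
            weight[cell] = weight.get(cell, 0.0) + share
    # Hilbert order over nonzero-weight cells gives the 1D workload array.
    order = sorted(weight, key=lambda c: hilbert_index(n, *c))
    return [weight[c] for c in order]

# Two primitives on a 2x2 mesh: one in a single cell, one spanning two cells.
chain = workload_chain([[(0, 0)], [(0, 1), (1, 1)]], 2)
print(chain, sum(chain))   # total weight equals the primitive count: 2.0
```

Note that, as in the text, zero-weight cells simply never appear in the chain, which is why the number of tasks can be far below the mesh resolution.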

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test dataset.

Table 2 presents the test problems we have used in our

experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights, respectively. The dataset for sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA

Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. Triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256×256, 512×512, and 1024×1024 superimposed on a screen of resolution 1024×1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because of the zero-weight tasks, which can be compressed away to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP

algorithms: H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N − P)P)-time dynamic-programming algorithm in Fig. 1, MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2, eBS refers to the ε-approximate bisection algorithm in Fig. 4, and NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2

Properties of the test set

Sparse-matrix dataset Direct volume rendering (DVR) dataset

Name No of tasks N Workload No of nonzeros ex time Name No of tasks N Workload

Total Wtot Per rowcol (task) SpMxV Total Per task

wavg wmin wmax (ms) Wtot wavg wmin wmax

NL 7039 105 089 1493 1 361 2255 blunt256 17 303 303 K 1754 0020 159097

cre-d 8926 372 266 4171 1 845 7220 blunt512 93 231 314 K 336 0004 66150

CQ9 9278 221 590 2388 1 702 4590 blunt1024 372 824 352 K 094 0001 41104

ken-11 14 694 82 454 561 2 243 1965 post256 19 653 495 K 2519 0077 324550

mod2 34 774 604 910 1740 1 941 12405 post512 134 950 569 K 422 0015 109200

world 34 506 582 064 1687 1 972 11945 post1024 539 994 802 K 149 0004 154678


Figs. 5(a) and (b), respectively. Abbreviations ending with "+" are used to represent our improved versions of these algorithms; that is, DP+, eBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7, and EBS refers to our exact bisection algorithm given in Fig. 9. Both the eBS and eBS+ algorithms are effectively used as exact algorithms for the SpM dataset, with ε = 1 for the integer-valued workload arrays in the SpM dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms there due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics

and exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. The OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges, normalized with respect to N, i.e., Σ_{p=1}^{P−1} ΔSp / N, where ΔSp = SHp − SLp + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index

bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms

depends on two factors: the number of probes and the cost of each probe. The dynamic index-bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and eBS+ on the DVR dataset, we forced the eBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. The column "eBS+ & BID" refers to this scheme, and the values after "+" denote the numbers of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC- and NC columns

shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the eBS and eBS+ columns in the SpM dataset shows that using BRB instead of Wtot


Table 3
Percent load-imbalance values

Sparse-matrix dataset
Name     P      H1      H2      RB      OPT
NL       16      2.60    2.44    1.20    0.35
         32      5.02    5.75    3.44    0.95
         64      8.95    9.01    5.60    2.37
         128    33.13   27.16   22.78    4.99
         256    69.55   69.55   60.78   14.25
cre-d    16      2.27    0.98    0.53    0.45
         32      4.19    4.42    3.74    1.03
         64      7.12    4.92    4.34    1.73
         128    25.57   18.73   16.70    2.88
         256    37.54   26.81   35.20   10.95
CQ9      16      1.85    1.85    0.58    0.58
         32      5.65    2.88    2.24    0.90
         64     13.25   11.49    7.64    1.43
         128    33.96   32.22   22.34    3.51
         256    58.62   58.62   58.62   14.72
ken-11   16      3.74    2.01    0.98    0.21
         32      3.74    4.67    3.74    1.18
         64     13.17   13.17   13.17    1.29
         128    13.17   16.89   13.17    6.80
         256    50.99   50.99   50.99    7.11
mod2     16      0.06    0.06    0.06    0.03
         32      0.19    0.19    0.19    0.07
         64      7.42    2.72    2.18    0.18
         128    16.15    6.29    2.46    0.41
         256    19.47   19.47   18.92    1.23
world    16      0.27    0.18    0.09    0.04
         32      1.03    0.37    0.27    0.08
         64      4.73    4.73    4.73    0.28
         128     6.37    6.37    6.37    0.76
         256    27.99   27.41   27.41    1.11
Averages over P
         16      1.76    1.15    0.74    0.36
         32      3.99    3.39    2.88    0.76
         64      9.01    6.78    5.69    1.43
         128    19.97   17.66   14.09    3.36
         256    43.89   38.91   36.04    9.18

DVR dataset
Name       P      H1      H2      RB      OPT
blunt256   16      4.60    1.20    0.49    0.34
           32      6.93    2.61    1.94    1.12
           64     14.52    9.44    9.44    2.31
           128    38.25   24.39   16.67    4.82
           256    96.72   37.03   34.21   34.21
blunt512   16      0.95    0.98    0.98    0.16
           32      1.38    1.38    1.18    0.33
           64      2.87    2.69    1.66    0.53
           128     5.62    8.45    4.62    0.97
           256    14.18   14.33    9.34    2.28
blunt1024  16      0.94    0.57    0.95    0.10
           32      1.89    1.21    0.97    0.14
           64      4.99    2.16    1.44    0.26
           128    10.06    4.25    2.47    0.57
           256    19.65    9.68    9.68    0.98
post256    16      1.10    1.43    0.76    0.56
           32      3.23    3.98    3.23    1.11
           64     17.28   11.04   10.90    3.10
           128    45.35   29.09   29.09    8.29
           256    67.86   67.86   67.86   67.86
post512    16      0.49    1.25    0.33    0.33
           32      0.94    1.61    0.90    0.58
           64      4.85    5.33    4.85    0.94
           128    18.03   14.55   10.15    1.72
           256    30.03   25.29   25.29    3.73
post1024   16      0.70    0.70    0.53    0.20
           32      1.49    1.49    1.41    0.54
           64      2.85    2.85    1.49    0.91
           128    13.15   11.79    9.10    1.11
           256    40.50   13.37   14.19    2.54
Averages over P
           16      1.44    1.01    1.04    0.28
           32      2.64    2.05    1.61    0.64
           64      7.89    5.58    4.96    1.31
           128    21.74   15.42   12.02    2.91
           256    44.82   27.93   26.76   18.60

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of CCP

algorithms on the SpM and DVR datasets, respectively.

In Table 6, the execution times are given as percentages of single SpMxV times. For the DVR dataset, actual execution times (in ms) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification, so that improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS as


Table 4
Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset
Name     P      (N−P)P    L1       L3       C5
NL       16      15.97     0.32     0.13     0.036
         32      31.86     1.23     0.58     0.198
         64      63.43     4.97     2.24     0.964
         128    125.69    19.81    19.60     4.305
         256    246.73    77.96    82.83    22.942
cre-d    16      15.97     0.16     0.04     0.019
         32      31.89     0.76     0.90     0.137
         64      63.55     2.89     1.85     0.610
         128    126.18    11.59    15.12     2.833
         256    248.69    46.66    57.54    16.510
CQ9      16      15.97     0.30     0.03     0.023
         32      31.89     1.21     0.34     0.187
         64      63.57     4.84     2.96     0.737
         128    126.25    19.20    18.66     3.121
         256    248.96    74.84    86.80    23.432
ken-11   16      15.98     0.28     0.11     0.024
         32      31.93     1.10     1.13     0.262
         64      63.73     4.42     7.34     0.655
         128    126.89    17.60    14.56     4.865
         256    251.56    69.18    85.44    13.076
mod2     16      15.99     0.16     0.00     0.002
         32      31.97     0.55     0.04     0.013
         64      63.88     2.14     1.22     0.109
         128    127.53     8.62     2.59     0.411
         256    254.12    34.49    39.02     2.938
world    16      15.99     0.15     0.01     0.004
         32      31.97     0.59     0.06     0.020
         64      63.88     2.30     2.81     0.158
         128    127.53     9.23     7.19     0.850
         256    254.11    36.97    53.15     2.635
Averages over P
         16      15.97     0.21     0.06     0.018
         32      31.88     0.86     0.68     0.136
         64      63.52     3.43     2.63     0.516
         128    126.07    13.68    12.74     2.731
         256    248.26    54.28    57.55    14.089

DVR dataset
Name       P      (N−P)P    L1        L3       C5
blunt256   16      15.99     0.38      0.07     0.003
           32      31.94     1.22      0.30     0.144
           64      63.77     5.10      3.81     0.791
           128    127.06    21.66     12.88     2.784
           256    252.23   101.68     50.48    10.098
blunt512   16      16.00     0.09      0.17     0.006
           32      31.99     0.35      0.26     0.043
           64      63.96     1.64      0.69     0.144
           128    127.83     6.65      4.59     0.571
           256    255.30    28.05     16.71     2.262
blunt1024  16      16.00     0.05      0.09     0.010
           32      32.00     0.18      0.18     0.018
           64      63.99     0.77      0.74     0.068
           128    127.96     3.43      2.53     0.300
           256    255.82    13.92     20.36     1.648
post256    16      15.99     0.35      0.04     0.021
           32      31.95     1.50      0.52     0.162
           64      63.79     6.12      4.26     0.932
           128    127.17    26.48     23.68     4.279
           256    252.68   124.76     90.48    16.485
post512    16      16.00     0.13      0.02     0.003
           32      31.99     0.56      0.14     0.063
           64      63.97     2.33      2.41     0.413
           128    127.88     9.33     10.15     1.714
           256    255.52    37.08     46.28     6.515
post1024   16      16.00     0.17      0.07     0.020
           32      32.00     0.54      0.27     0.087
           64      63.99     2.19      0.61     0.210
           128    127.97     8.77      9.48     0.918
           256    255.88    34.97     27.28     4.479
Averages over P
           16      16.00     0.20      0.08     0.011
           32      31.98     0.73      0.28     0.096
           64      63.91     3.03      2.09     0.426
           128    127.65    12.72     10.55     1.428
           256    254.57    56.74     41.93     6.915

L1, L3, and C5 denote the separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5.


an exact algorithm for general workload arrays is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster

than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of

existing exact CCP algorithms shows that NC is twoorders of magnitude faster than both DP and MS for

both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are

significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset
Name     P      NC-    NC     NC+    εBS    εBS+   EBS
NL       16     177    21     7      17     6      7
         32     352    19     7      17     7      7
         64     720    27     5      16     7      6
         128    1450   40     14     17     8      8
         256    2824   51     14     16     8      8
cre-d    16     190    20     6      19     7      5
         32     393    25     11     19     9      8
         64     790    24     10     19     8      6
         128    1579   27     10     19     9      9
         256    3183   37     13     18     9      9
CQ9      16     182    19     3      18     6      5
         32     373    22     8      18     8      7
         64     740    29     12     18     8      8
         128    1492   40     12     17     8      8
         256    2971   50     14     18     9      9
ken-11   16     185    21     6      17     6      6
         32     364    20     6      17     6      6
         64     721    48     9      17     7      7
         128    1402   61     8      16     7      7
         256    2783   96     7      17     8      8
mod2     16     210    20     6      20     5      4
         32     432    26     8      19     5      5
         64     867    24     6      19     8      8
         128    1727   38     7      20     7      7
         256    3444   47     9      20     9      9
world    16     211    22     6      19     5      4
         32     424    26     5      19     6      6
         64     865    30     12     19     9      9
         128    1730   44     10     19     8      8
         256    3441   44     11     19     10     10
Averages over P
         16     193    20.5   5.7    18.5   5.8    5.2
         32     390    23.0   7.5    18.3   6.8    6.5
         64     784    30.3   9.0    18.0   7.8    7.3
         128    1564   41.7   10.2   18.0   7.8    7.8
         256    3108   54.2   11.3   18.0   8.8    8.8

DVR dataset
Name       P      NC-    NC     NC+    εBS+&BID   EBS
blunt256   16     202    18     6      16+1       9
           32     407    19     7      19+1       10
           64     835    18     10     21+1       12
           128    1686   25     15     22+1       13
           256    3556   203    62     23+1       11
blunt512   16     245    20     8      17+1       9
           32     502    24     13     18+1       10
           64     1003   23     11     18+1       12
           128    2023   26     12     20+1       13
           256    4051   23     12     21+1       13
blunt1024  16     271    21     10     16+1       13
           32     557    24     12     18+1       12
           64     1127   21     11     18+1       12
           128    2276   28     13     19+1       13
           256    4564   27     16     21+1       15
post256    16     194    22     9      18+1       7
           32     409    19     8      20+1       12
           64     823    17     11     22+1       12
           128    1653   19     13     22+1       14
           256    3345   191    102    23+1       13
post512    16     235    24     9      16+1       10
           32     485    23     12     18+1       11
           64     975    22     13     20+1       14
           128    1975   25     17     21+1       14
           256    3947   29     20     22+1       15
post1024   16     261    27     10     17+1       13
           32     538    23     13     19+1       14
           64     1090   22     12     17+1       14
           128    2201   25     15     21+1       16
           256    4428   28     16     22+1       16
Averages over P
           16     235    22.0   8.7    16.7+1     10.2
           32     483    22.0   10.8   18.7+1     11.5
           64     976    20.5   11.3   19.3+1     12.7
           128    1969   24.7   14.2   20.8+1     13.8
           256    3982   83.5   38.0   22.0+1     13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values

seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement

ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6
Partitioning times for the sparse-matrix dataset as percentages of SpMxV times

                Heuristics            Existing exact algorithms       Proposed exact algorithms
Name     P      H1    H2    RB      DP     MS     εBS    NC        DP+     MS+     εBS+    NC+     BID
NL       16     0.09  0.09  0.09    93     119    0.93   1.20      0.40    0.44    0.40    0.40    0.22
         32     0.18  0.18  0.13    177    194    1.77   2.17      1.73    1.73    0.80    0.80    0.44
         64     0.35  0.35  0.31    367    307    3.15   5.85      5.81    4.12    1.86    1.77    1.82
         128    0.89  0.84  0.71    748    485    6.30   17.12     19.87   26.92   4.26    7.32    4.21
         256    1.51  1.55  1.37    1461   757    11.09  40.31     91.80   96.41   9.80    15.70   21.06
cre-d    16     0.03  0.01  0.01    33     36     0.33   0.37      0.12    0.06    0.10    0.11    0.03
         32     0.04  0.06  0.06    71     61     0.65   0.93      0.47    1.20    0.29    0.33    0.10
         64     0.12  0.12  0.08    147    98     1.22   1.69      1.66    1.83    0.61    0.71    0.21
         128    0.28  0.29  0.14    283    150    2.41   3.59      5.47    10.94   1.68    1.68    0.61
         256    0.53  0.54  0.25    571    221    4.20   10.22     27.05   26.02   3.30    4.46    1.20
CQ9      16     0.04  0.04  0.04    59     73     0.48   0.59      0.20    0.13    0.15    0.13    0.17
         32     0.09  0.09  0.07    127    120    0.94   1.31      0.94    0.72    0.41    0.41    0.28
         64     0.20  0.17  0.15    248    195    1.85   3.18      3.01    4.47    1.00    1.44    0.68
         128    0.44  0.44  0.31    485    303    3.18   8.52      9.65    18.82   2.44    3.05    2.51
         256    0.92  0.92  0.63    982    469    6.51   20.33     69.56   72.20   5.32    7.45    14.92
ken-11   16     0.10  0.10  0.10    238    344    1.27   1.88      0.56    0.61    0.46    0.41    0.25
         32     0.20  0.20  0.15    501    522    2.29   3.10      3.82    4.78    1.22    1.17    1.63
         64     0.46  0.46  0.36    993    778    4.48   13.08     9.41    24.78   3.10    3.46    2.29
         128    1.17  1.17  0.71    1876   1139   9.21   39.59     50.13   50.03   6.41    6.62    17.40
         256    2.14  2.14  1.42    3827   1706   17.76  119.13    129.01  241.48  14.91   14.50   29.11
mod2     16     0.02  0.02  0.01    92     119    0.27   0.34      0.07    0.05    0.06    0.06    0.03
         32     0.04  0.04  0.02    188    192    0.51   0.78      0.19    0.18    0.16    0.15    0.07
         64     0.11  0.10  0.06    378    307    1.04   1.91      0.86    2.18    0.51    0.55    0.23
         128    0.24  0.23  0.11    751    482    2.28   5.92      2.61    3.72    1.06    1.23    0.53
         256    0.48  0.48  0.23    1486   756    4.26   11.14     12.11   50.04   2.91    3.43    3.66
world    16     0.02  0.02  0.02    99     122    0.29   0.33      0.08    0.05    0.07    0.08    0.03
         32     0.04  0.04  0.04    198    196    0.59   0.88      0.26    0.21    0.18    0.19    0.08
         64     0.12  0.11  0.09    385    306    1.16   1.70      1.09    4.84    0.64    0.54    0.35
         128    0.24  0.23  0.19    769    496    2.19   5.32      4.48    10.15   1.31    1.23    1.77
         256    0.47  0.47  0.36    1513   764    4.06   11.83     11.54   69.99   3.46    3.50    2.48
Averages over P
         16     0.05  0.05  0.05    102    136    0.60   0.78      0.24    0.22    0.21    0.20    0.12
         32     0.10  0.10  0.08    210    214    1.12   1.53      1.23    1.47    0.51    0.51    0.43
         64     0.23  0.22  0.18    420    332    2.15   4.57      3.64    7.04    1.29    1.41    0.93
         128    0.54  0.53  0.36    819    509    4.26   13.34     15.37   20.10   2.86    3.52    4.51
         256    1.01  1.01  0.71    1640   779    7.98   35.49     56.84   92.69   6.62    8.17    12.07


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of

the proposed exact CCP algorithms shows that BID is

the clear winner for small-to-moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in ms) for the DVR dataset

           Prefix                            Existing                    Proposed
Name       sum     P      H1    H2    RB     DP      MS      NC       DP+      MS+      NC+     EBS    BID
blunt256   1.95    16     0.02  0.02  0.03   68      49      0.36     0.21     0.53     0.10    0.14   0.03
                   32     0.05  0.05  0.05   141     77      0.77     0.76     0.76     0.21    0.27   0.24
                   64     0.11  0.12  0.09   286     134     1.39     3.86     8.72     0.64    0.71   0.78
                   128    0.24  0.25  0.20   581     206     3.66     12.37    29.95    1.78    1.84   2.33
                   256    0.47  0.50  0.37   1139    296     52.53    42.89    0.78     13.96   3.11   56.69
blunt512   13.45   16     0.03  0.03  0.03   356     200     0.55     0.25     0.91     0.14    0.16   0.09
                   32     0.07  0.07  0.07   792     353     1.31     1.23     2.13     0.44    0.43   0.22
                   64     0.18  0.19  0.14   1688    593     2.65     3.68     9.28     0.94    1.10   0.45
                   128    0.39  0.40  0.28   3469    979     5.84     15.01    46.72    2.29    2.63   1.65
                   256    0.74  0.75  0.54   7040    1637    9.70     57.50    164.49   5.59    5.95   10.72
blunt1024  59.12   16     0.03  0.03  0.03   1455    780     0.75     0.93     4.77     0.25    0.28   0.19
                   32     0.12  0.11  0.08   3251    1432    1.81     2.02     8.92     0.60    0.62   0.31
                   64     0.27  0.27  0.19   6976    2353    3.50     7.35     22.28    1.27    1.37   1.39
                   128    0.50  0.51  0.37   14337   3911    9.14     33.01    69.38    3.36    3.56   16.21
                   256    0.97  1.00  0.68   29417   6567    16.59    180.09   538.10   8.48    8.98   16.29
post256    2.16    16     0.02  0.03  0.03   79      61      0.46     0.18     0.13     0.10    0.11   0.24
                   32     0.05  0.05  0.05   157     100     0.77     1.05     1.74     0.26    0.31   0.76
                   64     0.10  0.11  0.10   336     167     1.37     5.26     9.72     0.61    0.70   4.67
                   128    0.25  0.26  0.18   672     278     2.90     22.94    55.10    1.65    1.77   23.96
                   256    0.47  0.48  0.37   1371    338     45.16    84.78    0.77     13.04   3.62   299.32
post512    20.55   16     0.02  0.02  0.03   612     454     0.63     0.20     0.44     0.14    0.16   1.21
                   32     0.07  0.07  0.07   1254    699     1.30     2.49     1.95     0.38    0.42   3.36
                   64     0.18  0.18  0.13   2559    1139    2.58     16.93    38.73    0.97    1.07   8.53
                   128    0.38  0.39  0.28   5221    1936    5.84     68.25    156.15   2.27    2.44   31.54
                   256    0.70  0.73  0.53   10683   3354    12.67    256.48   726.54   5.44    5.39   140.88
post1024   69.09   16     0.03  0.04  0.03   2446    1863    0.91     2.88     5.73     0.17    0.21   2.63
                   32     0.09  0.08  0.08   5056    2917    1.61     13.57    17.37    0.51    0.58   12.73
                   64     0.25  0.26  0.17   10340   4626    3.56     35.05    33.25    1.35    1.47   32.53
                   128    0.51  0.53  0.36   21102   7838    7.90     156.34   546.92   2.95    3.28   76.96
                   256    0.95  0.98  0.67   43652   13770   16.30    764.59   1451.87  7.55    7.02   300.70
Averages normalized with respect to RB times (prefix-sum time not included)
                   16     0.83  0.94  1.00   27863   18919   20.33    25.83    69.50    5.00    5.89   24.39
                   32     1.10  1.06  1.00   23169   12156   18.47    47.37    72.82    5.83    6.46   39.02
                   64     1.30  1.35  1.00   22636   9291    17.88    82.81    145.19   7.00    7.81   53.81
                   128    1.35  1.39  1.00   22506   7553    20.46    168.36   481.19   8.60    9.31   86.81
                   256    1.35  1.39  1.00   24731   6880    59.10    390.25   772.99   19.55   10.51  286.77
Averages normalized with respect to RB times (prefix-sum time included)
                   16     1.00  1.00  1.00   32      23      1.08     1.04     1.09     1.01    1.02   1.03
                   32     1.00  1.00  1.00   66      36      1.15     1.21     1.29     1.04    1.05   1.13
                   64     1.00  1.00  1.00   135     58      1.27     1.97     2.90     1.11    1.12   1.55
                   128    1.01  1.01  1.00   275     94      1.62     4.75     10.53    1.28    1.30   3.25
                   256    1.02  1.02  1.00   528     142     7.98     14.64    13.71    2.95    1.55   26.73


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower

than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.
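All of the parametric-search variants whose probe counts are compared above share one core operation: testing whether a candidate bottleneck value B is feasible. The following sketch is our own illustration of that test (not the paper's pseudocode); it greedily extends each part as far as the prefix-summed workload allows, using binary search, while the paper's careful implementations additionally restrict each search to the bounded separator-index ranges.

```c
#include <assert.h>
#include <stddef.h>

/* probe(B): can the chain be split into at most P consecutive parts,
 * each with load <= B?  W[i] holds the prefix sum of the first i task
 * weights (W[0] = 0, W[N] = Wtot), so part loads are differences of
 * prefix sums.  Runs in O(P log N) time. */
static int probe(const double *W, size_t N, size_t P, double B)
{
    size_t s = 0;                     /* index of the last separator placed */
    for (size_t p = 0; p < P && s < N; p++) {
        /* binary search: largest t with W[t] - W[s] <= B */
        size_t lo = s, hi = N;
        while (lo < hi) {
            size_t mid = (lo + hi + 1) / 2;
            if (W[mid] - W[s] <= B) lo = mid; else hi = mid - 1;
        }
        if (lo == s) return 0;        /* a single task already exceeds B */
        s = lo;
    }
    return s == N;                    /* all N tasks covered by <= P parts */
}
```

Bisection-type algorithms call this test repeatedly while halving an interval of candidate bottleneck values, which is why reducing both the number of probes and the cost of each probe pays off directly in the running times reported above.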


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds into realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse matrix-vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.
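The static separator-index bounding idea summarized above can be sketched as follows. This simplified illustration derives a feasible range [SL[p], SH[p]] for each separator from an upper bound UB on the optimal bottleneck value alone (for instance, the bottleneck of an RB partition); the paper's Lemmas 4.1-4.4 and Corollary 4.5 tighten these ranges further using lower bounds as well, and all names here are ours:

```c
#include <assert.h>
#include <stddef.h>

/* The first p parts can carry at most p*UB load, so W[s_p] <= p*UB;
 * the last P-p parts can carry at most (P-p)*UB, so
 * W[s_p] >= Wtot - (P-p)*UB.  W is the prefix-sum array with
 * W[0] = 0 and W[N] = Wtot; ranges are filled for p = 1..P-1. */
static void separator_ranges(const double *W, size_t N, size_t P,
                             double UB, size_t *SL, size_t *SH)
{
    double Wtot = W[N];
    for (size_t p = 1; p < P; p++) {
        /* SL[p]: smallest i with W[i] >= Wtot - (P - p) * UB */
        double need = Wtot - (double)(P - p) * UB;
        size_t lo = 0, hi = N;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (W[mid] >= need) hi = mid; else lo = mid + 1;
        }
        SL[p] = lo;
        /* SH[p]: largest i with W[i] <= p * UB */
        double cap = (double)p * UB;
        lo = 0; hi = N;
        while (lo < hi) {
            size_t mid = (lo + hi + 1) / 2;
            if (W[mid] <= cap) lo = mid; else hi = mid - 1;
        }
        SH[p] = lo;
    }
}
```

The tighter UB is (i.e., the better the initial heuristic), the narrower these ranges become, which is exactly why the experiments show the bounding scheme shrinking the DP search space far below the unrestricted (N−P)P table size.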

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130-149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641-645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554-1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48-57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673-693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625-I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237-267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141-148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769-771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048-2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123-127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341-361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170-175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040-1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51-93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235-249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550-564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23-32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75-84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119-134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295-306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322-1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288-299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865-881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068-1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256-266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78-86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


parallelism, whereas volume elements (primitives) constitute the atomic tasks in sort-last parallelism.

In sort-first parallel DVR, screen space is decomposed

into regions, and each region is assigned to a separate processor for local rendering. The primitives whose projection areas intersect more than one region are replicated. Sort-first parallelism has the advantage that processors generate complete images for their local screen subregions, but it faces load-balancing problems in the DVR of unstructured grids due to uneven on-screen primitive distribution.

Image-space decomposition schemes for sort-first

parallel DVR can be classified as static and adaptive

[20]. Static decomposition is a view-independent scheme, and the load-balancing problem is solved implicitly by scattered assignment of pixels or pixel blocks. The load-balancing performance of this scheme depends on the assumption that neighboring pixels are likely to have equal workloads, since they are likely to have similar views of the volume. As the scattered assignment scheme assigns adjacent pixels or pixel blocks to different processors, it disturbs image-space coherency and increases the amount of primitive replication. Adaptive decomposition is a view-dependent scheme, and the load-balancing problem is solved explicitly by using the primitive distribution on the screen.

In adaptive image-space decomposition, the number

of primitives, with bounding-box approximation, is taken to be the workload of a screen region. The primitives constituting the volume are tallied to a 2D coarse mesh superimposed on the screen. Some primitives may intersect multiple cells. The inverse-area heuristic [26] is used to decrease the amount of error due to counting such primitives multiple times: each primitive increments the weight of each cell it intersects by a value inversely proportional to the number of cells the primitive intersects. In this heuristic, if we assume that there are no shared primitives among screen regions, then the sum of the weights of the individual mesh cells forming a region gives the number of primitives in that region. Shared primitives may still cause some errors, but such errors are much less than those of counting such primitives multiple times while adding mesh-cell weights.

Minimizing the perimeter of the resulting regions in

the decomposition is expected to minimize the communication overhead due to the shared primitives. 1D decomposition, i.e., horizontal or vertical striping of the screen, suffers from unscalability. A Hilbert space-filling curve [31] is widely used for 2D decomposition of 2D nonuniform workloads. In this scheme [20], the 2D coarse mesh superimposed on the screen is traversed according to the Hilbert curve to map the 2D coarse mesh to a 1D chain of mesh cells. The load-balancing problem in this decomposition scheme then reduces to the CCP problem. Using a Hilbert curve as the space-

filling curve is an implicit effort towards reducing the total perimeter, since the Hilbert curve avoids jumps during the traversal of the 2D coarse mesh. Note that the 1D workload array used for partitioning is a real-valued array because of the inverse-area heuristic used for computing the weights of the coarse-mesh cells.
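The inverse-area weighting described above can be sketched as follows. The array layout (`hits[i]` listing the coarse-mesh cells intersected by primitive i, with `nhits[i]` the count) is our own illustrative assumption, standing in for the rasterization step of the real renderer:

```c
#include <assert.h>
#include <stddef.h>

/* Inverse-area heuristic: each primitive contributes 1/k to each of the
 * k coarse-mesh cells its bounding-box projection intersects, so the
 * cell weights of a region sum to (approximately) its primitive count
 * when no primitives are shared between regions. */
static void inverse_area_weights(double *weight, size_t ncells,
                                 const size_t *const *hits,
                                 const size_t *nhits, size_t nprims)
{
    for (size_t c = 0; c < ncells; c++)
        weight[c] = 0.0;
    for (size_t i = 0; i < nprims; i++) {
        double share = 1.0 / (double)nhits[i];    /* 1/k per intersected cell */
        for (size_t k = 0; k < nhits[i]; k++)
            weight[hits[i][k]] += share;
    }
}
```

These fractional contributions are why the resulting 1D workload array is real-valued, which in turn is what rules out the integer-only ε = 1 trick used for the sparse-matrix dataset.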

6. Experimental results

All algorithms were implemented in the C programming language. All experiments were carried out on a workstation equipped with a 133 MHz PowerPC and 64 MB of memory. We have experimented with P = 16-, 32-, 64-, 128-, and 256-way partitioning of each test data.

Table 2 presents the test problems we have used in our

experiments. In this table, the wavg, wmin, and wmax columns display the average, minimum, and maximum task weights. The dataset for sparse-matrix decomposition comes from matrices of linear programming test problems from the Netlib suite [11] and the IOWA

Optimization Center [22]. The sparsity patterns of these matrices are obtained by multiplying the respective rectangular constraint matrices with their transposes. Note that the number of tasks in the sparse-matrix (SpM) dataset also refers to the number of rows and columns of the respective matrix. The dataset for image-space decomposition comes from sort-first parallel DVR of the curvilinear grids blunt-fin and post, representing the results of computational fluid dynamics simulations, which are commonly used in volume rendering. The raw grids consist of hexahedral elements and are converted into an unstructured tetrahedral data format by dividing each hexahedron into five tetrahedra. Triangular faces of the tetrahedra constitute the primitives mentioned in Section 5.2. Three distinct 1D workload arrays are constructed for both blunt-fin and post, as described in Section 5.2, for coarse meshes of resolutions 256 × 256, 512 × 512, and 1024 × 1024 superimposed on a screen of resolution 1024 × 1024. The properties of these six workload arrays are displayed in Table 2. The number of tasks is much less than the coarse-mesh resolution because of the zero-weight tasks, which can be compressed to retain only the nonzero-weight tasks.

The following abbreviations are used for the CCP

algorithms: H1 and H2 refer to Miguet and Pierson's [24] heuristics, and RB refers to the recursive-bisection heuristic, all described in Section 3.1. DP refers to the O((N−P)P)-time dynamic-programming algorithm in Fig. 1. MS refers to Manne and Sørevik's iterative-refinement algorithm in Fig. 2. εBS refers to the ε-approximate bisection algorithm in Fig. 4. NC- and NC refer to the straightforward and careful implementations of Nicol's parametric-search algorithm in


Table 2
Properties of the test set

Sparse-matrix dataset
              No. of     Workload
Name          tasks N    Total Wtot   Per row/col (task)        SpMxV time (ms)
                                      wavg     wmin   wmax
NL              7 039    105 089      14.93    1      361        22.55
cre-d           8 926    372 266      41.71    1      845        72.20
CQ9             9 278    221 590      23.88    1      702        45.90
ken-11         14 694     82 454       5.61    2      243        19.65
mod2           34 774    604 910      17.40    1      941       124.05
world          34 506    582 064      16.87    1      972       119.45

Direct volume rendering (DVR) dataset
              No. of      Workload
Name          tasks N     Total Wtot   Per task
                                       wavg     wmin     wmax
blunt256       17 303     303 K        17.54    0.020    1590.97
blunt512       93 231     314 K         3.36    0.004     661.50
blunt1024     372 824     352 K         0.94    0.001     411.04
post256        19 653     495 K        25.19    0.077    3245.50
post512       134 950     569 K         4.22    0.015    1092.00
post1024      539 994     802 K         1.49    0.004    1546.78


Figs. 5(a) and (b), respectively. Abbreviations ending with "+" are used to represent our improved versions of these algorithms. That is, DP+, εBS+, and NC+ refer to our algorithms given in Figs. 6, 8, and 10, respectively, and MS+ refers to the algorithm described in Section 4.3.1. BID refers to our bidding algorithm given in Fig. 7. EBS refers to our exact bisection algorithm given in Fig. 9. Both the εBS and εBS+ algorithms are effectively used as exact algorithms with ε = 1 for the integer-valued workload arrays in the SpM dataset. However, these two algorithms were not tested on the DVR dataset, since they remain approximation algorithms due to the real-valued task weights in DVR.

Table 3 compares the load-balancing quality of heuristics

and exact algorithms. In this table, percent load-imbalance values are computed as 100 × (B − B*)/B*, where B denotes the bottleneck value of the respective partition and B* = Wtot/P denotes the ideal bottleneck value. OPT values refer to the load-imbalance values of the optimal partitions produced by the exact algorithms. Table 3 clearly shows that considerably better partitions are obtained in both datasets by using exact algorithms instead of heuristics. The quality gap between exact algorithms and heuristics increases with increasing P. The only exceptions to these observations are the 256-way partitionings of blunt256 and post256, for which the RB heuristic finds optimal solutions.

Table 4 compares the performances of the static

separator-index bounding schemes discussed in Section 4.1. The values displayed in this table are the sums of the sizes of the separator-index ranges normalized with respect to N, i.e., sum_{p=1}^{P−1} ΔSp/N, where ΔSp = SHp − SLp + 1. For each CCP instance, (N − P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index

bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B* according to Lemma 4.4. Comparison of the (N − P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of the decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of the larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms

depends on two factors: the number of probes and the cost of each probe. The dynamic index-bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = wmin and then improving the resulting partition using the BID algorithm. Column εBS+ &BID refers to this scheme, and the values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns

shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B_RB (the bottleneck value of the partition found by the RB heuristic) instead of Wtot


Table 3
Percent load-imbalance values

Sparse-matrix dataset                            DVR dataset

CCP instance   Heuristics             OPT        CCP instance     Heuristics             OPT
Name      P    H1      H2      RB                Name        P    H1      H2      RB

NL       16    2.60    2.44    1.20    0.35      blunt256   16    4.60    1.20    0.49    0.34
         32    5.02    5.75    3.44    0.95                 32    6.93    2.61    1.94    1.12
         64    8.95    9.01    5.60    2.37                 64   14.52    9.44    9.44    2.31
        128   33.13   27.16   22.78    4.99                128   38.25   24.39   16.67    4.82
        256   69.55   69.55   60.78   14.25                256   96.72   37.03   34.21   34.21
cre-d    16    2.27    0.98    0.53    0.45      blunt512   16    0.95    0.98    0.98    0.16
         32    4.19    4.42    3.74    1.03                 32    1.38    1.38    1.18    0.33
         64    7.12    4.92    4.34    1.73                 64    2.87    2.69    1.66    0.53
        128   25.57   18.73   16.70    2.88                128    5.62    8.45    4.62    0.97
        256   37.54   26.81   35.20   10.95                256   14.18   14.33    9.34    2.28
CQ9      16    1.85    1.85    0.58    0.58      blunt1024  16    0.94    0.57    0.95    0.10
         32    5.65    2.88    2.24    0.90                 32    1.89    1.21    0.97    0.14
         64   13.25   11.49    7.64    1.43                 64    4.99    2.16    1.44    0.26
        128   33.96   32.22   22.34    3.51                128   10.06    4.25    2.47    0.57
        256   58.62   58.62   58.62   14.72                256   19.65    9.68    9.68    0.98
ken-11   16    3.74    2.01    0.98    0.21      post256    16    1.10    1.43    0.76    0.56
         32    3.74    4.67    3.74    1.18                 32    3.23    3.98    3.23    1.11
         64   13.17   13.17   13.17    1.29                 64   17.28   11.04   10.90    3.10
        128   13.17   16.89   13.17    6.80                128   45.35   29.09   29.09    8.29
        256   50.99   50.99   50.99    7.11                256   67.86   67.86   67.86   67.86
mod2     16    0.06    0.06    0.06    0.03      post512    16    0.49    1.25    0.33    0.33
         32    0.19    0.19    0.19    0.07                 32    0.94    1.61    0.90    0.58
         64    7.42    2.72    2.18    0.18                 64    4.85    5.33    4.85    0.94
        128   16.15    6.29    2.46    0.41                128   18.03   14.55   10.15    1.72
        256   19.47   19.47   18.92    1.23                256   30.03   25.29   25.29    3.73
world    16    0.27    0.18    0.09    0.04      post1024   16    0.70    0.70    0.53    0.20
         32    1.03    0.37    0.27    0.08                 32    1.49    1.49    1.41    0.54
         64    4.73    4.73    4.73    0.28                 64    2.85    2.85    1.49    0.91
        128    6.37    6.37    6.37    0.76                128   13.15   11.79    9.10    1.11
        256   27.99   27.41   27.41    1.11                256   40.50   13.37   14.19    2.54

Averages over P
         16    1.76    1.15    0.74    0.36                 16    1.44    1.01    1.04    0.28
         32    3.99    3.39    2.88    0.76                 32    2.64    2.05    1.61    0.64
         64    9.01    6.78    5.69    1.43                 64    7.89    5.58    4.96    1.31
        128   19.97   17.66   14.09    3.36                128   21.74   15.42   12.02    2.91
        256   43.89   38.91   36.04    9.18                256   44.82   27.93   26.76   18.60


for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only a minor improvement on the SpM dataset because of the already discrete nature of the integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively.

In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in msecs) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification so that the improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6, since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS, as


Table 4
Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset                                     DVR dataset

CCP instance   (N−P)P     L1       L3       C5            CCP instance   (N−P)P     L1       L3       C5
Name      P                                               Name        P

NL       16     15.97     0.32     0.13     0.036         blunt256   16     15.99     0.38     0.07     0.003
         32     31.86     1.23     0.58     0.198                    32     31.94     1.22     0.30     0.144
         64     63.43     4.97     2.24     0.964                    64     63.77     5.10     3.81     0.791
        128    125.69    19.81    19.60     4.305                   128    127.06    21.66    12.88     2.784
        256    246.73    77.96    82.83    22.942                   256    252.23   101.68    50.48    10.098
cre-d    16     15.97     0.16     0.04     0.019         blunt512   16     16.00     0.09     0.17     0.006
         32     31.89     0.76     0.90     0.137                    32     31.99     0.35     0.26     0.043
         64     63.55     2.89     1.85     0.610                    64     63.96     1.64     0.69     0.144
        128    126.18    11.59    15.12     2.833                   128    127.83     6.65     4.59     0.571
        256    248.69    46.66    57.54    16.510                   256    255.30    28.05    16.71     2.262
CQ9      16     15.97     0.30     0.03     0.023         blunt1024  16     16.00     0.05     0.09     0.010
         32     31.89     1.21     0.34     0.187                    32     32.00     0.18     0.18     0.018
         64     63.57     4.84     2.96     0.737                    64     63.99     0.77     0.74     0.068
        128    126.25    19.20    18.66     3.121                   128    127.96     3.43     2.53     0.300
        256    248.96    74.84    86.80    23.432                   256    255.82    13.92    20.36     1.648
ken-11   16     15.98     0.28     0.11     0.024         post256    16     15.99     0.35     0.04     0.021
         32     31.93     1.10     1.13     0.262                    32     31.95     1.50     0.52     0.162
         64     63.73     4.42     7.34     0.655                    64     63.79     6.12     4.26     0.932
        128    126.89    17.60    14.56     4.865                   128    127.17    26.48    23.68     4.279
        256    251.56    69.18    85.44    13.076                   256    252.68   124.76    90.48    16.485
mod2     16     15.99     0.16     0.00     0.002         post512    16     16.00     0.13     0.02     0.003
         32     31.97     0.55     0.04     0.013                    32     31.99     0.56     0.14     0.063
         64     63.88     2.14     1.22     0.109                    64     63.97     2.33     2.41     0.413
        128    127.53     8.62     2.59     0.411                   128    127.88     9.33    10.15     1.714
        256    254.12    34.49    39.02     2.938                   256    255.52    37.08    46.28     6.515
world    16     15.99     0.15     0.01     0.004         post1024   16     16.00     0.17     0.07     0.020
         32     31.97     0.59     0.06     0.020                    32     32.00     0.54     0.27     0.087
         64     63.88     2.30     2.81     0.158                    64     63.99     2.19     0.61     0.210
        128    127.53     9.23     7.19     0.850                   128    127.97     8.77     9.48     0.918
        256    254.11    36.97    53.15     2.635                   256    255.88    34.97    27.28     4.479

Averages over P
         16     15.97     0.21     0.06     0.018                    16     16.00     0.20     0.08     0.011
         32     31.88     0.86     0.68     0.136                    32     31.98     0.73     0.28     0.096
         64     63.52     3.43     2.63     0.516                    64     63.91     3.03     2.09     0.426
        128    126.07    13.68    12.74     2.731                   128    127.65    12.72    10.55     1.428
        256    248.26    54.28    57.55    14.089                   256    254.57    56.74    41.93     6.915

L1, L3, and C5 denote the separator-index ranges obtained according to the results of Lemmas 4.1 and 4.3 and Corollary 4.5, respectively.


an exact algorithm for general workload arrays, is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and that εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5
Number of probes performed by the parametric-search algorithms

Sparse-matrix dataset                                      DVR dataset

CCP instance   Parametric-search algorithms                CCP instance   Parametric-search algorithms
Name      P    NC−    NC    NC+   εBS   εBS+   EBS         Name        P   NC−    NC    NC+   εBS+ &BID   EBS

NL       16     177    21     7    17     6      7         blunt256   16    202    18     6    16 + 1       9
         32     352    19     7    17     7      7                    32    407    19     7    19 + 1      10
         64     720    27     5    16     7      6                    64    835    18    10    21 + 1      12
        128    1450    40    14    17     8      8                   128   1686    25    15    22 + 1      13
        256    2824    51    14    16     8      8                   256   3556   203    62    23 + 1      11
cre-d    16     190    20     6    19     7      5         blunt512   16    245    20     8    17 + 1       9
         32     393    25    11    19     9      8                    32    502    24    13    18 + 1      10
         64     790    24    10    19     8      6                    64   1003    23    11    18 + 1      12
        128    1579    27    10    19     9      9                   128   2023    26    12    20 + 1      13
        256    3183    37    13    18     9      9                   256   4051    23    12    21 + 1      13
CQ9      16     182    19     3    18     6      5         blunt1024  16    271    21    10    16 + 1      13
         32     373    22     8    18     8      7                    32    557    24    12    18 + 1      12
         64     740    29    12    18     8      8                    64   1127    21    11    18 + 1      12
        128    1492    40    12    17     8      8                   128   2276    28    13    19 + 1      13
        256    2971    50    14    18     9      9                   256   4564    27    16    21 + 1      15
ken-11   16     185    21     6    17     6      6         post256    16    194    22     9    18 + 1       7
         32     364    20     6    17     6      6                    32    409    19     8    20 + 1      12
         64     721    48     9    17     7      7                    64    823    17    11    22 + 1      12
        128    1402    61     8    16     7      7                   128   1653    19    13    22 + 1      14
        256    2783    96     7    17     8      8                   256   3345   191   102    23 + 1      13
mod2     16     210    20     6    20     5      4         post512    16    235    24     9    16 + 1      10
         32     432    26     8    19     5      5                    32    485    23    12    18 + 1      11
         64     867    24     6    19     8      8                    64    975    22    13    20 + 1      14
        128    1727    38     7    20     7      7                   128   1975    25    17    21 + 1      14
        256    3444    47     9    20     9      9                   256   3947    29    20    22 + 1      15
world    16     211    22     6    19     5      4         post1024   16    261    27    10    17 + 1      13
         32     424    26     5    19     6      6                    32    538    23    13    19 + 1      14
         64     865    30    12    19     9      9                    64   1090    22    12    17 + 1      14
        128    1730    44    10    19     8      8                   128   2201    25    15    21 + 1      16
        256    3441    44    11    19    10     10                   256   4428    28    16    22 + 1      16

Averages over P
         16     193  20.5   5.7  18.5   5.8    5.2                    16    235  22.0   8.7   16.7 + 1    10.2
         32     390  23.0   7.5  18.3   6.8    6.5                    32    483  22.0  10.8   18.7 + 1    11.5
         64     784  30.3   9.0  18.0   7.8    7.3                    64    976  20.5  11.3   19.3 + 1    12.7
        128    1564  41.7  10.2  18.0   7.8    7.8                   128   1969  24.7  14.2   20.8 + 1    13.8
        256    3108  54.2  11.3  18.0   8.8    8.8                   256   3982  83.5  38.0   22.0 + 1    13.8


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 63.0 times faster than DP on average in 16-way partitioning, and this ratio decreases to 37.8, 18.9, 10.6, and 5.6 with increasing numbers of processors. For the DVR dataset, if the initial prefix sum is not included, DP+ is 127.7, 57.8, 33.2, 15.9, and 7.1 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6
Partitioning times for the sparse-matrix dataset as percents of SpMxV times

CCP instance   Heuristics            Exact algorithms
Name      P    H1     H2     RB      Existing                              Proposed
                                     DP      MS      εBS     NC            DP+      MS+      εBS+    NC+     BID

NL       16    0.09   0.09   0.09      9.3    11.9    0.93    1.20          0.40     0.44    0.40    0.40    0.22
         32    0.18   0.18   0.13     17.7    19.4    1.77    2.17          1.73     1.73    0.80    0.80    0.44
         64    0.35   0.35   0.31     36.7    30.7    3.15    5.85          5.81     4.12    1.86    1.77    1.82
        128    0.89   0.84   0.71     74.8    48.5    6.30   17.12         19.87    26.92    4.26    7.32    4.21
        256    1.51   1.55   1.37    146.1    75.7   11.09   40.31         91.80    96.41    9.80   15.70   21.06
cre-d    16    0.03   0.01   0.01      3.3     3.6    0.33    0.37          0.12     0.06    0.10    0.11    0.03
         32    0.04   0.06   0.06      7.1     6.1    0.65    0.93          0.47     1.20    0.29    0.33    0.10
         64    0.12   0.12   0.08     14.7     9.8    1.22    1.69          1.66     1.83    0.61    0.71    0.21
        128    0.28   0.29   0.14     28.3    15.0    2.41    3.59          5.47    10.94    1.68    1.68    0.61
        256    0.53   0.54   0.25     57.1    22.1    4.20   10.22         27.05    26.02    3.30    4.46    1.20
CQ9      16    0.04   0.04   0.04      5.9     7.3    0.48    0.59          0.20     0.13    0.15    0.13    0.17
         32    0.09   0.09   0.07     12.7    12.0    0.94    1.31          0.94     0.72    0.41    0.41    0.28
         64    0.20   0.17   0.15     24.8    19.5    1.85    3.18          3.01     4.47    1.00    1.44    0.68
        128    0.44   0.44   0.31     48.5    30.3    3.18    8.52          9.65    18.82    2.44    3.05    2.51
        256    0.92   0.92   0.63     98.2    46.9    6.51   20.33         69.56    72.20    5.32    7.45   14.92
ken-11   16    0.10   0.10   0.10     23.8    34.4    1.27    1.88          0.56     0.61    0.46    0.41    0.25
         32    0.20   0.20   0.15     50.1    52.2    2.29    3.10          3.82     4.78    1.22    1.17    1.63
         64    0.46   0.46   0.36     99.3    77.8    4.48   13.08          9.41    24.78    3.10    3.46    2.29
        128    1.17   1.17   0.71    187.6   113.9    9.21   39.59         50.13    50.03    6.41    6.62   17.40
        256    2.14   2.14   1.42    382.7   170.6   17.76  119.13        129.01   241.48   14.91   14.50   29.11
mod2     16    0.02   0.02   0.01      9.2    11.9    0.27    0.34          0.07     0.05    0.06    0.06    0.03
         32    0.04   0.04   0.02     18.8    19.2    0.51    0.78          0.19     0.18    0.16    0.15    0.07
         64    0.11   0.10   0.06     37.8    30.7    1.04    1.91          0.86     2.18    0.51    0.55    0.23
        128    0.24   0.23   0.11     75.1    48.2    2.28    5.92          2.61     3.72    1.06    1.23    0.53
        256    0.48   0.48   0.23    148.6    75.6    4.26   11.14         12.11    50.04    2.91    3.43    3.66
world    16    0.02   0.02   0.02      9.9    12.2    0.29    0.33          0.08     0.05    0.07    0.08    0.03
         32    0.04   0.04   0.04     19.8    19.6    0.59    0.88          0.26     0.21    0.18    0.19    0.08
         64    0.12   0.11   0.09     38.5    30.6    1.16    1.70          1.09     4.84    0.64    0.54    0.35
        128    0.24   0.23   0.19     76.9    49.6    2.19    5.32          4.48    10.15    1.31    1.23    1.77
        256    0.47   0.47   0.36    151.3    76.4    4.06   11.83         11.54    69.99    3.46    3.50    2.48

Averages over P
         16    0.05   0.05   0.05     10.2    13.6    0.60    0.78          0.24     0.22    0.21    0.20    0.12
         32    0.10   0.10   0.08     21.0    21.4    1.12    1.53          1.23     1.47    0.51    0.51    0.43
         64    0.23   0.22   0.18     42.0    33.2    2.15    4.57          3.64     7.04    1.29    1.41    0.93
        128    0.54   0.53   0.36     81.9    50.9    4.26   13.34         15.37    20.10    2.86    3.52    4.51
        256    1.01   1.01   0.71    164.0    77.9    7.98   35.49         56.84    92.69    6.62    8.17   12.07


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small to moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for the 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7
Partitioning times (in msecs) for the DVR dataset

CCP instance    Prefix-   Heuristics           Exact algorithms
Name       P    sum       H1     H2     RB     Existing                   Proposed
                                               DP       MS       NC       DP+       MS+       NC+     EBS     BID

blunt256   16             0.02   0.02   0.03      6.8     4.9    0.36       0.21      0.53    0.10    0.14    0.03
           32             0.05   0.05   0.05     14.1     7.7    0.77       0.76      0.76    0.21    0.27    0.24
           64    1.95     0.11   0.12   0.09     28.6    13.4    1.39       3.86      8.72    0.64    0.71    0.78
          128             0.24   0.25   0.20     58.1    20.6    3.66      12.37     29.95    1.78    1.84    2.33
          256             0.47   0.50   0.37    113.9    29.6   52.53      42.89      0.78   13.96    3.11   56.69
blunt512   16             0.03   0.03   0.03     35.6    20.0    0.55       0.25      0.91    0.14    0.16    0.09
           32             0.07   0.07   0.07     79.2    35.3    1.31       1.23      2.13    0.44    0.43    0.22
           64   13.45     0.18   0.19   0.14    168.8    59.3    2.65       3.68      9.28    0.94    1.10    0.45
          128             0.39   0.40   0.28    346.9    97.9    5.84      15.01     46.72    2.29    2.63    1.65
          256             0.74   0.75   0.54    704.0   163.7    9.70      57.50    164.49    5.59    5.95   10.72
blunt1024  16             0.03   0.03   0.03    145.5    78.0    0.75       0.93      4.77    0.25    0.28    0.19
           32             0.12   0.11   0.08    325.1   143.2    1.81       2.02      8.92    0.60    0.62    0.31
           64   59.12     0.27   0.27   0.19    697.6   235.3    3.50       7.35     22.28    1.27    1.37    1.39
          128             0.50   0.51   0.37   1433.7   391.1    9.14      33.01     69.38    3.36    3.56   16.21
          256             0.97   1.00   0.68   2941.7   656.7   16.59     180.09    538.10    8.48    8.98   16.29
post256    16             0.02   0.03   0.03      7.9     6.1    0.46       0.18      0.13    0.10    0.11    0.24
           32             0.05   0.05   0.05     15.7    10.0    0.77       1.05      1.74    0.26    0.31    0.76
           64    2.16     0.10   0.11   0.10     33.6    16.7    1.37       5.26      9.72    0.61    0.70    4.67
          128             0.25   0.26   0.18     67.2    27.8    2.90      22.94     55.10    1.65    1.77   23.96
          256             0.47   0.48   0.37    137.1    33.8   45.16      84.78      0.77   13.04    3.62  299.32
post512    16             0.02   0.02   0.03     61.2    45.4    0.63       0.20      0.44    0.14    0.16    1.21
           32             0.07   0.07   0.07    125.4    69.9    1.30       2.49      1.95    0.38    0.42    3.36
           64   20.55     0.18   0.18   0.13    255.9   113.9    2.58      16.93     38.73    0.97    1.07    8.53
          128             0.38   0.39   0.28    522.1   193.6    5.84      68.25    156.15    2.27    2.44   31.54
          256             0.70   0.73   0.53   1068.3   335.4   12.67     256.48    726.54    5.44    5.39  140.88
post1024   16             0.03   0.04   0.03    244.6   186.3    0.91       2.88      5.73    0.17    0.21    2.63
           32             0.09   0.08   0.08    505.6   291.7    1.61      13.57     17.37    0.51    0.58   12.73
           64   69.09     0.25   0.26   0.17   1034.0   462.6    3.56      35.05     33.25    1.35    1.47   32.53
          128             0.51   0.53   0.36   2110.2   783.8    7.90     156.34    546.92    2.95    3.28   76.96
          256             0.95   0.98   0.67   4365.2  1377.0   16.30     764.59   1451.87    7.55    7.02  300.70

Averages normalized with respect to RB times (prefix-sum time not included)
           16             0.83   0.94   1.00   2786.3  1891.9   20.33      25.83     69.50    5.00    5.89   24.39
           32             1.10   1.06   1.00   2316.9  1215.6   18.47      47.37     72.82    5.83    6.46   39.02
           64             1.30   1.35   1.00   2263.6   929.1   17.88      82.81    145.19    7.00    7.81   53.81
          128             1.35   1.39   1.00   2250.6   755.3   20.46     168.36    481.19    8.60    9.31   86.81
          256             1.35   1.39   1.00   2473.1   688.0   59.10     390.25    772.99   19.55   10.51  286.77

Averages normalized with respect to RB times (prefix-sum time included)
           16             1.00   1.00   1.00      3.2     2.3    1.08       1.04      1.09    1.01    1.02    1.03
           32             1.00   1.00   1.00      6.6     3.6    1.15       1.21      1.29    1.04    1.05    1.13
           64             1.00   1.00   1.00     13.5     5.8    1.27       1.97      2.90    1.11    1.12    1.55
          128             1.01   1.01   1.00     27.5     9.4    1.62       4.75     10.53    1.28    1.30    3.25
          256             1.02   1.02   1.00     52.8    14.2    7.98      14.64     13.71    2.95    1.55   26.73


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.
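As an illustration of the probe operation that the parametric-search algorithms discussed above repeatedly invoke, the following is a minimal Python sketch (names and structure are ours, not the authors' released C implementation): given a candidate bottleneck value b, each of the P parts is greedily stretched as far as the prefix-sum array allows, and b is feasible exactly when the parts cover all N tasks.

```python
import bisect

# Hypothetical sketch of a parametric-search probe.  Given prefix sums of
# the task weights and a candidate bottleneck value b, greedily extend each
# of the P parts as far as load b allows; b is feasible iff all N tasks are
# covered.  Binary search on the prefix-sum array makes one probe O(P log N).
def probe(prefix, P, b):
    # prefix[j] = total weight of the first j tasks, with prefix[0] = 0
    N = len(prefix) - 1
    j = 0
    for _ in range(P):
        # furthest index j' with part load prefix[j'] - prefix[j] <= b
        j = bisect.bisect_right(prefix, prefix[j] + b) - 1
        if j == N:
            return True
    return False
```

For the small chain with weights (3, 1, 2, 2) and P = 2, probe([0, 3, 4, 6, 8], 2, 4) succeeds while probe([0, 3, 4, 6, 8], 2, 3) fails, consistent with an optimal bottleneck value of 4 for that instance.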


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a pre-processing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for the separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that the exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.
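The bisection strategy summarized above can be sketched as follows. This is a hedged Python illustration of plain ε-approximate bisection over bottleneck values (function and variable names are ours); it deliberately omits the rounding of the lower and upper bounds to realizable values that makes the exact variants exact, so it only guarantees a bound within ε of the optimum.

```python
import bisect

# Hedged sketch of epsilon-approximate bisection over bottleneck values
# (plain interval bisection; the exact variants additionally round the
# bounds to realizable bottleneck values after each probe).
def bisect_bottleneck(weights, P, eps):
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    def feasible(b):
        # greedy probe: can P parts of load <= b cover all tasks?
        j, N = 0, len(weights)
        for _ in range(P):
            j = bisect.bisect_right(prefix, prefix[j] + b) - 1
            if j == N:
                return True
        return False

    lb = max(max(weights), prefix[-1] / P)  # standard lower bound
    ub = prefix[-1]                         # trivial upper bound: Wtot
    while ub - lb > eps:
        mid = (lb + ub) / 2.0
        if feasible(mid):
            ub = mid
        else:
            lb = mid
    return ub                               # within eps of the optimum
```

For example, bisect_bottleneck([3, 1, 2, 2], 2, 1.0) returns a feasible bottleneck bound within 1.0 of the optimal value of 4 for that instance.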

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.
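For readers approaching the released code cold, the textbook CCP dynamic program that the proposed DP+ accelerates can be sketched as follows (a hypothetical Python illustration of the classic unbounded recurrence, not the released C code or the bounded DP+):

```python
# Sketch of the classic CCP dynamic program: with W the prefix-sum array of
# the task weights, B[p][j] is the best bottleneck for splitting the first
# j tasks into p parts, computed by the recurrence
#   B[p][j] = min over i <= j of max(B[p-1][i], W[j] - W[i]).
# Separator-index bounding (the DP+ idea) would restrict the range of i.
def dp_bottleneck(weights, P):
    N = len(weights)
    W = [0]
    for w in weights:
        W.append(W[-1] + w)
    B = W[:]                      # one part: bottleneck = prefix load
    for p in range(2, P + 1):
        # list comprehension reads the old B before rebinding it
        B = [min(max(B[i], W[j] - W[i]) for i in range(j + 1))
             for j in range(N + 1)]
    return B[N]
```

For the chain with weights (3, 1, 2, 2) and P = 2, the recurrence picks the separator after the second task, giving two parts of load 4 each.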

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130-149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641-645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554-1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48-57.
[5] U.V. Catalyurek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673-693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625-I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237-267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141-148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769-771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048-2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123-127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341-361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170-175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040-1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurc, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51-93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sorevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235-249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550-564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23-32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75-84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119-134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295-306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322-1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288-299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865-881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068-1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256-266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78-86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).

ARTICLE IN PRESS

Table 2

Properties of the test set

Sparse-matrix dataset Direct volume rendering (DVR) dataset

Name No of tasks N Workload No of nonzeros ex time Name No of tasks N Workload

Total Wtot Per rowcol (task) SpMxV Total Per task

wavg wmin wmax (ms) Wtot wavg wmin wmax

NL 7039 105 089 1493 1 361 2255 blunt256 17 303 303 K 1754 0020 159097

cre-d 8926 372 266 4171 1 845 7220 blunt512 93 231 314 K 336 0004 66150

CQ9 9278 221 590 2388 1 702 4590 blunt1024 372 824 352 K 094 0001 41104

ken-11 14 694 82 454 561 2 243 1965 post256 19 653 495 K 2519 0077 324550

mod2 34 774 604 910 1740 1 941 12405 post512 134 950 569 K 422 0015 109200

world 34 506 582 064 1687 1 972 11945 post1024 539 994 802 K 149 0004 154678

A Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996 989

Figs 5(a) and (b) respectively Abbreviations endingwith lsquolsquo+rsquorsquo are used to represent our improved versionsof these algorithms That is DP+ eBSthorn NC+ refer toour algorithms given in Figs 6 8 and 10 respectivelyand MS+ refers to the algorithm described inSection 431 BID refers to our bidding algorithm givenin Fig 7 EBS refers to our exact bisection algorithmgiven in Fig 9 Both eBS and eBSthorn algorithms areeffectively used as exact algorithms for the SpM datasetwith e frac14 1 for integer-valued workload arrays in theSpM dataset However these two algorithms were nottested on the DVR dataset since they remain approx-imation algorithms due to the real-valued task weightsin DVRTable 3 compares load-balancing quality of heuristics

and exact algorithms In this table percent load-imbalance values are computed as 100 ethB B13THORN=B13where B denotes the bottleneck value of the respectivepartition and B13 frac14 Wtot=P denotes ideal bottleneckvalue OPT values refer to load-imbalance values ofoptimal partitions produced by exact algorithms Table 3clearly shows that considerably better partitions areobtained in both datasets by using exact algorithmsinstead of heuristics The quality gap between exactalgorithms and heuristics increases with increasingP The only exceptions to these observations are 256-way partitioning of blunt256 and post256 for which theRB heuristic finds optimal solutionsTable 4 compares the performances of the static

separator-index bounding schemes discussed inSection 41 The values displayed in this table are thesums of the sizes of the separator-index ranges normal-ized with respect to N ie

PP1pfrac141 DSp=N where DSp frac14

For each CCP instance, (N−P)P represents the size of the search space for the separator indices and the total number of table entries referenced and computed by the DP algorithm. The columns labeled L1, L3, and C5 display the total range sizes obtained according to Lemma 4.1, Lemma 4.3, and Corollary 4.5, respectively. As seen in Table 4, the proposed practical scheme C5 achieves substantially better separator-index bounds than L1, despite its inferior worst-case behavior (see Corollaries 4.2 and 4.6). Comparison of columns L3 and C5 shows the substantial benefit of performing left-to-right and right-to-left probes with B_RB according to Lemma 4.4. Comparison of the (N−P)P and C5 columns reveals the effectiveness of the proposed separator-index bounding scheme in restricting the search space for separator indices in both the SpM and DVR datasets. As expected, the performance of index bounding decreases with increasing P because of decreasing N/P values. The numbers for the DVR dataset are better than those for the SpM dataset because of its larger N/P values. In Table 4, values less than 1 indicate that the index bounding scheme achieves nonoverlapping index ranges. As seen in the table, scheme C5 reduces the total separator-index range sizes below N for each CCP instance with P ≤ 64 in both the SpM and DVR datasets. These results show that the proposed DP+ algorithm becomes a linear-time algorithm in practice.

The efficiency of the parametric-search algorithms depends on two factors: the number of probes and the cost of each probe. The dynamic index bounding schemes proposed for the parametric-search algorithms reduce the cost of an individual probe. Table 5 illustrates how the proposed parametric-search algorithms reduce the number of probes. To compare the performances of EBS and εBS+ on the DVR dataset, we forced the εBS+ algorithm to find optimal partitions by running it with ε = w_min and then improving the resulting partition using the BID algorithm. Column εBS+ & BID refers to this scheme, and values after "+" denote the number of additional bottleneck values tested by the BID algorithm. As seen in Table 5, exactly one final bottleneck-value test was needed by BID to reach an optimal partition in each CCP instance.

In Table 5, comparison of the NC− and NC columns shows that dynamic bottleneck-value bounding drastically decreases the number of probes in Nicol's algorithm. Comparison of the εBS and εBS+ columns in the SpM dataset shows that using B_RB instead of W_tot

ARTICLE IN PRESS

Table 3

Percent load imbalance values

Sparse-matrix dataset                              DVR dataset

Name     P     H1     H2     RB    OPT             Name       P     H1     H2     RB    OPT

NL       16     2.60   2.44   1.20   0.35          blunt256   16     4.60   1.20   0.49   0.34
         32     5.02   5.75   3.44   0.95                     32     6.93   2.61   1.94   1.12
         64     8.95   9.01   5.60   2.37                     64    14.52   9.44   9.44   2.31
         128   33.13  27.16  22.78   4.99                     128   38.25  24.39  16.67   4.82
         256   69.55  69.55  60.78  14.25                     256   96.72  37.03  34.21  34.21
cre-d    16     2.27   0.98   0.53   0.45          blunt512   16     0.95   0.98   0.98   0.16
         32     4.19   4.42   3.74   1.03                     32     1.38   1.38   1.18   0.33
         64     7.12   4.92   4.34   1.73                     64     2.87   2.69   1.66   0.53
         128   25.57  18.73  16.70   2.88                     128    5.62   8.45   4.62   0.97
         256   37.54  26.81  35.20  10.95                     256   14.18  14.33   9.34   2.28
CQ9      16     1.85   1.85   0.58   0.58          blunt1024  16     0.94   0.57   0.95   0.10
         32     5.65   2.88   2.24   0.90                     32     1.89   1.21   0.97   0.14
         64    13.25  11.49   7.64   1.43                     64     4.99   2.16   1.44   0.26
         128   33.96  32.22  22.34   3.51                     128   10.06   4.25   2.47   0.57
         256   58.62  58.62  58.62  14.72                     256   19.65   9.68   9.68   0.98
ken-11   16     3.74   2.01   0.98   0.21          post256    16     1.10   1.43   0.76   0.56
         32     3.74   4.67   3.74   1.18                     32     3.23   3.98   3.23   1.11
         64    13.17  13.17  13.17   1.29                     64    17.28  11.04  10.90   3.10
         128   13.17  16.89  13.17   6.80                     128   45.35  29.09  29.09   8.29
         256   50.99  50.99  50.99   7.11                     256   67.86  67.86  67.86  67.86
mod2     16     0.06   0.06   0.06   0.03          post512    16     0.49   1.25   0.33   0.33
         32     0.19   0.19   0.19   0.07                     32     0.94   1.61   0.90   0.58
         64     7.42   2.72   2.18   0.18                     64     4.85   5.33   4.85   0.94
         128   16.15   6.29   2.46   0.41                     128   18.03  14.55  10.15   1.72
         256   19.47  19.47  18.92   1.23                     256   30.03  25.29  25.29   3.73
world    16     0.27   0.18   0.09   0.04          post1024   16     0.70   0.70   0.53   0.20
         32     1.03   0.37   0.27   0.08                     32     1.49   1.49   1.41   0.54
         64     4.73   4.73   4.73   0.28                     64     2.85   2.85   1.49   0.91
         128    6.37   6.37   6.37   0.76                     128   13.15  11.79   9.10   1.11
         256   27.99  27.41  27.41   1.11                     256   40.50  13.37  14.19   2.54

Averages over P
         16     1.76   1.15   0.74   0.36                     16     1.44   1.01   1.04   0.28
         32     3.99   3.39   2.88   0.76                     32     2.64   2.05   1.61   0.64
         64     9.01   6.78   5.69   1.43                     64     7.89   5.58   4.96   1.31
         128   19.97  17.66  14.09   3.36                     128   21.74  15.42  12.02   2.91
         256   43.89  38.91  36.04   9.18                     256   44.82  27.93  26.76  18.60

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996
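The percent load imbalance reported in Table 3 is the usual metric 100 · (max_p load_p / (W_tot/P) − 1), i.e., how far the bottleneck part load exceeds the perfectly balanced load. A minimal sketch of the computation (the function name and array layout are ours, for illustration only):

```c
#include <assert.h>

/* Percent load imbalance of a P-way partition:
   100 * (max_load / avg_load - 1), where load[p] is the total
   task weight assigned to processor p. */
double percent_imbalance(const double *load, int P)
{
    double max = load[0], tot = 0.0;
    for (int p = 0; p < P; p++) {
        tot += load[p];
        if (load[p] > max)
            max = load[p];
    }
    return 100.0 * (max * P / tot - 1.0);
}
```

For example, part loads {4, 2, 4, 2} give a bottleneck of 4 against an average of 3, i.e., a 33.3% imbalance.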

for the upper bound on the bottleneck values considerably reduces the number of probes. Comparison of the εBS+ and EBS columns reveals two different behaviors on the SpM and DVR datasets. The discretized dynamic bottleneck-value bounding used in EBS produces only minor improvement on the SpM dataset because of the already discrete nature of its integer-valued workload arrays. However, the effect of discretized dynamic bottleneck-value bounding is significant on the real-valued workload arrays of the DVR dataset.

Tables 6 and 7 display the execution times of the CCP algorithms on the SpM and DVR datasets, respectively. In Table 6, the execution times are given as percents of single SpMxV times. For the DVR dataset, actual execution times (in msecs) are split into prefix-sum times and partitioning times. In Tables 6 and 7, the execution times of the existing algorithms and their improved versions are listed in the same order under each respective classification so that the improvements can be seen easily. In both tables, BID is listed separately since it is a new algorithm. Results of both εBS and εBS+ are listed in Table 6 since they are used as exact algorithms on the SpM dataset with ε = 1. Since neither εBS nor εBS+ can be used as an exact algorithm on the DVR dataset, EBS as


Table 4

Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset                                        DVR dataset

Name     P     (N-P)P    L1      L3      C5                  Name       P     (N-P)P    L1      L3      C5

NL       16     15.97    0.32    0.13    0.036               blunt256   16     15.99    0.38    0.07    0.003
         32     31.86    1.23    0.58    0.198                          32     31.94    1.22    0.30    0.144
         64     63.43    4.97    2.24    0.964                          64     63.77    5.10    3.81    0.791
         128   125.69   19.81   19.60    4.305                          128   127.06   21.66   12.88    2.784
         256   246.73   77.96   82.83   22.942                          256   252.23  101.68   50.48   10.098
cre-d    16     15.97    0.16    0.04    0.019               blunt512   16     16.00    0.09    0.17    0.006
         32     31.89    0.76    0.90    0.137                          32     31.99    0.35    0.26    0.043
         64     63.55    2.89    1.85    0.610                          64     63.96    1.64    0.69    0.144
         128   126.18   11.59   15.12    2.833                          128   127.83    6.65    4.59    0.571
         256   248.69   46.66   57.54   16.510                          256   255.30   28.05   16.71    2.262
CQ9      16     15.97    0.30    0.03    0.023               blunt1024  16     16.00    0.05    0.09    0.010
         32     31.89    1.21    0.34    0.187                          32     32.00    0.18    0.18    0.018
         64     63.57    4.84    2.96    0.737                          64     63.99    0.77    0.74    0.068
         128   126.25   19.20   18.66    3.121                          128   127.96    3.43    2.53    0.300
         256   248.96   74.84   86.80   23.432                          256   255.82   13.92   20.36    1.648
ken-11   16     15.98    0.28    0.11    0.024               post256    16     15.99    0.35    0.04    0.021
         32     31.93    1.10    1.13    0.262                          32     31.95    1.50    0.52    0.162
         64     63.73    4.42    7.34    0.655                          64     63.79    6.12    4.26    0.932
         128   126.89   17.60   14.56    4.865                          128   127.17   26.48   23.68    4.279
         256   251.56   69.18   85.44   13.076                          256   252.68  124.76   90.48   16.485
mod2     16     15.99    0.16    0.00    0.002               post512    16     16.00    0.13    0.02    0.003
         32     31.97    0.55    0.04    0.013                          32     31.99    0.56    0.14    0.063
         64     63.88    2.14    1.22    0.109                          64     63.97    2.33    2.41    0.413
         128   127.53    8.62    2.59    0.411                          128   127.88    9.33   10.15    1.714
         256   254.12   34.49   39.02    2.938                          256   255.52   37.08   46.28    6.515
world    16     15.99    0.15    0.01    0.004               post1024   16     16.00    0.17    0.07    0.020
         32     31.97    0.59    0.06    0.020                          32     32.00    0.54    0.27    0.087
         64     63.88    2.30    2.81    0.158                          64     63.99    2.19    0.61    0.210
         128   127.53    9.23    7.19    0.850                          128   127.97    8.77    9.48    0.918
         256   254.11   36.97   53.15    2.635                          256   255.88   34.97   27.28    4.479

Averages over P
         16     15.97    0.21    0.06    0.018                          16     16.00    0.20    0.08    0.011
         32     31.88    0.86    0.68    0.136                          32     31.98    0.73    0.28    0.096
         64     63.52    3.43    2.63    0.516                          64     63.91    3.03    2.09    0.426
         128   126.07   13.68   12.74    2.731                          128   127.65   12.72   10.55    1.428
         256   248.26   54.28   57.55   14.089                          256   254.57   56.74   41.93    6.915

L1, L3, and C5 denote the separator-index ranges obtained according to Lemmas 4.1 and 4.3 and Corollary 4.5, respectively.
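Every probe counted in the experiments answers the same question: can the chain be cut into at most P consecutive parts, each of load at most B? On a prefix-summed workload array this is decided greedily, pushing each separator as far right as binary search allows. A sketch under our naming (the paper's implementations additionally confine each binary search to the bounded separator-index ranges, which is where schemes such as C5 pay off):

```c
#include <assert.h>

/* W[0..N] is the prefix-summed workload array with W[0] = 0.
   Returns 1 iff the N tasks can be split into at most P consecutive
   parts whose loads are all <= B. */
int probe(const double *W, int N, int P, double B)
{
    int s = 0;                          /* current separator index */
    for (int p = 0; p < P && s < N; p++) {
        double target = W[s] + B;       /* largest admissible prefix sum */
        int lo = s, hi = N;             /* find max index with W[index] <= target */
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (W[mid] <= target)
                lo = mid;
            else
                hi = mid - 1;
        }
        if (lo == s)
            return 0;                   /* a single task already exceeds B */
        s = lo;
    }
    return s == N;                      /* all tasks placed? */
}
```

Each call costs O(P log N); the optimal bottleneck value is the smallest B for which the probe succeeds.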


an exact algorithm for general workload arrays is listed separately in Table 7.

As seen in Tables 6 and 7, the RB heuristic is faster than both H1 and H2 for both the SpM and DVR datasets. As also seen in Table 3, RB finds better partitions than both H1 and H2 for both datasets. These results reveal RB's superiority to H1 and H2.

In Tables 6 and 7, relative performance comparison of the existing exact CCP algorithms shows that NC is two orders of magnitude faster than both DP and MS for both the SpM and DVR datasets, and εBS is considerably faster than NC for the SpM dataset. These results show that, among the existing algorithms, the parametric-search approach leads to faster algorithms than both the dynamic-programming and the iterative-improvement approaches.

Tables 6 and 7 show that our improved algorithms are significantly faster than the respective existing algorithms. In the dynamic-programming approach, DP+ is two-to-three orders of magnitude faster than DP, so that


Table 5

Number of probes performed by parametric search algorithms

Sparse-matrix dataset                                DVR dataset

Name     P     NC-   NC   NC+   εBS  εBS+  EBS       Name       P     NC-   NC   NC+   εBS+ & BID   EBS

NL       16    177   21    7    17    6    7         blunt256   16    202   18    6    16 + 1        9
         32    352   19    7    17    7    7                    32    407   19    7    19 + 1       10
         64    720   27    5    16    7    6                    64    835   18   10    21 + 1       12
         128  1450   40   14    17    8    8                    128  1686   25   15    22 + 1       13
         256  2824   51   14    16    8    8                    256  3556  203   62    23 + 1       11
cre-d    16    190   20    6    19    7    5         blunt512   16    245   20    8    17 + 1        9
         32    393   25   11    19    9    8                    32    502   24   13    18 + 1       10
         64    790   24   10    19    8    6                    64   1003   23   11    18 + 1       12
         128  1579   27   10    19    9    9                    128  2023   26   12    20 + 1       13
         256  3183   37   13    18    9    9                    256  4051   23   12    21 + 1       13
CQ9      16    182   19    3    18    6    5         blunt1024  16    271   21   10    16 + 1       13
         32    373   22    8    18    8    7                    32    557   24   12    18 + 1       12
         64    740   29   12    18    8    8                    64   1127   21   11    18 + 1       12
         128  1492   40   12    17    8    8                    128  2276   28   13    19 + 1       13
         256  2971   50   14    18    9    9                    256  4564   27   16    21 + 1       15
ken-11   16    185   21    6    17    6    6         post256    16    194   22    9    18 + 1        7
         32    364   20    6    17    6    6                    32    409   19    8    20 + 1       12
         64    721   48    9    17    7    7                    64    823   17   11    22 + 1       12
         128  1402   61    8    16    7    7                    128  1653   19   13    22 + 1       14
         256  2783   96    7    17    8    8                    256  3345  191  102    23 + 1       13
mod2     16    210   20    6    20    5    4         post512    16    235   24    9    16 + 1       10
         32    432   26    8    19    5    5                    32    485   23   12    18 + 1       11
         64    867   24    6    19    8    8                    64    975   22   13    20 + 1       14
         128  1727   38    7    20    7    7                    128  1975   25   17    21 + 1       14
         256  3444   47    9    20    9    9                    256  3947   29   20    22 + 1       15
world    16    211   22    6    19    5    4         post1024   16    261   27   10    17 + 1       13
         32    424   26    5    19    6    6                    32    538   23   13    19 + 1       14
         64    865   30   12    19    9    9                    64   1090   22   12    17 + 1       14
         128  1730   44   10    19    8    8                    128  2201   25   15    21 + 1       16
         256  3441   44   11    19   10   10                    256  4428   28   16    22 + 1       16

Averages over P
         16    193  20.5  5.7  18.5  5.8  5.2                   16    235  22.0  8.7   16.7 + 1     10.2
         32    390  23.0  7.5  18.3  6.8  6.5                   32    483  22.0 10.8   18.7 + 1     11.5
         64    784  30.3  9.0  18.0  7.8  7.3                   64    976  20.5 11.3   19.3 + 1     12.7
         128  1564  41.7 10.2  18.0  7.8  7.8                   128  1969  24.7 14.2   20.8 + 1     13.8
         256  3108  54.2 11.3  18.0  8.8  8.8                   256  3982  83.5 38.0   22.0 + 1     13.8

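The probe counts above come from bisection-style drivers over the candidate bottleneck value: each probe halves an interval [LB, UB], and dynamic bottleneck-value bounding shrinks that interval further after every probe. A bare-bones driver with an illustrative feasibility test (names are ours; the paper's exact variants additionally move both bounds to realizable part loads after each probe, which is what makes EBS exact on real-valued workloads):

```c
#include <assert.h>
#include <math.h>

/* Bisection on the bottleneck value B in [lb, ub]: feasible(B) must be
   monotone (infeasible below some threshold, feasible at and above it).
   Returns the best feasible B found once the interval is narrower
   than eps. */
double bisect_bottleneck(double lb, double ub, double eps,
                         int (*feasible)(double))
{
    double best = ub;
    while (ub - lb > eps) {
        double mid = (lb + ub) / 2.0;
        if (feasible(mid)) {
            best = mid;                 /* tighten the upper bound */
            ub = mid;
        } else {
            lb = mid;                   /* tighten the lower bound */
        }
    }
    return best;
}

/* Illustrative monotone test: feasible iff B >= 5. */
static int demo_feasible(double B) { return B >= 5.0; }
```

In the CCP setting, feasible would be the greedy probe, lb a lower bound such as W_tot/P, and ub an upper bound such as B_RB.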

DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 630 times faster than DP on average in 16-way partitioning, and this ratio decreases to 378, 189, 106, and 56 with increasing numbers of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 1277, 578, 332, 159, and 71 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of separator-index bounding seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6

Partitioning times for sparse-matrix dataset as percents of SpMxV times

CCP instance    Heuristics           Exact algorithms
                                     Existing                          Proposed

Name     P     H1    H2    RB       DP    MS    εBS     NC       DP+     MS+     εBS+    NC+    BID

NL       16    0.09  0.09  0.09      93   119   0.93    1.20     0.40    0.44    0.40    0.40    0.22
         32    0.18  0.18  0.13     177   194   1.77    2.17     1.73    1.73    0.80    0.80    0.44
         64    0.35  0.35  0.31     367   307   3.15    5.85     5.81    4.12    1.86    1.77    1.82
         128   0.89  0.84  0.71     748   485   6.30   17.12    19.87   26.92    4.26    7.32    4.21
         256   1.51  1.55  1.37    1461   757  11.09   40.31    91.80   96.41    9.80   15.70   21.06
cre-d    16    0.03  0.01  0.01      33    36   0.33    0.37     0.12    0.06    0.10    0.11    0.03
         32    0.04  0.06  0.06      71    61   0.65    0.93     0.47    1.20    0.29    0.33    0.10
         64    0.12  0.12  0.08     147    98   1.22    1.69     1.66    1.83    0.61    0.71    0.21
         128   0.28  0.29  0.14     283   150   2.41    3.59     5.47   10.94    1.68    1.68    0.61
         256   0.53  0.54  0.25     571   221   4.20   10.22    27.05   26.02    3.30    4.46    1.20
CQ9      16    0.04  0.04  0.04      59    73   0.48    0.59     0.20    0.13    0.15    0.13    0.17
         32    0.09  0.09  0.07     127   120   0.94    1.31     0.94    0.72    0.41    0.41    0.28
         64    0.20  0.17  0.15     248   195   1.85    3.18     3.01    4.47    1.00    1.44    0.68
         128   0.44  0.44  0.31     485   303   3.18    8.52     9.65   18.82    2.44    3.05    2.51
         256   0.92  0.92  0.63     982   469   6.51   20.33    69.56   72.20    5.32    7.45   14.92
ken-11   16    0.10  0.10  0.10     238   344   1.27    1.88     0.56    0.61    0.46    0.41    0.25
         32    0.20  0.20  0.15     501   522   2.29    3.10     3.82    4.78    1.22    1.17    1.63
         64    0.46  0.46  0.36     993   778   4.48   13.08     9.41   24.78    3.10    3.46    2.29
         128   1.17  1.17  0.71    1876  1139   9.21   39.59    50.13   50.03    6.41    6.62   17.40
         256   2.14  2.14  1.42    3827  1706  17.76  119.13   129.01  241.48   14.91   14.50   29.11
mod2     16    0.02  0.02  0.01      92   119   0.27    0.34     0.07    0.05    0.06    0.06    0.03
         32    0.04  0.04  0.02     188   192   0.51    0.78     0.19    0.18    0.16    0.15    0.07
         64    0.11  0.10  0.06     378   307   1.04    1.91     0.86    2.18    0.51    0.55    0.23
         128   0.24  0.23  0.11     751   482   2.28    5.92     2.61    3.72    1.06    1.23    0.53
         256   0.48  0.48  0.23    1486   756   4.26   11.14    12.11   50.04    2.91    3.43    3.66
world    16    0.02  0.02  0.02      99   122   0.29    0.33     0.08    0.05    0.07    0.08    0.03
         32    0.04  0.04  0.04     198   196   0.59    0.88     0.26    0.21    0.18    0.19    0.08
         64    0.12  0.11  0.09     385   306   1.16    1.70     1.09    4.84    0.64    0.54    0.35
         128   0.24  0.23  0.19     769   496   2.19    5.32     4.48   10.15    1.31    1.23    1.77
         256   0.47  0.47  0.36    1513   764   4.06   11.83    11.54   69.99    3.46    3.50    2.48

Averages over P
         16    0.05  0.05  0.05     102   136   0.60    0.78     0.24    0.22    0.21    0.20    0.12
         32    0.10  0.10  0.08     210   214   1.12    1.53     1.23    1.47    0.51    0.51    0.43
         64    0.23  0.22  0.18     420   332   2.15    4.57     3.64    7.04    1.29    1.41    0.93
         128   0.54  0.53  0.36     819   509   4.26   13.34    15.37   20.10    2.86    3.52    4.51
         256   1.01  1.01  0.71    1640   779   7.98   35.49    56.84   92.69    6.62    8.17   12.07


SpM dataset, εBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than εBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and εBS+. Furthermore, the costs of the initial probes with very large bottleneck values are very cheap in εBS.

In Table 6, relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small-to-moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both εBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where εBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7

Partitioning times (in msecs) for DVR dataset

CCP instance         Prefix   Heuristics          Exact algorithms
                     sum                          Existing                Proposed

Name      P                   H1    H2    RB      DP     MS      NC       DP+      MS+      NC+     EBS     BID

blunt256  16          1.95    0.02  0.02  0.03      68     49    0.36      0.21     0.53    0.10    0.14    0.03
          32                  0.05  0.05  0.05     141     77    0.77      0.76     0.76    0.21    0.27    0.24
          64                  0.11  0.12  0.09     286    134    1.39      3.86     8.72    0.64    0.71    0.78
          128                 0.24  0.25  0.20     581    206    3.66     12.37    29.95    1.78    1.84    2.33
          256                 0.47  0.50  0.37    1139    296   52.53     42.89     0.78   13.96    3.11   56.69
blunt512  16         13.45    0.03  0.03  0.03     356    200    0.55      0.25     0.91    0.14    0.16    0.09
          32                  0.07  0.07  0.07     792    353    1.31      1.23     2.13    0.44    0.43    0.22
          64                  0.18  0.19  0.14    1688    593    2.65      3.68     9.28    0.94    1.10    0.45
          128                 0.39  0.40  0.28    3469    979    5.84     15.01    46.72    2.29    2.63    1.65
          256                 0.74  0.75  0.54    7040   1637    9.70     57.50   164.49    5.59    5.95   10.72
blunt1024 16         59.12    0.03  0.03  0.03    1455    780    0.75      0.93     4.77    0.25    0.28    0.19
          32                  0.12  0.11  0.08    3251   1432    1.81      2.02     8.92    0.60    0.62    0.31
          64                  0.27  0.27  0.19    6976   2353    3.50      7.35    22.28    1.27    1.37    1.39
          128                 0.50  0.51  0.37   14337   3911    9.14     33.01    69.38    3.36    3.56   16.21
          256                 0.97  1.00  0.68   29417   6567   16.59    180.09   538.10    8.48    8.98   16.29
post256   16          2.16    0.02  0.03  0.03      79     61    0.46      0.18     0.13    0.10    0.11    0.24
          32                  0.05  0.05  0.05     157    100    0.77      1.05     1.74    0.26    0.31    0.76
          64                  0.10  0.11  0.10     336    167    1.37      5.26     9.72    0.61    0.70    4.67
          128                 0.25  0.26  0.18     672    278    2.90     22.94    55.10    1.65    1.77   23.96
          256                 0.47  0.48  0.37    1371    338   45.16     84.78     0.77   13.04    3.62  299.32
post512   16         20.55    0.02  0.02  0.03     612    454    0.63      0.20     0.44    0.14    0.16    1.21
          32                  0.07  0.07  0.07    1254    699    1.30      2.49     1.95    0.38    0.42    3.36
          64                  0.18  0.18  0.13    2559   1139    2.58     16.93    38.73    0.97    1.07    8.53
          128                 0.38  0.39  0.28    5221   1936    5.84     68.25   156.15    2.27    2.44   31.54
          256                 0.70  0.73  0.53   10683   3354   12.67    256.48   726.54    5.44    5.39  140.88
post1024  16         69.09    0.03  0.04  0.03    2446   1863    0.91      2.88     5.73    0.17    0.21    2.63
          32                  0.09  0.08  0.08    5056   2917    1.61     13.57    17.37    0.51    0.58   12.73
          64                  0.25  0.26  0.17   10340   4626    3.56     35.05    33.25    1.35    1.47   32.53
          128                 0.51  0.53  0.36   21102   7838    7.90    156.34   546.92    2.95    3.28   76.96
          256                 0.95  0.98  0.67   43652  13770   16.30    764.59  1451.87    7.55    7.02  300.70

Averages normalized with respect to RB times (prefix-sum time not included)
          16                  0.83  0.94  1.00   27863  18919   20.33     25.83    69.50    5.00    5.89   24.39
          32                  1.10  1.06  1.00   23169  12156   18.47     47.37    72.82    5.83    6.46   39.02
          64                  1.30  1.35  1.00   22636   9291   17.88     82.81   145.19    7.00    7.81   53.81
          128                 1.35  1.39  1.00   22506   7553   20.46    168.36   481.19    8.60    9.31   86.81
          256                 1.35  1.39  1.00   24731   6880   59.10    390.25   772.99   19.55   10.51  286.77

Averages normalized with respect to RB times (prefix-sum time included)
          16                  1.00  1.00  1.00      32     23    1.08      1.04     1.09    1.01    1.02    1.03
          32                  1.00  1.00  1.00      66     36    1.15      1.21     1.29    1.04    1.05    1.13
          64                  1.00  1.00  1.00     135     58    1.27      1.97     2.90    1.11    1.12    1.55
          128                 1.01  1.01  1.00     275     94    1.62      4.75    10.53    1.28    1.30    3.25
          256                 1.02  1.02  1.00     528    142    7.98     14.64    13.71    2.95    1.55   26.73


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.
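The RB heuristic compared throughout both serves as a baseline and supplies the upper bound B_RB exploited by the exact algorithms; it recursively splits the prefix-summed chain near half of the remaining weight. A simplified sketch (names are ours; this version always takes the left of the two candidate split positions, whereas a careful implementation would take the better of the two):

```c
#include <assert.h>

/* Largest index in [lo, hi] whose prefix sum does not exceed half. */
static int find_split(const double *W, int lo, int hi, double half)
{
    int l = lo, h = hi;
    while (l < h) {
        int mid = (l + h + 1) / 2;
        if (W[mid] <= half)
            l = mid;
        else
            h = mid - 1;
    }
    return l;
}

/* Recursive bisection over tasks [lo, hi) of the prefix-summed chain
   W[0..N]; distributes them over P processors and writes the P-1
   separator indices into sep[0..P-2]. */
void rb_partition(const double *W, int lo, int hi, int P, int *sep)
{
    if (P == 1)
        return;
    double half = (W[lo] + W[hi]) / 2.0;
    int s = find_split(W, lo, hi, half);
    if (s == lo)
        s = lo + 1;                     /* keep both sides nonempty */
    sep[P / 2 - 1] = s;                 /* middle separator */
    rb_partition(W, lo, s, P / 2, sep);
    rb_partition(W, s, hi, P - P / 2, sep + P / 2);
}
```

The bottleneck value of the resulting partition is the B_RB that the separator-index bounding schemes take as their initial upper bound.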


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a pre-processing step to find a "good" upper bound on the optimal bottleneck value, and then exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric-search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix-vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume-rendering dataset showed that the exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.
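The dynamic-programming formulation mentioned above is the classical CCP recurrence B_p(j) = min over 0 ≤ i ≤ j of max(B_{p-1}(i), W_j − W_i) on the prefix-summed array W. The deliberately unoptimized O(N²P) sketch below (our code, not the paper's) makes the role of separator-index bounding visible: DP+ evaluates the inner loop only over the bounded ranges, which we leave unrestricted here for clarity:

```c
#include <assert.h>
#include <float.h>

/* Textbook O(N^2 P) CCP dynamic program. W[0..N] holds prefix sums of
   the task weights (W[0] = 0); returns the optimal bottleneck value
   for splitting the N tasks into P consecutive parts. */
double dp_bottleneck(const double *W, int N, int P)
{
    double prev[N + 1], cur[N + 1];     /* C99 variable-length arrays */
    for (int j = 0; j <= N; j++)
        prev[j] = W[j];                 /* one part: load is W[j] */
    for (int p = 2; p <= P; p++) {
        for (int j = 0; j <= N; j++) {
            double best = DBL_MAX;
            for (int i = 0; i <= j; i++) {   /* DP+ bounds this range */
                double last = W[j] - W[i];   /* load of the p-th part */
                double b = prev[i] > last ? prev[i] : last;
                if (b < best)
                    best = b;
            }
            cur[j] = best;
        }
        for (int j = 0; j <= N; j++)
            prev[j] = cur[j];
    }
    return prev[N];
}
```

Restricting the inner loop to the bounded separator-index ranges is exactly what shrinks the (N−P)P table of Table 4 to near-linear work in practice.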

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, in: Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing, the boundary between scientific computing and combinatorial algorithms, and high-performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies, he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing; parallel scientific computing and its combinatorial aspects; parallel computer graphics applications; parallel data mining; graph and hypergraph partitioning; load balancing; neural network algorithms; high-performance information retrieval and GIS systems; parallel and distributed web crawling; distributed databases; and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed a member of IFIP Working Group 10.3 (Concurrent Systems).

ARTICLE IN PRESS

Table 3

Percent load imbalance values

Sparse-matrix dataset DVR dataset

CCP instance Heuristics OPT CCP instance Heuristics OPT

Name P H1 H2 RB Name P H1 H2 RB

NL 16 260 244 120 035 blunt256 16 460 120 049 034

32 502 575 344 095 32 693 261 194 112

64 895 901 560 237 64 1452 944 944 231

128 3313 2716 2278 499 128 3825 2439 1667 482

256 6955 6955 6078 1425 256 9672 3703 3421 3421

cre-d 16 227 098 053 045 blunt512 16 095 098 098 016

32 419 442 374 103 32 138 138 118 033

64 712 492 434 173 64 287 269 166 053

128 2557 1873 1670 288 128 562 845 462 097

256 3754 2681 3520 1095 256 1418 1433 934 228

CQ9 16 185 185 058 058 blunt1024 16 094 057 095 010

32 565 288 224 090 32 189 121 097 014

64 1325 1149 764 143 64 499 216 144 026

128 3396 3222 2234 351 128 1006 425 247 057

256 5862 5862 5862 1472 256 1965 968 968 098

ken-11 16 374 201 098 021 post256 16 110 143 076 056

32 374 467 374 118 32 323 398 323 111

64 1317 1317 1317 129 64 1728 1104 1090 310

128 1317 1689 1317 680 128 4535 2909 2909 829

256 5099 5099 5099 711 256 6786 6786 6786 6786

mod2 16 006 006 006 003 post512 16 049 125 033 033

32 019 019 019 007 32 094 161 090 058

64 742 272 218 018 64 485 533 485 094

128 1615 629 246 041 128 1803 1455 1015 172

256 1947 1947 1892 123 256 3003 2529 2529 373

world 16 027 018 009 004 post1024 16 070 070 053 020

32 103 037 027 008 32 149 149 141 054

64 473 473 473 028 64 285 285 149 091

128 637 637 637 076 128 1315 1179 910 111

256 2799 2741 2741 111 256 4050 1337 1419 254

Averages over P

16 176 115 074 036 16 144 101 104 028

32 399 339 288 076 32 264 205 161 064

64 901 678 569 143 64 789 558 496 131

128 1997 1766 1409 336 128 2174 1542 1202 291

256 4389 3891 3604 918 256 4482 2793 2676 1860

A Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996990

for the upper bound on the bottleneck values consider-ably reduces the number of probes Comparison of theeBSthorn and EBS columns reveals two different behaviorson SpM and DVR datasets The discretized dynamicbottleneck-value bounding used in EBS produces onlyminor improvement on the SpM dataset because of thealready discrete nature of integer-valued workloadarrays However the effect of discretized dynamicbottleneck-value bounding is significant on the real-valued workload arrays of the DVR datasetTables 6 and 7 display the execution times of CCP

algorithms on the SpM and DVR datasets respectively

In Table 6 the execution times are given as percents ofsingle SpMxV times For the DVR dataset actualexecution times (in msecs) are split into prefix-sum timesand partitioning times In Tables 6 and 7 executiontimes of existing algorithms and their improved versionsare listed in the same order under each respectiveclassification so that improvements can be seen easily Inboth tables BID is listed separately since it is a newalgorithm Results of both eBS and eBSthorn are listed inTable 6 since they are used as exact algorithms on SpMdataset with e frac14 1 Since neither eBS nor eBSthorn can beused as an exact algorithm on the DVR dataset EBS as

ARTICLE IN PRESS

Table 4

Sizes of separator-index ranges normalized with respect to N

Sparse-matrix dataset DVR dataset

CCP instance ethN PTHORNP L1 L3 C5 CCP instance ethN PTHORNP L1 L3 C5

Name P Name P

NL 16 1597 032 013 0036 blunt256 16 1599 038 007 0003

32 3186 123 058 0198 32 3194 122 030 0144

64 6343 497 224 0964 64 6377 510 381 0791

128 12569 1981 1960 4305 128 12706 2166 1288 2784

256 24673 7796 8283 22942 256 25223 10168 5048 10098

cre-d 16 1597 016 004 0019 blunt512 16 1600 009 017 0006

32 3189 076 090 0137 32 3199 035 026 0043

64 6355 289 185 0610 64 6396 164 069 0144

128 12618 1159 1512 2833 128 12783 665 459 0571

256 24869 4666 5754 16510 256 25530 2805 1671 2262

CQ9 16 1597 030 003 0023 blunt1024 16 1600 005 009 0010

32 3189 121 034 0187 32 3200 018 018 0018

64 6357 484 296 0737 64 6399 077 074 0068

128 12625 1920 1866 3121 128 12796 343 253 0300

256 24896 7484 8680 23432 256 25582 1392 2036 1648

ken-11 16 1598 028 011 0024 post256 16 1599 035 004 0021

32 3193 110 113 0262 32 3195 150 052 0162

64 6373 442 734 0655 64 6379 612 426 0932

128 12689 1760 1456 4865 128 12717 2648 2368 4279

256 25156 6918 8544 13076 256 25268 12476 9048 16485

mod2 16 1599 016 000 0002 post512 16 1600 013 002 0003

32 3197 055 004 0013 32 3199 056 014 0063

64 6388 214 122 0109 64 6397 233 241 0413

128 12753 862 259 0411 128 12788 933 1015 1714

256 25412 3449 3902 2938 256 25552 3708 4628 6515

world 16 1599 015 001 0004 post1024 16 1600 017 007 0020

32 3197 059 006 0020 32 3200 054 027 0087

64 6388 230 281 0158 64 6399 219 061 0210

128 12753 923 719 0850 128 12797 877 948 0918

256 25411 3697 5315 2635 256 25588 3497 2728 4479

Averages over K

16 1597 021 006 0018 16 1600 020 008 0011

32 3188 086 068 0136 32 3198 073 028 0096

64 6352 343 263 0516 64 6391 303 209 0426

128 12607 1368 1274 2731 128 12765 1272 1055 1428

256 24826 5428 5755 14089 256 25457 5674 4193 6915

L1 L3 and C5 denote separator-index ranges obtained according to results of Lemmas 41 43 and Corollary 45

A Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996 991

an exact algorithm for general workload arrays is listedseparately in Table 7As seen in Tables 6 and 7 the RB heuristic is faster

than both H1 and H2 for both SpM and DVR datasetsAs also seen in Table 3 RB finds better partitions thanboth H1 and H2 for both datasets These results revealRBrsquos superiority to H1 and H2In Tables 6 and 7 relative performance comparison of

existing exact CCP algorithms shows that NC is twoorders of magnitude faster than both DP and MS for

both SpM and DVR datasets and eBS is considerablyfaster than NC for the SpM dataset These results showthat among existing algorithms the parametric-searchapproach leads to faster algorithms than both thedynamic-programming and the iterative-improvementapproachesTables 6 and 7 show that our improved algorithms are

significantly faster than the respective existing algo-rithms In the dynamic-programming approach DP+ istwo-to-three orders of magnitude faster than DP so that

ARTICLE IN PRESS

Table 5

Number of probes performed by parametric search algorithms

Sparse-matrix dataset DVR dataset

CCP instance Parametric search algorithms CCP instance Parametric search algorithms

Name P NC NC NC+ eBS eBS+ EBS Name P NC NC NC+ eBS+ampBID EBS

NL 16 177 21 7 17 6 7 blunt256 16 202 18 6 16 + 1 9

32 352 19 7 17 7 7 32 407 19 7 19 + 1 10

64 720 27 5 16 7 6 64 835 18 10 21 + 1 12

128 1450 40 14 17 8 8 128 1686 25 15 22 + 1 13

256 2824 51 14 16 8 8 256 3556 203 62 23 + 1 11

cre-d 16 190 20 6 19 7 5 blunt512 16 245 20 8 17 + 1 9

32 393 25 11 19 9 8 32 502 24 13 18 + 1 10

64 790 24 10 19 8 6 64 1003 23 11 18 + 1 12

128 1579 27 10 19 9 9 128 2023 26 12 20 + 1 13

256 3183 37 13 18 9 9 256 4051 23 12 21 + 1 13

CQ9 16 182 19 3 18 6 5 blunt1024 16 271 21 10 16 + 1 13

32 373 22 8 18 8 7 32 557 24 12 18 + 1 12

64 740 29 12 18 8 8 64 1127 21 11 18 + 1 12

128 1492 40 12 17 8 8 128 2276 28 13 19 + 1 13

256 2971 50 14 18 9 9 256 4564 27 16 21 + 1 15

ken-11 16 185 21 6 17 6 6 post256 16 194 22 9 18 + 1 7

32 364 20 6 17 6 6 32 409 19 8 20 + 1 12

64 721 48 9 17 7 7 64 823 17 11 22 + 1 12

128 1402 61 8 16 7 7 128 1653 19 13 22 + 1 14

256 2783 96 7 17 8 8 256 3345 191 102 23 + 1 13

mod2 16 210 20 6 20 5 4 post512 16 235 24 9 16 + 1 10

32 432 26 8 19 5 5 32 485 23 12 18 + 1 11

64 867 24 6 19 8 8 64 975 22 13 20 + 1 14

128 1727 38 7 20 7 7 128 1975 25 17 21 + 1 14

256 3444 47 9 20 9 9 256 3947 29 20 22 + 1 15

world 16 211 22 6 19 5 4 post1024 16 261 27 10 17 + 1 13

32 424 26 5 19 6 6 32 538 23 13 19 + 1 14

64 865 30 12 19 9 9 64 1090 22 12 17 + 1 14

128 1730 44 10 19 8 8 128 2201 25 15 21 + 1 16

256 3441 44 11 19 10 10 256 4428 28 16 22 + 1 16

Averages over P

16 193 205 57 185 58 52 16 235 220 87 167+1 102

32 390 230 75 183 68 65 32 483 220 108 187+1 115

64 784 303 90 180 78 73 64 976 205 113 193+1 127

128 1564 417 102 180 78 78 128 1969 247 142 208+1 138

256 3108 542 113 180 88 88 256 3982 835 380 220+1 138


DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 63.0 times faster than DP on average in 16-way partitioning, and this ratio decreases to 37.8, 18.9, 10.6, and 5.6 with increasing number of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 127.7, 57.8, 33.2, 15.9, and 7.1 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of the separator-index bounding values seen in Table 4.

In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric-search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively. For the


Table 6

Partitioning times for the sparse-matrix dataset as percentages of SpMxV times

CCP instance Heuristics Exact algorithms

Name P H1 H2 RB Existing Proposed

DP MS eBS NC DP+ MS+ eBS+ NC+ BID

NL 16 009 009 009 93 119 093 120 040 044 040 040 022

32 018 018 013 177 194 177 217 173 173 080 080 044

64 035 035 031 367 307 315 585 581 412 186 177 182

128 089 084 071 748 485 630 1712 1987 2692 426 732 421

256 151 155 137 1461 757 1109 4031 9180 9641 980 1570 2106

cre-d 16 003 001 001 33 36 033 037 012 006 010 011 003

32 004 006 006 71 61 065 093 047 120 029 033 010

64 012 012 008 147 98 122 169 166 183 061 071 021

128 028 029 014 283 150 241 359 547 1094 168 168 061

256 053 054 025 571 221 420 1022 2705 2602 330 446 120

CQ9 16 004 004 004 59 73 048 059 020 013 015 013 017

32 009 009 007 127 120 094 131 094 072 041 041 028

64 020 017 015 248 195 185 318 301 447 100 144 068

128 044 044 031 485 303 318 852 965 1882 244 305 251

256 092 092 063 982 469 651 2033 6956 722 532 745 1492

ken-11 16 010 010 010 238 344 127 188 056 061 046 041 025

32 020 020 015 501 522 229 310 382 478 122 117 163

64 046 046 036 993 778 448 1308 941 2478 310 346 229

128 117 117 071 1876 1139 921 3959 5013 5003 641 662 174

256 214 214 142 3827 1706 1776 11913 12901 24148 1491 1450 2911

mod2 16 002 002 001 92 119 027 034 007 005 006 006 003

32 004 004 002 188 192 051 078 019 018 016 015 007

64 011 010 006 378 307 104 191 086 218 051 055 023

128 024 023 011 751 482 228 592 261 372 106 123 053

256 048 048 023 1486 756 426 1114 1211 5004 291 343 366

world 16 002 002 002 99 122 029 033 008 005 007 008 003

32 004 004 004 198 196 059 088 026 021 018 019 008

64 012 011 009 385 306 116 170 109 484 064 054 035

128 024 023 019 769 496 219 532 448 1015 131 123 177

256 047 047 036 1513 764 406 1183 1154 6999 346 350 248

Averages over P

16 005 005 005 102 136 060 078 024 022 021 020 012

32 010 010 008 210 214 112 153 123 147 051 051 043

64 023 022 018 420 332 215 457 364 704 129 141 093

128 054 053 036 819 509 426 1334 1537 201 286 352 451

256 101 101 071 1640 779 798 3549 5684 9269 662 817 1207


SpM dataset, eBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than eBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric-search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and eBS+. Furthermore, initial probes with very large bottleneck values are very cheap in eBS.

In Table 6, a relative performance comparison of the proposed exact CCP algorithms shows that BID is

the clear winner for small-to-moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both eBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where eBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitioning of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time


Table 7

Partitioning times (in msecs) for DVR dataset

CCP instance Prefix Heuristics Exact algorithms

Name P sum H1 H2 RB Existing Proposed

DP MS NC DP+ MS+ NC+ EBS BID

blunt256 16 002 002 003 68 49 036 021 053 010 014 003

32 005 005 005 141 77 077 076 076 021 027 024

64 195 011 012 009 286 134 139 386 872 064 071 078

128 024 025 020 581 206 366 1237 2995 178 184 233

256 047 050 037 1139 296 5253 4289 078 1396 311 5669

blunt512 16 003 003 003 356 200 055 025 091 014 016 009

32 007 007 007 792 353 131 123 213 044 043 022

64 1345 018 019 014 1688 593 265 368 928 094 110 045

128 039 040 028 3469 979 584 1501 4672 229 263 165

256 074 075 054 7040 1637 970 5750 16449 559 595 1072

blunt1024 16 003 003 003 1455 780 075 093 477 025 028 019

32 012 011 008 3251 1432 181 202 892 060 062 031

64 5912 027 027 019 6976 2353 350 735 2228 127 137 139

128 050 051 037 14 337 3911 914 3301 6938 336 356 1621

256 097 100 068 29 417 6567 1659 18009 53810 848 898 1629

post256 16 002 003 003 79 61 046 018 013 010 011 024

32 005 005 005 157 100 077 105 174 026 031 076

64 216 010 011 010 336 167 137 526 972 061 070 467

128 025 026 018 672 278 290 2294 5510 165 177 2396

256 047 048 037 1371 338 4516 8478 077 1304 362 29932

post512 16 002 002 003 612 454 063 020 044 014 016 121

32 007 007 007 1254 699 130 249 195 038 042 336

64 2055 018 018 013 2559 1139 258 1693 3873 097 107 853

128 038 039 028 5221 1936 584 6825 15615 227 244 3154

256 070 073 053 10 683 3354 1267 25648 72654 544 539 14088

post1024 16 003 004 003 2446 1863 091 288 573 017 021 263

32 009 008 008 5056 2917 161 1357 1737 051 058 1273

64 6909 025 026 017 10 340 4626 356 3505 3325 135 147 3253

128 051 053 036 21 102 7838 790 15634 54692 295 328 7696

256 095 098 067 43 652 13770 1630 76459 145187 755 702 30070

Averages normalized with respect to RB times (prefix-sum time not included)

16 083 094 100 27 863 18919 2033 2583 6950 500 589 2439

32 110 106 100 23 169 12156 1847 4737 7282 583 646 3902

64 130 135 100 22 636 9291 1788 8281 14519 700 781 5381

128 135 139 100 22 506 7553 2046 16836 48119 860 931 8681

256 135 139 100 24 731 6880 5910 39025 77299 1955 1051 28677

Averages normalized with respect to RB times (prefix-sum time included)

16 100 100 100 32 23 108 104 109 101 102 103

32 100 100 100 66 36 115 121 129 104 105 113

64 100 100 100 135 58 127 197 290 111 112 155

128 101 101 100 275 94 162 475 1053 128 130 325

256 102 102 100 528 142 798 1464 1371 295 155 2673


for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.
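The parametric-search algorithms compared above all rest on a feasibility probe over the prefix-summed workload array. The sketch below, in Python purely for illustration (the paper's reference implementation is in C, and all names here are hypothetical), shows a minimal exact bisection assuming integer task weights, so that the optimal bottleneck is itself an integer; the paper's eBS/EBS algorithms instead tighten real-valued lower and upper bounds to realizable subchain weights, but the probe structure is the same.

```python
from bisect import bisect_right

def probe(prefix, P, B):
    """Can the chain be cut into P consecutive parts, each of load <= B?
    Greedily pushes every separator as far right as possible via binary
    search on the prefix sums, so one probe costs O(P log N)."""
    N = len(prefix) - 1
    pos = 0  # number of tasks assigned so far
    for _ in range(P - 1):
        # furthest index j with prefix[j] - prefix[pos] <= B
        pos = bisect_right(prefix, prefix[pos] + B) - 1
        if pos == N:          # all tasks placed before running out of parts
            return True
    return prefix[N] - prefix[pos] <= B  # load of the last processor

def optimal_bottleneck(weights, P):
    """Exact bisection over candidate bottleneck values (integer weights
    assumed, so the search can close the gap to an exact optimum)."""
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    # trivial lower bounds: heaviest single task, ideal (average) load
    lo = max(max(weights), -(-prefix[-1] // P))
    hi = prefix[-1]  # upper bound: one processor takes everything
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(prefix, P, mid):
            hi = mid          # feasible: tighten the upper bound
        else:
            lo = mid + 1      # infeasible: raise the lower bound
    return lo
```

For example, `optimal_bottleneck([2, 3, 1, 4], 2)` returns 5, splitting the chain into [2, 3] and [1, 4]. Running a heuristic such as RB first, as the paper proposes, would shrink the initial [lo, hi] interval and hence the number of probes.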


7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a pre-processing step to find a ''good'' upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for the parametric search algorithms to narrow the separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds into realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix–vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse-matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.
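The static separator-index bounding described above prunes, for each number of parts, the range of separator indices the dynamic program must examine. As a baseline for that idea, here is a minimal unbounded O(N²P) DP for the underlying recurrence, sketched in Python for illustration only (the paper's code is in C, and its DP+ algorithm additionally restricts the inner minimization to the bounded index ranges):

```python
def dp_bottleneck(weights, P):
    """Textbook DP for chains-on-chains partitioning:
        B_p(j) = min over i <= j of max(B_{p-1}(i), W(i+1..j)),
    where W(i+1..j) is the weight of tasks i+1..j read off prefix sums.
    best[j] holds the optimal bottleneck of the first j tasks for the
    current number of parts."""
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    N = len(weights)
    best = prefix[:]  # one part: bottleneck is the total of the first j tasks
    for _ in range(P - 1):
        new = [0] * (N + 1)
        for j in range(1, N + 1):
            new[j] = min(max(best[i], prefix[j] - prefix[i])
                         for i in range(j + 1))
        best = new
    return best[N]
```

For example, `dp_bottleneck([2, 3, 1, 4], 2)` returns 5, the optimal 2-way bottleneck. Bounding the inner `i` loop to a narrow window around each separator, as DP+ does, is what turns this quadratic table fill into the two-to-three order-of-magnitude speedups reported above.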

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.

[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.

[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.

[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.

[5] U.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.

[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.

[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.

[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.

[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.

[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.

[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.

[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.

[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.

[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.

[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.

[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.

[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.

[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.

[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.

[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.

[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.

[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.

[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.

[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.

[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.

[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.

[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.

[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.

[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, IEEE Trans. Comput. 40 (3) (1991) 295–306.

[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.

[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.

[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.

[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.

[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.

[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).


[8] GN Frederickson Optimal algorithms for partitioning trees in

Proceedings of the Second ACM-SIAM Symposium Discrete

Algorithms Orlando FL 1991

[9] MR Garey DS Johnson Computers and Intractability A

Guide to the Theory of NP-Completeness WH Freeman San

Fransisco CA 1979

[10] MR Garey DS Johnson L Stockmeyer Some simplified np-

complete graph problems Theoret Comput Sci 1 (1976) 237ndash267

[11] DM Gay Electronic mail distribution of linear programming test

problems Mathematical Programming Society COAL Newsletter

[12] Y Han B Narahari H-A Choi Mapping a chain task to

chained processors Inform Process Lett 44 (1992) 141ndash148

[13] P Hansen KW Lih Improved algorithms for partitioning

problems in parallel pipelined and distributed computing IEEE

Trans Comput 41 (6) (1992) 769ndash771

[14] B Hendrickson T Kolda Rectangular and structurally non-

symmetric sparse matrices for parallel processing SIAM J Sci

Comput 21 (6) (2000) 2048ndash2072

[15] MA Iqbal Efficient algorithms for partitioning problems in

Proceedings of the 1990 International Conference on Parallel

Processing Urbana Champaign IL 1990 pp III-123ndash127

[16] MA Iqbal Approximate algorithms for partitioning and assign-

ment problems Int J Parallel Programming 20 (5) (1991) 341ndash361

[17] MA Iqbal SH Bokhari Efficient algorithms for a class of

partitioning problems IEEE Trans Parallel Distributed Systems

6 (2) (1995) 170ndash175

[18] MA Iqbal JH Saltz SH Bokhari A comparative analysis of

static and dynamic load balancing strategies in Proceedings of the

1986 International Conference on Parallel Processing Penn State

Univ University Park PA 1986 pp 1040ndash1047

[19] V Kumar A Grama A Gupta G Karypis Introduction to Parallel

Computing Benjamin Cummings Redwood City CA 1994

[20] H Kutluca TM Kur C Aykanat Image-space decompositionalgorithms for sort-first parallel volume rendering of unstructured

grids J Supercomput 15 (1) (2000) 51ndash93

[21] T Lengauer Combinatorial Algorithms for Integrated Circuit

Layout Wiley Chichester UK 1990

[22] Linear programming problems Technical Report IOWA Opti-

mization Center ftpcolbizuiowaedupubtestproblpgondzio

[23] F Manne T S^revik Optimal partitioning of sequences

J Algorithms 19 (1995) 235ndash249

[24] S Miguet JM Pierson Heuristics for 1d rectilinear partitioning

as a low cost and high quality answer to dynamic load balancing

Lecture Notes in Computer Science Vol 1225 Springer Berlin

1997 pp 550ndash564

[25] S Molnar M Cox D Ellsworth H Fuchs A sorting classification

of parallel rendering IEEE Comput Grap Appl 14 (4) (1994) 23ndash32

[26] C Mueller The sort-first rendering architecture for high-

performance graphics in Proceedings of the 1995 Symposium

on Interactive 3D Graphics Monterey CA 1995 pp 75ndash84

[27] DM Nicol Rectilinear partitioning of irregular data parallel

computations J Parallel Distributed Computing 23 (1994)

119ndash134

[28] DM Nicol DR OrsquoHallaron Improved algorithms for mapping

pipelined and parallel computations Technical Report 88-2

ICASE 1988

[29] DM Nicol DR OrsquoHallaron Improved algorithms for mapping

pipelined and parallel computation IEEE Trans Comput 40 (3)

(1991) 295ndash306

ARTICLE IN PRESSA Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996996

[30] B Olstad F Manne Efficient partitioning of sequences IEEE

Trans Comput 44 (11) (1995) 1322ndash1326

[31] JR Pilkington SB Baden Dynamic partitioning of non-

uniform structured workloads with spacefilling curves IEEE

Trans Parallel Distributed Systems 7 (3) (1996) 288ndash299

[32] Y Saad Practical use of polynomial preconditionings for the

conjugate gradient method SIAM J Sci Statist Comput 6 (5)

(1985) 865ndash881

[33] MU Ujaldon EL Zapata BM Chapman HP Zima Vienna-

Fortranhpf extensions for sparse and irregular problems and

their compilation IEEE Trans Parallel Distributed Systems 8

(10) (1997) 1068ndash1083

[34] MU Ujaldon EL Zapata SD Sharma J Saltz Parallelization

techniques for sparse matrix applications J Parallel Distributed

Computing 38 (1996) 256ndash266

[35] MU Ujaldon EL Zapata SD Sharma J Saltz Experimental

evaluation of efficient sparse matrix distributions in Proceedings

of the ACM International Conference of Supercomputing

Philadelphia PA (1996) pp 78ndash86

Ali Pinarrsquos research interests are combinator-

ial scientific computingmdashthe boundary be-

tween scientific computing and combinatorial

algorithmsmdashand high performance comput-

ing He received his BS and MS degrees in

Computer Science from Bilkent University

Turkey and his PhD degree in Computer

Science with the option of Computational

Science and Engineering from University of

Illinois at Urbana-Champaign During his

PhD studies he frequently visited Sandia

National Laboratories in New Mexico which had a significant

influence on his education and research He joined the Computational

Research Division of Lawrence Berkeley National Laboratory in

October 2001 where he currently works as a research scientist

Cevdet Aykanat received the BS and MS

degrees from Middle East Technical Univer-

sity Ankara Turkey both in electrical

engineering and the PhD degree from Ohio

State University Columbus in electrical and

computer engineering He was a Fulbright

scholar during his PhD studies He worked

at the Intel Supercomputer Systems Division

Beaverton Oregon as a research associate

Since 1989 he has been affiliated with the

Department of Computer Engineering Bilk-

ent University Ankara Turkey where he is currently a professor His

research interests mainly include parallel computing parallel scientific

computing and its combinatorial aspects parallel computer graphics

applications parallel data mining graph and hypergraph-partitioning

load balancing neural network algorithms high performance infor-

mation retrieval and GIS systems parallel and distributed web

crawling distributed databases and grid computing He is a member

of the ACM and the IEEE Computer Society He has been recently

appointed as a member of IFIP Working Group 103 (Concurrent

Systems)

Table 5
Number of probes performed by parametric search algorithms

          Sparse-matrix dataset                          DVR dataset
CCP instance   Parametric search algorithms   CCP instance   Parametric search algorithms
Name     P     NC    NC   NC+  eBS  eBS+ EBS  Name       P     NC    NC   NC+  eBS+ & BID  EBS
NL 16 177 21 7 17 6 7 blunt256 16 202 18 6 16 + 1 9
32 352 19 7 17 7 7 32 407 19 7 19 + 1 10
64 720 27 5 16 7 6 64 835 18 10 21 + 1 12
128 1450 40 14 17 8 8 128 1686 25 15 22 + 1 13
256 2824 51 14 16 8 8 256 3556 203 62 23 + 1 11
cre-d 16 190 20 6 19 7 5 blunt512 16 245 20 8 17 + 1 9
32 393 25 11 19 9 8 32 502 24 13 18 + 1 10
64 790 24 10 19 8 6 64 1003 23 11 18 + 1 12
128 1579 27 10 19 9 9 128 2023 26 12 20 + 1 13
256 3183 37 13 18 9 9 256 4051 23 12 21 + 1 13
CQ9 16 182 19 3 18 6 5 blunt1024 16 271 21 10 16 + 1 13
32 373 22 8 18 8 7 32 557 24 12 18 + 1 12
64 740 29 12 18 8 8 64 1127 21 11 18 + 1 12
128 1492 40 12 17 8 8 128 2276 28 13 19 + 1 13
256 2971 50 14 18 9 9 256 4564 27 16 21 + 1 15
ken-11 16 185 21 6 17 6 6 post256 16 194 22 9 18 + 1 7
32 364 20 6 17 6 6 32 409 19 8 20 + 1 12
64 721 48 9 17 7 7 64 823 17 11 22 + 1 12
128 1402 61 8 16 7 7 128 1653 19 13 22 + 1 14
256 2783 96 7 17 8 8 256 3345 191 102 23 + 1 13
mod2 16 210 20 6 20 5 4 post512 16 235 24 9 16 + 1 10
32 432 26 8 19 5 5 32 485 23 12 18 + 1 11
64 867 24 6 19 8 8 64 975 22 13 20 + 1 14
128 1727 38 7 20 7 7 128 1975 25 17 21 + 1 14
256 3444 47 9 20 9 9 256 3947 29 20 22 + 1 15
world 16 211 22 6 19 5 4 post1024 16 261 27 10 17 + 1 13
32 424 26 5 19 6 6 32 538 23 13 19 + 1 14
64 865 30 12 19 9 9 64 1090 22 12 17 + 1 14
128 1730 44 10 19 8 8 128 2201 25 15 21 + 1 16
256 3441 44 11 19 10 10 256 4428 28 16 22 + 1 16
Averages over P
16 193 205 57 185 58 52 16 235 220 87 167+1 102
32 390 230 75 183 68 65 32 483 220 108 187+1 115
64 784 303 90 180 78 73 64 976 205 113 193+1 127
128 1564 417 102 180 78 78 128 1969 247 142 208+1 138
256 3108 542 113 180 88 88 256 3982 835 380 220+1 138

DP+ is competitive with the parametric-search algorithms. For the SpM dataset, DP+ is 63.0 times faster than DP on average in 16-way partitioning, and this ratio decreases to 37.8, 18.9, 10.6, and 5.6 with increasing number of processors. For the DVR dataset, if the initial prefix-sum is not included, DP+ is 127.7, 57.8, 33.2, 15.9, and 7.1 times faster than DP for P = 16, 32, 64, 128, and 256, respectively, on average. This decrease is expected, because the effectiveness of separator-index bounding decreases with increasing P. These experimental findings agree with the variation in the effectiveness of separator-index bounding values seen in Table 4. In the iterative-refinement approach, MS+ is also one-to-three orders of magnitude faster than MS, where this drastic improvement simply depends on the scheme used for finding an initial leftist partition.

As Tables 6 and 7 reveal, significant improvement ratios are also obtained for the parametric search algorithms. On average, NC+ is 4.2, 3.5, 3.1, 3.7, and 3.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively, for the SpM dataset. For the DVR dataset, if the initial prefix-sum time is not included, NC+ is 4.2, 3.2, 2.6, 2.5, and 2.7 times faster than NC for P = 16, 32, 64, 128, and 256, respectively.

Table 6
Partitioning times for sparse-matrix dataset as percents of SpMxV times

CCP instance    Heuristics         Exact algorithms
                                   Existing                 Proposed
Name      P     H1    H2    RB     DP    MS    eBS   NC     DP+   MS+   eBS+   NC+   BID
NL 16 009 009 009 93 119 093 120 040 044 040 040 022
32 018 018 013 177 194 177 217 173 173 080 080 044
64 035 035 031 367 307 315 585 581 412 186 177 182
128 089 084 071 748 485 630 1712 1987 2692 426 732 421
256 151 155 137 1461 757 1109 4031 9180 9641 980 1570 2106
cre-d 16 003 001 001 33 36 033 037 012 006 010 011 003
32 004 006 006 71 61 065 093 047 120 029 033 010
64 012 012 008 147 98 122 169 166 183 061 071 021
128 028 029 014 283 150 241 359 547 1094 168 168 061
256 053 054 025 571 221 420 1022 2705 2602 330 446 120
CQ9 16 004 004 004 59 73 048 059 020 013 015 013 017
32 009 009 007 127 120 094 131 094 072 041 041 028
64 020 017 015 248 195 185 318 301 447 100 144 068
128 044 044 031 485 303 318 852 965 1882 244 305 251
256 092 092 063 982 469 651 2033 6956 722 532 745 1492
ken-11 16 010 010 010 238 344 127 188 056 061 046 041 025
32 020 020 015 501 522 229 310 382 478 122 117 163
64 046 046 036 993 778 448 1308 941 2478 310 346 229
128 117 117 071 1876 1139 921 3959 5013 5003 641 662 174
256 214 214 142 3827 1706 1776 11913 12901 24148 1491 1450 2911
mod2 16 002 002 001 92 119 027 034 007 005 006 006 003
32 004 004 002 188 192 051 078 019 018 016 015 007
64 011 010 006 378 307 104 191 086 218 051 055 023
128 024 023 011 751 482 228 592 261 372 106 123 053
256 048 048 023 1486 756 426 1114 1211 5004 291 343 366
world 16 002 002 002 99 122 029 033 008 005 007 008 003
32 004 004 004 198 196 059 088 026 021 018 019 008
64 012 011 009 385 306 116 170 109 484 064 054 035
128 024 023 019 769 496 219 532 448 1015 131 123 177
256 047 047 036 1513 764 406 1183 1154 6999 346 350 248
Averages over P
16 005 005 005 102 136 060 078 024 022 021 020 012
32 010 010 008 210 214 112 153 123 147 051 051 043
64 023 022 018 420 332 215 457 364 704 129 141 093
128 054 053 036 819 509 426 1334 1537 201 286 352 451
256 101 101 071 1640 779 798 3549 5684 9269 662 817 1207

For the SpM dataset, eBS+ is 3.4, 2.5, 1.8, 1.6, and 1.3 times faster than eBS for P = 16, 32, 64, 128, and 256, respectively. These improvement ratios in the execution times of the parametric search algorithms are below the improvement ratios in the numbers of probes displayed in Table 5. Overhead due to the RB call and the initial setting of the separator indices contributes to this difference in both NC+ and eBS+. Furthermore, the initial probes with very large bottleneck values are very cheap in eBS.

In Table 6, the relative performance comparison of the proposed exact CCP algorithms shows that BID is the clear winner for small-to-moderate P values (i.e., P ≤ 64) in the SpM dataset. The relative performance of BID degrades with increasing P, so that both eBS+ and NC+ begin to run faster than BID for P ≥ 128 in general, where eBS+ becomes the winner. For the DVR dataset, NC+ and EBS are the clear winners, as seen in Table 7. NC+ runs slightly faster than EBS for P ≤ 128, but EBS runs considerably faster than NC+ for P = 256. BID can compete with these two algorithms only for 16- and 32-way partitionings of grid blunt-fin (for all mesh resolutions). As seen in Table 6, BID takes less than 1% of a single SpMxV time for P ≤ 64 on average.

Table 7
Partitioning times (in msecs) for DVR dataset

CCP instance            Heuristics        Exact algorithms
                Prefix                    Existing          Proposed
Name      P     sum     H1    H2    RB    DP    MS    NC    DP+    MS+    NC+    EBS    BID
blunt256 16 002 002 003 68 49 036 021 053 010 014 003
32 005 005 005 141 77 077 076 076 021 027 024
64 195 011 012 009 286 134 139 386 872 064 071 078
128 024 025 020 581 206 366 1237 2995 178 184 233
256 047 050 037 1139 296 5253 4289 078 1396 311 5669
blunt512 16 003 003 003 356 200 055 025 091 014 016 009
32 007 007 007 792 353 131 123 213 044 043 022
64 1345 018 019 014 1688 593 265 368 928 094 110 045
128 039 040 028 3469 979 584 1501 4672 229 263 165
256 074 075 054 7040 1637 970 5750 16449 559 595 1072
blunt1024 16 003 003 003 1455 780 075 093 477 025 028 019
32 012 011 008 3251 1432 181 202 892 060 062 031
64 5912 027 027 019 6976 2353 350 735 2228 127 137 139
128 050 051 037 14 337 3911 914 3301 6938 336 356 1621
256 097 100 068 29 417 6567 1659 18009 53810 848 898 1629
post256 16 002 003 003 79 61 046 018 013 010 011 024
32 005 005 005 157 100 077 105 174 026 031 076
64 216 010 011 010 336 167 137 526 972 061 070 467
128 025 026 018 672 278 290 2294 5510 165 177 2396
256 047 048 037 1371 338 4516 8478 077 1304 362 29932
post512 16 002 002 003 612 454 063 020 044 014 016 121
32 007 007 007 1254 699 130 249 195 038 042 336
64 2055 018 018 013 2559 1139 258 1693 3873 097 107 853
128 038 039 028 5221 1936 584 6825 15615 227 244 3154
256 070 073 053 10 683 3354 1267 25648 72654 544 539 14088
post1024 16 003 004 003 2446 1863 091 288 573 017 021 263
32 009 008 008 5056 2917 161 1357 1737 051 058 1273
64 6909 025 026 017 10 340 4626 356 3505 3325 135 147 3253
128 051 053 036 21 102 7838 790 15634 54692 295 328 7696
256 095 098 067 43 652 13770 1630 76459 145187 755 702 30070
Averages normalized wrt RB times (prefix-sum time not included)
16 083 094 100 27 863 18919 2033 2583 6950 500 589 2439
32 110 106 100 23 169 12156 1847 4737 7282 583 646 3902
64 130 135 100 22 636 9291 1788 8281 14519 700 781 5381
128 135 139 100 22 506 7553 2046 16836 48119 860 931 8681
256 135 139 100 24 731 6880 5910 39025 77299 1955 1051 28677
Averages normalized wrt RB times (prefix-sum time included)
16 100 100 100 32 23 108 104 109 101 102 103
32 100 100 100 66 36 115 121 129 104 105 113
64 100 100 100 135 58 127 197 290 111 112 155
128 101 101 100 275 94 162 475 1053 128 130 325
256 102 102 100 528 142 798 1464 1371 295 155 2673

For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.

7. Conclusions

We proposed runtime-efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a preprocessing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for parametric search algorithms to narrow separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating the lower and upper bounds to realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregularly sparse matrices for parallel matrix–vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Ozguner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] U.V. Catalyurek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State Univ., University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurc, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295–306.
[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).

ARTICLE IN PRESS

Table 6

Partitioning times for sparse-matrix dataset as percents of SpMxV times

CCP instance Heuristics Exact algorithms

Name P H1 H2 RB Existing Proposed

DP MS eBS NC DP+ MS+ eBS+ NC+ BID

NL 16 009 009 009 93 119 093 120 040 044 040 040 022

32 018 018 013 177 194 177 217 173 173 080 080 044

64 035 035 031 367 307 315 585 581 412 186 177 182

128 089 084 071 748 485 630 1712 1987 2692 426 732 421

256 151 155 137 1461 757 1109 4031 9180 9641 980 1570 2106

cre-d 16 003 001 001 33 36 033 037 012 006 010 011 003

32 004 006 006 71 61 065 093 047 120 029 033 010

64 012 012 008 147 98 122 169 166 183 061 071 021

128 028 029 014 283 150 241 359 547 1094 168 168 061

256 053 054 025 571 221 420 1022 2705 2602 330 446 120

CQ9 16 004 004 004 59 73 048 059 020 013 015 013 017

32 009 009 007 127 120 094 131 094 072 041 041 028

64 020 017 015 248 195 185 318 301 447 100 144 068

128 044 044 031 485 303 318 852 965 1882 244 305 251

256 092 092 063 982 469 651 2033 6956 722 532 745 1492

ken-11 16 010 010 010 238 344 127 188 056 061 046 041 025

32 020 020 015 501 522 229 310 382 478 122 117 163

64 046 046 036 993 778 448 1308 941 2478 310 346 229

128 117 117 071 1876 1139 921 3959 5013 5003 641 662 174

256 214 214 142 3827 1706 1776 11913 12901 24148 1491 1450 2911

mod2 16 002 002 001 92 119 027 034 007 005 006 006 003

32 004 004 002 188 192 051 078 019 018 016 015 007

64 011 010 006 378 307 104 191 086 218 051 055 023

128 024 023 011 751 482 228 592 261 372 106 123 053

256 048 048 023 1486 756 426 1114 1211 5004 291 343 366

world 16 002 002 002 99 122 029 033 008 005 007 008 003

32 004 004 004 198 196 059 088 026 021 018 019 008

64 012 011 009 385 306 116 170 109 484 064 054 035

128 024 023 019 769 496 219 532 448 1015 131 123 177

256 047 047 036 1513 764 406 1183 1154 6999 346 350 248

Averages over P

16 005 005 005 102 136 060 078 024 022 021 020 012

32 010 010 008 210 214 112 153 123 147 051 051 043

64 023 022 018 420 332 215 457 364 704 129 141 093

128 054 053 036 819 509 426 1334 1537 201 286 352 451

256 101 101 071 1640 779 798 3549 5684 9269 662 817 1207

A Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996 993

SpM dataset eBSthorn is 34 25 18 16 and 13 timesfaster than eBS for P frac14 16 32 64 128 and 256respectively These improvement ratios in the executiontimes of the parametric search algorithms are below theimprovement ratios in the numbers of probes displayedin Table 5 Overhead due to the RB call and initialsettings of the separator indices contributes to thisdifference in both NC+ and eBSthorn Furthermore costsof initial probes with very large bottleneck values arevery cheap in eBSIn Table 6 relative performance comparison of

the proposed exact CCP algorithms shows that BID is

the clear winner for small to moderate P values(ie Pp64) in the SpM dataset Relative performanceof BID degrades with increasing P so that botheBSthorn and NC+ begin to run faster than BID forPX128 in general where eBSthorn becomes the winner Forthe DVR dataset NC+ and EBS are clear winnersas seen in Table 7 NC+ runs slightly faster than EBSfor Pp128 but EBS runs considerably faster thanNC+ for P frac14 256 BID can compete with thesetwo algorithms only for 16- and 32-way partitioningof grid blunt-fin (for all mesh resolutions) As seen inTable 6 BID takes less than 1 of a single SpMxV time

ARTICLE IN PRESS

Table 7
Partitioning times (in msecs) for DVR dataset

CCP instance     Prefix  Heuristics        Existing exact      Proposed exact
Name        P    sum     H1    H2    RB    DP     MS     NC    DP+    MS+     NC+    EBS   BID

blunt256    16           002   002   003   68     49     036   021    053     010    014   003
            32           005   005   005   141    77     077   076    076     021    027   024
            64   195     011   012   009   286    134    139   386    872     064    071   078
            128          024   025   020   581    206    366   1237   2995    178    184   233
            256          047   050   037   1139   296    5253  4289   078     1396   311   5669

blunt512    16           003   003   003   356    200    055   025    091     014    016   009
            32           007   007   007   792    353    131   123    213     044    043   022
            64   1345    018   019   014   1688   593    265   368    928     094    110   045
            128          039   040   028   3469   979    584   1501   4672    229    263   165
            256          074   075   054   7040   1637   970   5750   16449   559    595   1072

blunt1024   16           003   003   003   1455   780    075   093    477     025    028   019
            32           012   011   008   3251   1432   181   202    892     060    062   031
            64   5912    027   027   019   6976   2353   350   735    2228    127    137   139
            128          050   051   037   14337  3911   914   3301   6938    336    356   1621
            256          097   100   068   29417  6567   1659  18009  53810   848    898   1629

post256     16           002   003   003   79     61     046   018    013     010    011   024
            32           005   005   005   157    100    077   105    174     026    031   076
            64   216     010   011   010   336    167    137   526    972     061    070   467
            128          025   026   018   672    278    290   2294   5510    165    177   2396
            256          047   048   037   1371   338    4516  8478   077     1304   362   29932

post512     16           002   002   003   612    454    063   020    044     014    016   121
            32           007   007   007   1254   699    130   249    195     038    042   336
            64   2055    018   018   013   2559   1139   258   1693   3873    097    107   853
            128          038   039   028   5221   1936   584   6825   15615   227    244   3154
            256          070   073   053   10683  3354   1267  25648  72654   544    539   14088

post1024    16           003   004   003   2446   1863   091   288    573     017    021   263
            32           009   008   008   5056   2917   161   1357   1737    051    058   1273
            64   6909    025   026   017   10340  4626   356   3505   3325    135    147   3253
            128          051   053   036   21102  7838   790   15634  54692   295    328   7696
            256          095   098   067   43652  13770  1630  76459  145187  755    702   30070

Averages normalized w.r.t. RB times (prefix-sum time not included)
            16           083   094   100   27863  18919  2033  2583   6950    500    589   2439
            32           110   106   100   23169  12156  1847  4737   7282    583    646   3902
            64           130   135   100   22636  9291   1788  8281   14519   700    781   5381
            128          135   139   100   22506  7553   2046  16836  48119   860    931   8681
            256          135   139   100   24731  6880   5910  39025  77299   1955   1051  28677

Averages normalized w.r.t. RB times (prefix-sum time included)
            16           100   100   100   32     23     108   104    109     101    102   103
            32           100   100   100   66     36     115   121    129     104    105   113
            64           100   100   100   135    58     127   197    290     111    112   155
            128          101   101   100   275    94     162   475    1053    128    130   325
            256          102   102   100   528    142    798   1464   1371    295    155   2673

A. Pınar, C. Aykanat / J. Parallel Distrib. Comput. 64 (2004) 974–996

for P ≤ 64 on average. For the DVR dataset (Table 7), the initial prefix-sum time is considerably larger than the actual partitioning time of the EBS algorithm in all CCP instances except the 256-way partitioning of post256. As seen in Table 7, EBS is only 12% slower than the RB heuristic in 64-way partitionings on average.

These experimental findings show that the proposed exact CCP algorithms should replace heuristics.


7. Conclusions

We proposed runtime efficient chains-on-chains partitioning algorithms for optimal load balancing in 1D decompositions of nonuniform computational domains. Our main idea was to run an effective heuristic as a pre-processing step to find a "good" upper bound on the optimal bottleneck value, and then to exploit lower and upper bounds on the optimal bottleneck value to restrict the search space for separator-index values. This separator-index bounding scheme was exploited in a static manner in the dynamic-programming algorithm, drastically reducing the number of table entries computed and referenced. A dynamic separator-index bounding scheme was proposed for parametric search algorithms to narrow separator-index ranges after each probe. We enhanced the approximate bisection algorithm to be exact by updating lower and upper bounds into realizable values after each probe. We also proposed a new iterative-refinement scheme that is very fast for small-to-medium numbers of processors.

We investigated two distinct applications for experimental performance evaluation: 1D decomposition of irregular sparse matrices for parallel matrix–vector multiplication, and decomposition for image-space parallel volume rendering. Experiments on the sparse matrix dataset showed that 64-way decompositions can be computed 100 times faster than a single sparse-matrix vector multiplication, while reducing the load imbalance by a factor of four over the most effective heuristic. Experimental results on the volume rendering dataset showed that exact algorithms can produce 3.8 times better 64-way decompositions than the most effective heuristic, while being only 11% slower on average.
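The parametric-search algorithms discussed above all rest on a feasibility probe: given a candidate bottleneck value, a greedy pass over the prefix-summed workload array decides whether P parts suffice, and repeated probing narrows the lower and upper bounds. The sketch below, in C (the language of the released code), is a minimal illustration under our own naming and simplifications, not the authors' implementation.

```c
#include <assert.h>

/* Feasibility probe for chains-on-chains partitioning.
   W is the prefix-sum array of task weights: W[0] = 0, W[i] = w_1 + ... + w_i.
   Returns 1 if the chain of n tasks can be split into at most p consecutive
   parts so that no part's load exceeds `bottleneck`. Each probe costs
   O(p log n): one binary search per separator placed. */
static int probe(const double *W, int n, int p, double bottleneck)
{
    int last = 0;                        /* index of the last separator placed */
    for (int part = 0; part < p && last < n; part++) {
        double limit = W[last] + bottleneck;
        /* Binary search for the largest hi with W[hi] <= limit. */
        int lo = last, hi = n;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (W[mid] <= limit)
                lo = mid;
            else
                hi = mid - 1;
        }
        if (lo == last)                  /* a single task exceeds the bottleneck */
            return 0;
        last = lo;                       /* greedily fill this part to capacity */
    }
    return last == n;                    /* feasible iff every task was covered */
}
```

Bisection then repeatedly probes a value between the current bounds; the exact variant proposed in the paper tightens both bounds to realizable subchain weights after every probe, so the search terminates at the exact optimal bottleneck value rather than within an ε tolerance.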

8. Availability

The methods proposed in this work are implemented in the C programming language and are made publicly available at http://www.cse.uiuc.edu/~alipinar/ccp.

References

[1] S. Anily, A. Federgruen, Structured partitioning problems, Oper. Res. 13 (1991) 130–149.
[2] C. Aykanat, F. Özgüner, Large grain parallel conjugate gradient algorithms on a hypercube multiprocessor, in: Proceedings of the 1987 International Conference on Parallel Processing, Penn State University, University Park, PA, 1987, pp. 641–645.
[3] C. Aykanat, F. Özgüner, F. Ercal, P. Sadayappan, Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput. 37 (12) (1988) 1554–1567.
[4] S.H. Bokhari, Partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 37 (1) (1988) 48–57.
[5] Ü.V. Çatalyürek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distributed Systems 10 (7) (1999) 673–693.
[6] H. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in: Proceedings of the 1991 International Conference on Parallel Processing, Austin, TX, 1991, pp. I-625–I-628.
[7] G.N. Frederickson, Optimal algorithms for partitioning trees and locating p-centers in trees, Technical Report CSD-TR-1029, Purdue University, 1990.
[8] G.N. Frederickson, Optimal algorithms for partitioning trees, in: Proceedings of the Second ACM-SIAM Symposium on Discrete Algorithms, Orlando, FL, 1991.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[10] M.R. Garey, D.S. Johnson, L. Stockmeyer, Some simplified NP-complete graph problems, Theoret. Comput. Sci. 1 (1976) 237–267.
[11] D.M. Gay, Electronic mail distribution of linear programming test problems, Mathematical Programming Society COAL Newsletter.
[12] Y. Han, B. Narahari, H.-A. Choi, Mapping a chain task to chained processors, Inform. Process. Lett. 44 (1992) 141–148.
[13] P. Hansen, K.W. Lih, Improved algorithms for partitioning problems in parallel, pipelined and distributed computing, IEEE Trans. Comput. 41 (6) (1992) 769–771.
[14] B. Hendrickson, T. Kolda, Rectangular and structurally nonsymmetric sparse matrices for parallel processing, SIAM J. Sci. Comput. 21 (6) (2000) 2048–2072.
[15] M.A. Iqbal, Efficient algorithms for partitioning problems, in: Proceedings of the 1990 International Conference on Parallel Processing, Urbana-Champaign, IL, 1990, pp. III-123–127.
[16] M.A. Iqbal, Approximate algorithms for partitioning and assignment problems, Int. J. Parallel Programming 20 (5) (1991) 341–361.
[17] M.A. Iqbal, S.H. Bokhari, Efficient algorithms for a class of partitioning problems, IEEE Trans. Parallel Distributed Systems 6 (2) (1995) 170–175.
[18] M.A. Iqbal, J.H. Saltz, S.H. Bokhari, A comparative analysis of static and dynamic load balancing strategies, in: Proceedings of the 1986 International Conference on Parallel Processing, Penn State University, University Park, PA, 1986, pp. 1040–1047.
[19] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
[20] H. Kutluca, T.M. Kurç, C. Aykanat, Image-space decomposition algorithms for sort-first parallel volume rendering of unstructured grids, J. Supercomput. 15 (1) (2000) 51–93.
[21] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, Chichester, UK, 1990.
[22] Linear programming problems, Technical Report, IOWA Optimization Center, ftp://col.biz.uiowa.edu/pub/testprob/lp/gondzio.
[23] F. Manne, T. Sørevik, Optimal partitioning of sequences, J. Algorithms 19 (1995) 235–249.
[24] S. Miguet, J.M. Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, Lecture Notes in Computer Science, Vol. 1225, Springer, Berlin, 1997, pp. 550–564.
[25] S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering, IEEE Comput. Graph. Appl. 14 (4) (1994) 23–32.
[26] C. Mueller, The sort-first rendering architecture for high-performance graphics, in: Proceedings of the 1995 Symposium on Interactive 3D Graphics, Monterey, CA, 1995, pp. 75–84.
[27] D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, J. Parallel Distributed Computing 23 (1994) 119–134.
[28] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computations, Technical Report 88-2, ICASE, 1988.
[29] D.M. Nicol, D.R. O'Hallaron, Improved algorithms for mapping pipelined and parallel computation, IEEE Trans. Comput. 40 (3) (1991) 295–306.


[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.
[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.
[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.
[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.
[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.
[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference on Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing (the boundary between scientific computing and combinatorial algorithms) and high performance computing. He received his BS and MS degrees in Computer Science from Bilkent University, Turkey, and his PhD degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his PhD studies he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval and GIS systems, parallel and distributed web crawling, distributed databases, and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).

ARTICLE IN PRESS

Table 7

Partitioning times (in msecs) for DVR dataset

CCP instance Prefix Heuristics Exact algorithms

Name P sum H1 H2 RB Existing Proposed

DP MS NC DP+ MS+ NC+ EBS BID

blunt256 16 002 002 003 68 49 036 021 053 010 014 003

32 005 005 005 141 77 077 076 076 021 027 024

64 195 011 012 009 286 134 139 386 872 064 071 078

128 024 025 020 581 206 366 1237 2995 178 184 233

256 047 050 037 1139 296 5253 4289 078 1396 311 5669

blunt512 16 003 003 003 356 200 055 025 091 014 016 009

32 007 007 007 792 353 131 123 213 044 043 022

64 1345 018 019 014 1688 593 265 368 928 094 110 045

128 039 040 028 3469 979 584 1501 4672 229 263 165

256 074 075 054 7040 1637 970 5750 16449 559 595 1072

blunt1024 16 003 003 003 1455 780 075 093 477 025 028 019

32 012 011 008 3251 1432 181 202 892 060 062 031

64 5912 027 027 019 6976 2353 350 735 2228 127 137 139

128 050 051 037 14 337 3911 914 3301 6938 336 356 1621

256 097 100 068 29 417 6567 1659 18009 53810 848 898 1629

post256 16 002 003 003 79 61 046 018 013 010 011 024

32 005 005 005 157 100 077 105 174 026 031 076

64 216 010 011 010 336 167 137 526 972 061 070 467

128 025 026 018 672 278 290 2294 5510 165 177 2396

256 047 048 037 1371 338 4516 8478 077 1304 362 29932

post512 16 002 002 003 612 454 063 020 044 014 016 121

32 007 007 007 1254 699 130 249 195 038 042 336

64 2055 018 018 013 2559 1139 258 1693 3873 097 107 853

128 038 039 028 5221 1936 584 6825 15615 227 244 3154

256 070 073 053 10 683 3354 1267 25648 72654 544 539 14088

post1024 16 003 004 003 2446 1863 091 288 573 017 021 263

32 009 008 008 5056 2917 161 1357 1737 051 058 1273

64 6909 025 026 017 10 340 4626 356 3505 3325 135 147 3253

128 051 053 036 21 102 7838 790 15634 54692 295 328 7696

256 095 098 067 43 652 13770 1630 76459 145187 755 702 30070

Averages normalized wrt RB times ( prefix-sum time not included )

16 083 094 100 27 863 18919 2033 2583 6950 500 589 2439

32 110 106 100 23 169 12156 1847 4737 7282 583 646 3902

64 130 135 100 22 636 9291 1788 8281 14519 700 781 5381

128 135 139 100 22 506 7553 2046 16836 48119 860 931 8681

256 135 139 100 24 731 6880 5910 39025 77299 1955 1051 28677

Averages normalized wrt RB times ( prefix-sum time included )

16 100 100 100 32 23 108 104 109 101 102 103

32 100 100 100 66 36 115 121 129 104 105 113

64 100 100 100 135 58 127 197 290 111 112 155

128 101 101 100 275 94 162 475 1053 128 130 325

256 102 102 100 528 142 798 1464 1371 295 155 2673

A Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996994

for Pp64 on average For the DVR dataset (Table 7)the initial prefix-sum time is considerably larger thanthe actual partitioning time of the EBS algorithm inall CCP instances except 256-way partitioning ofpost256 As seen in Table 7 EBS is only 12 slower

than the RB heuristic in 64-way partitionings onaverageThese experimental findings show that the pro-

posed exact CCP algorithms should replaceheuristics

ARTICLE IN PRESSA Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996 995

7 Conclusions

We proposed runtime efficient chains-on-chains par-titioning algorithms for optimal load balancing in 1-Ddecompositions of nonuniform computational domainsOur main idea was to run an effective heuristic as a pre-processing step to find a lsquolsquogoodrsquorsquo upper bound on theoptimal bottleneck value and then exploit lower andupper bounds on the optimal bottleneck value to restrictthe search space for separator-index values Thisseparator-index bounding scheme was exploited in astatic manner in the dynamic-programming algorithmdrastically reducing the number of table entries com-puted and referenced A dynamic separator-indexbounding scheme was proposed for parametric searchalgorithms to narrow separator-index ranges after eachprobe We enhanced the approximate bisection algo-rithm to be exact by updating lower and upper boundsinto realizable values after each probe We proposed anew iterative-refinement scheme that is very fast forsmall-to-medium numbers of processorsWe investigated two distinct applications for experi-

mental performance evaluation 1D decomposition ofirregularly sparse matrices for parallel matrixndashvectormultiplication and decomposition for image-spaceparallel volume rendering Experiments on the sparsematrix dataset showed that 64-way decompositions canbe computed 100 times faster than a single sparse matrixvector multiplication while reducing the load imbalanceby a factor of four over the most effective heuristicExperimental results on the volume rendering datasetshowed that exact algorithms can produce 38 timesbetter 64-way decompositions than the most effectiveheuristic while being only 11 slower on average

8 Availability

The methods proposed in this work are implemented inC programming language and are made publicly availableat httpwwwcseuiucedu~alipinarccp

References

[1] S Anily A Federgruen Structured partitioning problems Oper

Res 13 (1991) 130ndash149

[2] C Aykanat F Ozguner Large grain parallel conjugate gradient

algorithms on a hypercube multiprocessor in Proceedings of the

1987 International Conference on Parallel Processing Penn State

University University Park PA 1987 pp 641ndash645

[3] C Aykanat F Ozguner F Ercal P Sadayappan Iterative

algorithms for solution of large sparse systems of linear equations

on hypercubes IEEE Trans Comput 37 (12) (1988) 1554ndash1567

[4] SH Bokhari Partitioning problems in parallel pipelined and

distributed computing IEEE Trans Comput 37 (1) (1988) 48ndash57

[5] UV -Catalyurek C Aykanat Hypergraph-partitioning-based

decomposition for parallel sparse-matrix vector multiplication

IEEE Trans Parallel Distributed Systems 10 (7) (1999) 673ndash693

[6] H Choi B Narahari Algorithms for mapping and partitioning

chain structured parallel computations in Proceedings of the

1991 International Conference on Parallel Processing Austin

TX 1991 pp I-625ndashI-628

[7] GN Frederickson Optimal algorithms for partitioning trees and

locating p-centers in trees Technical Report CSD-TR-1029

Purdue University (1990)

[8] GN Frederickson Optimal algorithms for partitioning trees in

Proceedings of the Second ACM-SIAM Symposium Discrete

Algorithms Orlando FL 1991

[9] MR Garey DS Johnson Computers and Intractability A

Guide to the Theory of NP-Completeness WH Freeman San

Fransisco CA 1979

[10] MR Garey DS Johnson L Stockmeyer Some simplified np-

complete graph problems Theoret Comput Sci 1 (1976) 237ndash267

[11] DM Gay Electronic mail distribution of linear programming test

problems Mathematical Programming Society COAL Newsletter

[12] Y Han B Narahari H-A Choi Mapping a chain task to

chained processors Inform Process Lett 44 (1992) 141ndash148

[13] P Hansen KW Lih Improved algorithms for partitioning

problems in parallel pipelined and distributed computing IEEE

Trans Comput 41 (6) (1992) 769ndash771

[14] B Hendrickson T Kolda Rectangular and structurally non-

symmetric sparse matrices for parallel processing SIAM J Sci

Comput 21 (6) (2000) 2048ndash2072

[15] MA Iqbal Efficient algorithms for partitioning problems in

Proceedings of the 1990 International Conference on Parallel

Processing Urbana Champaign IL 1990 pp III-123ndash127

[16] MA Iqbal Approximate algorithms for partitioning and assign-

ment problems Int J Parallel Programming 20 (5) (1991) 341ndash361

[17] MA Iqbal SH Bokhari Efficient algorithms for a class of

partitioning problems IEEE Trans Parallel Distributed Systems

6 (2) (1995) 170ndash175

[18] MA Iqbal JH Saltz SH Bokhari A comparative analysis of

static and dynamic load balancing strategies in Proceedings of the

1986 International Conference on Parallel Processing Penn State

Univ University Park PA 1986 pp 1040ndash1047

[19] V Kumar A Grama A Gupta G Karypis Introduction to Parallel

Computing Benjamin Cummings Redwood City CA 1994

[20] H Kutluca TM Kur C Aykanat Image-space decompositionalgorithms for sort-first parallel volume rendering of unstructured

grids J Supercomput 15 (1) (2000) 51ndash93

[21] T Lengauer Combinatorial Algorithms for Integrated Circuit

Layout Wiley Chichester UK 1990

[22] Linear programming problems Technical Report IOWA Opti-

mization Center ftpcolbizuiowaedupubtestproblpgondzio

[23] F Manne T S^revik Optimal partitioning of sequences

J Algorithms 19 (1995) 235ndash249

[24] S Miguet JM Pierson Heuristics for 1d rectilinear partitioning

as a low cost and high quality answer to dynamic load balancing

Lecture Notes in Computer Science Vol 1225 Springer Berlin

1997 pp 550ndash564

[25] S Molnar M Cox D Ellsworth H Fuchs A sorting classification

of parallel rendering IEEE Comput Grap Appl 14 (4) (1994) 23ndash32

[26] C Mueller The sort-first rendering architecture for high-

performance graphics in Proceedings of the 1995 Symposium

on Interactive 3D Graphics Monterey CA 1995 pp 75ndash84

[27] DM Nicol Rectilinear partitioning of irregular data parallel

computations J Parallel Distributed Computing 23 (1994)

119ndash134

[28] DM Nicol DR OrsquoHallaron Improved algorithms for mapping

pipelined and parallel computations Technical Report 88-2

ICASE 1988

[29] DM Nicol DR OrsquoHallaron Improved algorithms for mapping

pipelined and parallel computation IEEE Trans Comput 40 (3)

(1991) 295ndash306

ARTICLE IN PRESSA Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996996

[30] B Olstad F Manne Efficient partitioning of sequences IEEE

Trans Comput 44 (11) (1995) 1322ndash1326

[31] JR Pilkington SB Baden Dynamic partitioning of non-

uniform structured workloads with spacefilling curves IEEE

Trans Parallel Distributed Systems 7 (3) (1996) 288ndash299

[32] Y Saad Practical use of polynomial preconditionings for the

conjugate gradient method SIAM J Sci Statist Comput 6 (5)

(1985) 865ndash881

[33] MU Ujaldon EL Zapata BM Chapman HP Zima Vienna-

Fortranhpf extensions for sparse and irregular problems and

their compilation IEEE Trans Parallel Distributed Systems 8

(10) (1997) 1068ndash1083

[34] MU Ujaldon EL Zapata SD Sharma J Saltz Parallelization

techniques for sparse matrix applications J Parallel Distributed

Computing 38 (1996) 256ndash266

[35] MU Ujaldon EL Zapata SD Sharma J Saltz Experimental

evaluation of efficient sparse matrix distributions in Proceedings

of the ACM International Conference of Supercomputing

Philadelphia PA (1996) pp 78ndash86

Ali Pinarrsquos research interests are combinator-

ial scientific computingmdashthe boundary be-

tween scientific computing and combinatorial

algorithmsmdashand high performance comput-

ing He received his BS and MS degrees in

Computer Science from Bilkent University

Turkey and his PhD degree in Computer

Science with the option of Computational

Science and Engineering from University of

Illinois at Urbana-Champaign During his

PhD studies he frequently visited Sandia

National Laboratories in New Mexico which had a significant

influence on his education and research He joined the Computational

Research Division of Lawrence Berkeley National Laboratory in

October 2001 where he currently works as a research scientist

Cevdet Aykanat received the BS and MS

degrees from Middle East Technical Univer-

sity Ankara Turkey both in electrical

engineering and the PhD degree from Ohio

State University Columbus in electrical and

computer engineering He was a Fulbright

scholar during his PhD studies He worked

at the Intel Supercomputer Systems Division

Beaverton Oregon as a research associate

Since 1989 he has been affiliated with the

Department of Computer Engineering Bilk-

ent University Ankara Turkey where he is currently a professor His

research interests mainly include parallel computing parallel scientific

computing and its combinatorial aspects parallel computer graphics

applications parallel data mining graph and hypergraph-partitioning

load balancing neural network algorithms high performance infor-

mation retrieval and GIS systems parallel and distributed web

crawling distributed databases and grid computing He is a member

of the ACM and the IEEE Computer Society He has been recently

appointed as a member of IFIP Working Group 103 (Concurrent

Systems)

ARTICLE IN PRESSA Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996 995

7 Conclusions

We proposed runtime efficient chains-on-chains par-titioning algorithms for optimal load balancing in 1-Ddecompositions of nonuniform computational domainsOur main idea was to run an effective heuristic as a pre-processing step to find a lsquolsquogoodrsquorsquo upper bound on theoptimal bottleneck value and then exploit lower andupper bounds on the optimal bottleneck value to restrictthe search space for separator-index values Thisseparator-index bounding scheme was exploited in astatic manner in the dynamic-programming algorithmdrastically reducing the number of table entries com-puted and referenced A dynamic separator-indexbounding scheme was proposed for parametric searchalgorithms to narrow separator-index ranges after eachprobe We enhanced the approximate bisection algo-rithm to be exact by updating lower and upper boundsinto realizable values after each probe We proposed anew iterative-refinement scheme that is very fast forsmall-to-medium numbers of processorsWe investigated two distinct applications for experi-

mental performance evaluation 1D decomposition ofirregularly sparse matrices for parallel matrixndashvectormultiplication and decomposition for image-spaceparallel volume rendering Experiments on the sparsematrix dataset showed that 64-way decompositions canbe computed 100 times faster than a single sparse matrixvector multiplication while reducing the load imbalanceby a factor of four over the most effective heuristicExperimental results on the volume rendering datasetshowed that exact algorithms can produce 38 timesbetter 64-way decompositions than the most effectiveheuristic while being only 11 slower on average

8 Availability

The methods proposed in this work are implemented inC programming language and are made publicly availableat httpwwwcseuiucedu~alipinarccp

References

[1] S Anily A Federgruen Structured partitioning problems Oper

Res 13 (1991) 130ndash149

[2] C Aykanat F Ozguner Large grain parallel conjugate gradient

algorithms on a hypercube multiprocessor in Proceedings of the

1987 International Conference on Parallel Processing Penn State

University University Park PA 1987 pp 641ndash645

[3] C Aykanat F Ozguner F Ercal P Sadayappan Iterative

algorithms for solution of large sparse systems of linear equations

on hypercubes IEEE Trans Comput 37 (12) (1988) 1554ndash1567

[4] SH Bokhari Partitioning problems in parallel pipelined and

distributed computing IEEE Trans Comput 37 (1) (1988) 48ndash57

[5] UV -Catalyurek C Aykanat Hypergraph-partitioning-based

decomposition for parallel sparse-matrix vector multiplication

IEEE Trans Parallel Distributed Systems 10 (7) (1999) 673ndash693

[6] H Choi B Narahari Algorithms for mapping and partitioning

chain structured parallel computations in Proceedings of the

1991 International Conference on Parallel Processing Austin

TX 1991 pp I-625ndashI-628

[7] GN Frederickson Optimal algorithms for partitioning trees and

locating p-centers in trees Technical Report CSD-TR-1029

Purdue University (1990)

[8] GN Frederickson Optimal algorithms for partitioning trees in

Proceedings of the Second ACM-SIAM Symposium Discrete

Algorithms Orlando FL 1991

[9] MR Garey DS Johnson Computers and Intractability A

Guide to the Theory of NP-Completeness WH Freeman San

Fransisco CA 1979

[10] MR Garey DS Johnson L Stockmeyer Some simplified np-

complete graph problems Theoret Comput Sci 1 (1976) 237ndash267

[11] DM Gay Electronic mail distribution of linear programming test

problems Mathematical Programming Society COAL Newsletter

[12] Y Han B Narahari H-A Choi Mapping a chain task to

chained processors Inform Process Lett 44 (1992) 141ndash148

[13] P Hansen KW Lih Improved algorithms for partitioning

problems in parallel pipelined and distributed computing IEEE

Trans Comput 41 (6) (1992) 769ndash771

[14] B Hendrickson T Kolda Rectangular and structurally non-

symmetric sparse matrices for parallel processing SIAM J Sci

Comput 21 (6) (2000) 2048ndash2072

[15] MA Iqbal Efficient algorithms for partitioning problems in

Proceedings of the 1990 International Conference on Parallel

Processing Urbana Champaign IL 1990 pp III-123ndash127

[16] MA Iqbal Approximate algorithms for partitioning and assign-

ment problems Int J Parallel Programming 20 (5) (1991) 341ndash361

[17] MA Iqbal SH Bokhari Efficient algorithms for a class of

partitioning problems IEEE Trans Parallel Distributed Systems

6 (2) (1995) 170ndash175

[18] MA Iqbal JH Saltz SH Bokhari A comparative analysis of

static and dynamic load balancing strategies in Proceedings of the

1986 International Conference on Parallel Processing Penn State

Univ University Park PA 1986 pp 1040ndash1047

[19] V Kumar A Grama A Gupta G Karypis Introduction to Parallel

Computing Benjamin Cummings Redwood City CA 1994

[20] H Kutluca TM Kur C Aykanat Image-space decompositionalgorithms for sort-first parallel volume rendering of unstructured

grids J Supercomput 15 (1) (2000) 51ndash93

[21] T Lengauer Combinatorial Algorithms for Integrated Circuit

Layout Wiley Chichester UK 1990

[22] Linear programming problems Technical Report IOWA Opti-

mization Center ftpcolbizuiowaedupubtestproblpgondzio

[23] F Manne T S^revik Optimal partitioning of sequences

J Algorithms 19 (1995) 235ndash249

[24] S Miguet JM Pierson Heuristics for 1d rectilinear partitioning

as a low cost and high quality answer to dynamic load balancing

Lecture Notes in Computer Science Vol 1225 Springer Berlin

1997 pp 550ndash564

[25] S Molnar M Cox D Ellsworth H Fuchs A sorting classification

of parallel rendering IEEE Comput Grap Appl 14 (4) (1994) 23ndash32

[26] C Mueller The sort-first rendering architecture for high-

performance graphics in Proceedings of the 1995 Symposium

on Interactive 3D Graphics Monterey CA 1995 pp 75ndash84

[27] DM Nicol Rectilinear partitioning of irregular data parallel

computations J Parallel Distributed Computing 23 (1994)

119ndash134

[28] DM Nicol DR OrsquoHallaron Improved algorithms for mapping

pipelined and parallel computations Technical Report 88-2

ICASE 1988

[29] DM Nicol DR OrsquoHallaron Improved algorithms for mapping

pipelined and parallel computation IEEE Trans Comput 40 (3)

(1991) 295ndash306

ARTICLE IN PRESSA Pınar C Aykanat J Parallel Distrib Comput 64 (2004) 974ndash996996

[30] B Olstad F Manne Efficient partitioning of sequences IEEE

Trans Comput 44 (11) (1995) 1322ndash1326

[31] JR Pilkington SB Baden Dynamic partitioning of non-

uniform structured workloads with spacefilling curves IEEE

Trans Parallel Distributed Systems 7 (3) (1996) 288ndash299

[32] Y Saad Practical use of polynomial preconditionings for the

conjugate gradient method SIAM J Sci Statist Comput 6 (5)

(1985) 865ndash881

[33] MU Ujaldon EL Zapata BM Chapman HP Zima Vienna-

Fortranhpf extensions for sparse and irregular problems and

their compilation IEEE Trans Parallel Distributed Systems 8

(10) (1997) 1068ndash1083



[30] B. Olstad, F. Manne, Efficient partitioning of sequences, IEEE Trans. Comput. 44 (11) (1995) 1322–1326.

[31] J.R. Pilkington, S.B. Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE Trans. Parallel Distributed Systems 7 (3) (1996) 288–299.

[32] Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Statist. Comput. 6 (5) (1985) 865–881.

[33] M.U. Ujaldon, E.L. Zapata, B.M. Chapman, H.P. Zima, Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation, IEEE Trans. Parallel Distributed Systems 8 (10) (1997) 1068–1083.

[34] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Parallelization techniques for sparse matrix applications, J. Parallel Distributed Computing 38 (1996) 256–266.

[35] M.U. Ujaldon, E.L. Zapata, S.D. Sharma, J. Saltz, Experimental evaluation of efficient sparse matrix distributions, in: Proceedings of the ACM International Conference of Supercomputing, Philadelphia, PA, 1996, pp. 78–86.

Ali Pınar's research interests are combinatorial scientific computing, the boundary between scientific computing and combinatorial algorithms, and high performance computing. He received his B.S. and M.S. degrees in Computer Science from Bilkent University, Turkey, and his Ph.D. degree in Computer Science, with the option of Computational Science and Engineering, from the University of Illinois at Urbana-Champaign. During his Ph.D. studies he frequently visited Sandia National Laboratories in New Mexico, which had a significant influence on his education and research. He joined the Computational Research Division of Lawrence Berkeley National Laboratory in October 2001, where he currently works as a research scientist.

Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the Ph.D. degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his Ph.D. studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989 he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing; parallel scientific computing and its combinatorial aspects; parallel computer graphics applications; parallel data mining; graph and hypergraph partitioning; load balancing; neural network algorithms; high performance information retrieval and GIS systems; parallel and distributed web crawling; distributed databases; and grid computing. He is a member of the ACM and the IEEE Computer Society. He has recently been appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).