Construction of Optimal Multicast Trees Based on the Parameterized Communication Model


Ju-Young L. Park and Hyeong-Ah Choi
Department of EE and Computer Science
George Washington University
Washington, DC 20052
{jlpark, [email protected]}

Natawut Nupairoj and Lionel M. Ni
Department of Computer Science
Michigan State University
East Lansing, MI 48824-1027
{nupairoj, [email protected]}

Abstract

Many tree-based multicast algorithms have been proposed to provide an efficient software implementation on parallel platforms without hardware multicast support. These algorithms are either architecture-dependent (not portable) or architecture-independent (portable) but do not provide good performance when ported to different parallel platforms. Based on the LogP model, the proposed parameterized communication model can more accurately characterize the communication network of parallel platforms. The model encompasses a number of critical system parameters which can be easily measured on a given parallel platform. Based on the model, efficient methods to construct optimal multicast trees are proposed for both 1-port and α-port communication architectures. Experimental results conducted on the IBM/SP at Argonne National Laboratory are presented to compare the performance of the optimal multicast tree with two other known tree-based multicast algorithms. We claim that our proposed multicast algorithms can be ported to different parallel platforms and provide near-optimal performance, as the truly machine-specific optimal performance is achievable only when the underlying detailed network characteristics are considered.

Keywords: Multicast Communication, Collective Communication, Optimal Multicast Tree, Message Passing Parallel Computers, Parameterized Communication Model.

1 Introduction

Multicast is an important system-level one-to-many collective communication service. Several collective communication services (e.g., MPI [1]) such as broadcast and scatter are a subset or a derivation of multicast. Due to the importance of multicast, efficient implementation of multicast has been extensively studied in the past (e.g., [2, 3, 4, 5]). While hardware multicast is desirable and has received much attention recently [6], most systems use software in the form of a communication library to support multicast atop existing point-to-point communication hardware [7, 8, 9]. However, most current implementations have the following drawbacks.

- The design may not be scalable to large-scale systems.
- The design is architecture-dependent and thus not portable to other platforms.
- For architecture-independent designs, the performance varies significantly from one platform to another.

Obviously, the optimal performance achievable from a parallel machine is architecture-dependent. We can always tune an implementation to get better performance based on machine-specific characteristics, such as network topology. The main contribution of this paper is to characterize each parallel machine by a small set of critical machine-specific parameters. Our implementation of the multicast communication library is based on these system parameters, which are considered during the compilation phase of the library. Thus, the proposed multicast algorithms are portable and still take these key system parameters into account. Consequently, the achievable performance, although perhaps not optimal, will be close to optimal.

The objective of this paper is to propose multicast algorithms that can be ported to any message-based parallel computing platform regardless of its underlying switching techniques, network topologies, communication protocols, and host interface designs. These platforms include both scalable parallel computers and workstation clusters. At the same time, the algorithms can achieve good performance in an architecture-independent manner. Although our focus is on multicast communication, the proposed technique can also be applied to other communication libraries.

This paper is organized as follows. Section 2 discusses the parameterized communication model which serves as a basis throughout the paper. Section 3 describes an efficient technique to find optimal multicast trees. In Section 4, we present and compare experimental results of running three different multicast trees on the IBM/SP at Argonne National Laboratory. Section 5 extends the proposed technique to the α-port communication architecture and the construction of the corresponding optimal multicast tree. Finally, the paper is concluded in Section 6.

2 The Parameterized Communication Model

Different parallel processing platforms have different processing capabilities, network interconnect architectures, host interfaces, operating systems, etc. For example, some machines use direct networks, such as meshes and hypercubes; some use indirect networks, such as fat trees and multistage networks; some workstation clusters use shared-medium networks, such as Ethernet and FDDI; and some workstation clusters use switch-based networks, such as ATM switches and Myrinet wormhole switches. Switching techniques also play an important role in communication performance. Software latency is another significant contributor to the overall communication latency. Various techniques, such as active messages [10] and direct memory mapping, have been used to reduce software latencies.

An abstract parallel architecture model should carry a sufficient number of parameters to characterize the important features of different parallel architectures. However, too many parameters may defeat the purpose of having an abstract model and will complicate the design of machine-independent software. The proposed parameterized communication model, which is an extension of the LogP model [11], consists of a few critical machine-specific parameters as described below.

2.1 Point-to-Point Communication

Point-to-point communication is fundamental to all communication subsystems and forms a basis for efficient implementation of collective communication services, especially those involving a software implementation. A point-to-point communication service involves a sender and a receiver. The send is typically a blocking send, because the purpose of a nonblocking send is to allow overlapping between communication and computation; if a nonblocking send is not used properly, the effect of buffer overflow may prevent us from accurately measuring the network performance. Similarly, a blocking receive is used.

The model for point-to-point communication involves three parameters: sending latency (t_send), receiving latency (t_recv), and network latency (t_net). t_send is the software latency in processing the outgoing message at the sender, which includes the overhead of packetization, checksum computing, and, possibly, memory copying. t_recv, similar to t_send, is the software overhead at the receiver. t_net is the time required to transmit the message across the network. In general, several significant factors contribute to t_net, including network bandwidth, the underlying switching mechanism, and blocking time, which is mainly due to network contention. In order to measure the network latency accurately, benchmarking must be conducted in a controlled environment such that the effect of network contention due to other unrelated messages can be avoided.

Figure 1. The point-to-point communication model.

Figure 1 illustrates the timing of sending a single message from the sender (P0) to the receiver (P1). We also define two additional parameters: end-to-end latency (t_end) and holding time (t_hold). t_end is the interval from when the sender starts sending a message until the receiver finishes receiving it. Hence,

    t_end = t_send + t_net + t_recv                                    (1)

t_hold is the minimum time interval between two consecutive send or receive operations. The value of t_hold depends on how the blocking send is defined and implemented. In some designs, a blocking send is considered complete when it is guaranteed that the receiver has received the message. In this case, t_hold can be greater than t_send + t_net, as t_hold may include the overhead of the acknowledgment from the receiver (if a sliding window protocol is used, the average value of t_hold is used). In other designs, a blocking send is considered complete when the sending buffer can be reused, as in MPI [1]. Some techniques, such as separate buffering, DMA, and a dedicated communication processor, can hide some of the sending overhead from the sending processor; in this case, the holding time is less than the sending latency. One example is MPI's buffered sending mode, MPI_Bsend. MPI also defines an additional sending mode, the synchronous sending mode MPI_Ssend. The return of MPI_Ssend informs the sender not only that the buffer can be reused, but also that the receiver is aware of the message and has started receiving it. In this particular mode, t_hold is also greater than t_send + t_net.

Measuring t_send, t_net, and t_recv individually is rather difficult. In order to get an accurate measurement, special techniques, such as a hardware monitor or a software probe, are

required. However, evaluating t_hold and t_end is quite easy and can be performed at the user-application level [12]. Table 1 summarizes the two latencies measured on the IBM/SP at Argonne National Laboratory [13]. (The processors in the IBM/SP at Argonne are based on the SP-1 technology, while the network is based on the SP-2 technology.) For simplicity, we benchmarked the network with no other workload present. As the table shows, both parameters depend on the message size. In general, sending and receiving involve memory copying and checksum computing; hence, t_hold and t_end increase as the message size increases.

    Message Size    t_hold (μsec)     t_end (μsec)
    m bytes         20 + 0.02·m       55 + 0.07·m

Table 1. The measured latencies of the IBM/SP at Argonne National Lab.

The communication performance can be predicted from t_hold and t_end, since most communication services are based on the send and receive operations. t_hold represents the latency of invoking the send operation, and t_end reflects the delay of delivering a message across the network.
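To make the prediction step concrete, the following sketch encodes the linear latency model of Table 1. The function names are ours for illustration; the coefficients are the measured values from the table.

```python
# Minimal sketch of the parameterized point-to-point model, using the
# coefficients measured on the Argonne IBM/SP (Table 1).
# Function names are illustrative, not from the paper's code.

def t_hold(m: int) -> float:
    """Holding time (usec) for an m-byte message: min gap between sends."""
    return 20 + 0.02 * m

def t_end(m: int) -> float:
    """End-to-end latency (usec): t_send + t_net + t_recv for m bytes."""
    return 55 + 0.07 * m

if __name__ == "__main__":
    for m in (1, 1024):
        print(f"{m:>5} bytes: t_hold = {t_hold(m):7.2f} us, t_end = {t_end(m):7.2f} us")
```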

2.2 Multicast Communication

In this study, the definition of multicast is based on MPI's broadcast function (MPI_Bcast), which is a blocking operation. According to MPI's specification, the blocking operation is considered complete when the sending buffer can be reused. Thus, when MPI_Bcast returns, it does not guarantee that all processes in the group have already received the message, which makes the measurement of multicast latency difficult [12].

Most multicast communication is implemented in software, in which the processors in the group form a tree-like structure that dictates the ordering of the communication. Based on the multicast tree structure, each processor uses the point-to-point communication service to forward the message to each of its children in turn. Assuming the so-called one-port communication architecture, which is typical of most communication systems, a processor can send to only one destination (or one child) at a time; each such send is referred to as a multicast step.

Different multicast tree structures may exhibit different performance. Figure 2 shows three different multicast trees and the corresponding timing diagrams. The first example is a sequential tree, also known as separate addressing. In this approach, the root node (P0) sends a separate message to each of the three processors in turn, which requires three multicast steps. This approach was used to implement the multicast function, Xmsend, in the Symult 2010 [14]. Figure 2(b) is a binomial tree, based on the recursive-doubling technique [15]. In this approach, the number of processors that have already received the message increases by a factor of two after each multicast step. In Figure 2(b), the root node (P0) sends a message to its first child (P1) in the first step. In the second step, both P0 and P1 send messages to the other two processors, P2 and P3. The last example is a chain tree. As shown in Figure 2(c), the root node (P0) sends a message to its first child (P1) in the first multicast step. In the second step, P1 forwards the message to its successor, P2, in the chain. The multicast operation is done when the message has been forwarded all the way to the last processor in the chain, which is P3 in this example.


Figure 2. Examples of three multicast trees: (a) sequential tree, (b) binomial tree, and (c) chain tree (t_send = 2, t_net = 1, t_recv = 2, and t_hold = 2).

Theoretically, the performance of a multicast tree depends on the number of multicast steps required, which is a function of the group size. For a group size of k, the sequential tree and the chain tree require k - 1 multicast steps, while the binomial tree requires ⌈log2 k⌉ multicast steps. By the number of multicast steps alone, the binomial tree is the best. In practice, however, the performance of each multicast tree depends on the software latency of the underlying network. Consider the examples in Figure 2, where we assume t_send = 2, t_net = 1, t_recv = 2, and t_hold = t_send. When the sending and receiving latencies are taken into account, the sequential tree has the best performance.

Similar to the point-to-point communication model, the multicast model has two important parameters: the multicast holding time (t_mhold) and the multicast latency (t_mcast).

t_mhold is defined as the minimum time interval between two consecutive multicast operations; in other words, it is the earliest time at which the root node can resume execution after issuing a multicast operation. t_mcast is the elapsed time from when the multicast is issued until the last processor receives the message. For the three multicast trees in Figure 2, t_mhold equals 6, 4, and 2, respectively, and t_mcast equals 9, 10, and 15, respectively.

Given a multicast algorithm, its t_mhold and t_mcast can be estimated from t_hold and t_end. Table 2 contains the multicast communication models of the three multicast trees. Note that these parameters can be measured by using the ping and ping-pong benchmarks mentioned in [12].

    Multicast Tree    t_mhold             t_mcast
    Sequential        (k - 1)·t_hold      (k - 2)·t_hold + t_end
    Binomial          ⌈log2 k⌉·t_hold     ⌊k / (3·2^(⌊log2 k⌋ - 1))⌋·t_hold + ⌊log2 k⌋·t_end
    Chain             t_hold              (k - 1)·t_end

Table 2. The multicast communication models of the sequential tree, binomial tree, and chain tree, where k is the number of nodes in the group.
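To make the Table 2 models concrete, the sketch below (ours; a direct transcription of the t_mcast column, with illustrative names) evaluates all three formulas. With the 1-byte IBM/SP parameters from Table 1 (t_hold = 20, t_end = 55), the sequential tree already beats the binomial tree for very small groups, while the binomial tree wins as the group grows, which motivates the question posed in Section 3.

```python
import math

# Direct transcription (ours) of the t_mcast formulas in Table 2.
def tmcast_sequential(k, t_hold, t_end):
    return (k - 2) * t_hold + t_end

def tmcast_binomial(k, t_hold, t_end):
    lg = math.floor(math.log2(k))
    return (k // (3 * 2 ** (lg - 1))) * t_hold + lg * t_end

def tmcast_chain(k, t_hold, t_end):
    return (k - 1) * t_end

# 1-byte messages on the IBM/SP (Table 1): t_hold = 20, t_end = 55.
for k in (4, 8, 16):
    print(k, tmcast_sequential(k, 20, 55),
             tmcast_binomial(k, 20, 55),
             tmcast_chain(k, 20, 55))
# -> 4 95 110 165 / 8 175 165 385 / 16 335 220 825
```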

3 Optimal Multicast Trees

The discussion in Section 2.2 leads to an interesting question: since the theoretically optimal binomial tree is not optimal in practice, can we develop an optimal multicast tree by considering the two important network parameters, t_hold and t_end, for a given message size? This section addresses this issue.

Consider the construction of an optimal multicast tree with k nodes, (P0, P1, ..., Pk-1), where P0 is the root node. In the first step, P0 sends a message to Pj. Before Pj is ready to forward the received multicast message to another node in the tree, P0 may be able to send the same message to other nodes in the tree, depending on the values of t_hold and t_end. When Pj is ready to send messages, there are essentially two multicast sub-trees: a j-node tree (P0, P1, ..., Pj-1) rooted at P0 and a (k - j)-node tree (Pj, Pj+1, ..., Pk-1) rooted at Pj, as shown in Figure 3. The issue now becomes which node should belong to which subtree.

In order to construct an optimal multicast tree, two requirements must be satisfied. First, the node Pj must be chosen such that the generated multicast tree is optimal. Second, the two subtrees (P0, P1, ..., Pj-1) and (Pj, Pj+1, ..., Pk-1) must themselves be optimal. The same procedure is then applied recursively to each of the multicast sub-trees. Thus, this optimization problem exhibits two properties: optimal substructure and overlapping subproblems. With these two properties present, the dynamic programming technique is applicable for finding an optimal solution [16].

Figure 3. An optimal multicast tree with k nodes consists of two optimal multicast sub-trees.

Let t[i] (for each i, 1 ≤ i ≤ k) be the minimum latency required to multicast a message among i nodes Pa, Pa+1, ..., Pa+i-1 (for an a, 0 ≤ a ≤ k - i) with node Pa as the root node. We define t[i] recursively as follows. If i = 1, there is only one node in the tree, so t[i] = 0. When i > 1, Pa sends the message to some node Pa+j for 1 ≤ j ≤ i - 1. After t_hold time units, Pa can continue transmitting to the nodes in its subtree (Pa, Pa+1, ..., Pa+j-1), and after t_end time units, Pa+j has received the message and can start transmitting to the nodes in its subtree (Pa+j, Pa+j+1, ..., Pa+i-1). Thus, the multicast latency among the i nodes is the maximum of the multicast latency of the subtree (Pa, Pa+1, ..., Pa+j-1) plus t_hold and the multicast latency of the subtree (Pa+j, Pa+j+1, ..., Pa+i-1) plus t_end. Note that t_hold is needed in the former because Pa must send a message to Pa+j before it sends messages to other nodes in its subtree, and t_end is needed in the latter because Pa+j has to receive the message from Pa before it can multicast the message to other nodes in its subtree. Therefore, we have

    t[i] = max( t[j] + t_hold, t[i-j] + t_end )

To ensure optimality, we must choose the node Pa+j such that the multicast latency is minimal. Thus, we have the following recurrence for t[i]:

    t[i] = 0                                                              if i = 1
    t[i] = min_{1 ≤ j ≤ i-1} max( t[j] + t_hold, t[i-j] + t_end )         if i > 1
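The recurrence can be evaluated bottom-up directly, mirroring the O(k^2) procedure described in the text below. The following sketch is ours (function names are illustrative, not from the paper):

```python
# Direct O(k^2) dynamic program for the 1-port recurrence.
# t_hold and t_end are the measured machine parameters.

def optimal_multicast_latency(k: int, t_hold, t_end) -> list:
    """Return t[0..k], where t[i] is the optimal multicast latency for i nodes."""
    t = [0] * (k + 1)                 # t[1] = 0: a single node needs no communication
    for i in range(2, k + 1):
        # Try every split: j nodes stay under the root, i - j go to the new child.
        t[i] = min(max(t[j] + t_hold, t[i - j] + t_end)
                   for j in range(1, i))
    return t

# Example with the 1-byte IBM/SP parameters (Table 1): t_hold = 20, t_end = 55.
# This reproduces the t[i] column of Table 3 below, e.g. t[9] = 135.
print(optimal_multicast_latency(9, 20, 55)[9])   # -> 135
```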

The optimal multicast latency of a k-node tree is t[k], and it can be computed in O(k^2) time by dynamic programming. In particular, we compute the values in the order t[1], t[2], ..., t[k] and store them in an array. To compute t[i] (for 1 < i ≤ k), we consider each value of j from 1 to i - 1 and determine the value of j for which max( t[j] + t_hold, t[i-j] + t_end ) is minimized. Thus the total running time is Σ_{i=1}^{k-1} i = O(k^2).

To improve the running time to O(k), we limit the choices of j that need to be considered at each iteration. This is based on an observation about the nature of t[i], described in the following lemma.

Lemma 1. Let j_i denote the value of j for which the recursive definition of t[i] achieves its minimum value (i.e., the best way to split a tree with i nodes is into subtrees of sizes j_i and i - j_i). Then j_2 = 1 and, for 2 ≤ i ≤ k - 1, j_{i+1} = j_i + 1 or j_{i+1} = j_i.

The above lemma is the special case of Lemma 2, which will be discussed in Section 5; thus, the proof of Lemma 2 also proves Lemma 1. In the following, we revise our dynamic programming algorithm based on Lemma 1:

    t[i] = 0          if i = 1
    t[i] = t_end      if i = 2
    t[i] = min( max( t[j_{i-1}] + t_hold,     t[i - j_{i-1}] + t_end ),
                max( t[j_{i-1} + 1] + t_hold, t[i - 1 - j_{i-1}] + t_end ) )    if i ≥ 3

where j_2 = 1.

Table 3 shows an example of the revised algorithm for k = 9, t_hold = 20, and t_end = 55. An optimal multicast tree corresponding to Table 3 is shown in Figure 4.

    i    j_i    i - j_i    t[j_i] + t_hold    t[i - j_i] + t_end    t[i]
    1    -      -          -                  -                     0
    2    1      1          20                 55                    55
    3    2      1          75                 55                    75
    4    3      1          95                 55                    95
    5    3      2          95                 110                   110
    6    4      2          115                110                   115
    7    5      2          130                110                   130
    8    5      3          130                130                   130
    9    6      3          135                130                   135

Table 3. k = 9, t_hold = 20, and t_end = 55.

Figure 4. An optimal multicast tree with 9 nodes (t_hold = 20 and t_end = 55).

From this revised algorithm, we obtain the following theorem.

Theorem 1. The optimal multicast latency of a k-node tree can be computed in O(k) time.
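Lemma 1 restricts the split point at step i to j_{i-1} or j_{i-1} + 1, which yields the O(k) variant. A sketch (again ours, with illustrative names) that also records the split points:

```python
# O(k) variant based on Lemma 1: at step i, only j = j_{i-1} and
# j = j_{i-1} + 1 need to be examined.

def optimal_multicast_linear(k: int, t_hold, t_end):
    """Return (t, j), where t[i] is the optimal latency and j[i] the split for i nodes."""
    t = [0] * (k + 1)
    j = [0] * (k + 1)
    if k >= 2:
        t[2], j[2] = t_end, 1
    for i in range(3, k + 1):
        stay = max(t[j[i - 1]] + t_hold, t[i - j[i - 1]] + t_end)
        grow = max(t[j[i - 1] + 1] + t_hold, t[i - 1 - j[i - 1]] + t_end)
        # Ties may be broken either way without affecting t[i].
        if grow < stay:
            t[i], j[i] = grow, j[i - 1] + 1
        else:
            t[i], j[i] = stay, j[i - 1]
    return t, j

t, j = optimal_multicast_linear(9, 20, 55)
print(t[9], j[9])   # -> 135 6, matching the last row of Table 3
```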

4 Experiments

We conducted our experiments on the 128-node IBM/SP at Argonne National Laboratory. Our implementation uses the MPI-F library, version 1.41 [17]. In order to fully utilize the high-performance switch, the library euilib with option us is used. Each data point in our results is the minimum of 1,000 measurements, to reduce the overhead due to network contention from other programs (it is difficult to reserve the whole machine for our benchmarking). Readers may refer to [12] for techniques for benchmarking multicast communication latency.

In this section, we discuss the results of our experiments, measuring the performance of three multicast trees with respect to two different message sizes: the popular binomial tree used by many researchers, the block-based binomial tree used in MPI-F, and the optimal multicast tree proposed in this paper. The sequential tree and chain tree are not considered, as they are known to perform poorly on the IBM/SP [12].

4.1 Block-Based Binomial Tree

We benchmarked MPI_Bcast of the MPI-F library, which is based on the block-based binomial tree. This tree is a combination of the binomial and sequential trees. First, some members of the process group are partitioned into fixed-size blocks, where the number of blocks must be a power of 2; thus, some processors may not belong to any block. Then, the root node multicasts the message to the first processor of each block using the binomial tree. The first processor of each block then sends the message to the other members of its block using the sequential tree. Finally, the remaining nodes, say m of them, not belonging to any block are taken care of by nodes 0 to m - 1. The parameter blocksize determines the shape of the tree: if blocksize is one, the tree is a binomial tree; if blocksize equals the group size, the tree is equivalent to a sequential tree.

Figure 5 shows the tree used in MPI-F for a group size of 9, where blocksize is 3. In this figure, there are two blocks, {P0, P1, P2} and {P3, P4, P5}, where P0 and P3 are the first processors of the two blocks, and the two blocks form a two-node binomial tree. After P0 sends the message to P3, both processors send messages to the other processors in their blocks. Then P0, P1, and P2 send messages to the remaining nodes, which are P6, P7, and P8. A sketch of this schedule appears below.
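The following sketch is our reconstruction of the schedule from the description above, not MPI-F source code; the function name and phase grouping are ours. Each phase lists (sender, receiver) pairs, with every sender issuing at most one send per phase.

```python
# Reconstruction (ours) of the block-based binomial tree schedule described above.
# Ranks 0..k-1; nblocks must be a power of 2; ranks beyond nblocks*blocksize
# form the "remainder" served by ranks 0..m-1.

def block_based_schedule(k: int, blocksize: int, nblocks: int):
    phases = []
    leaders = [b * blocksize for b in range(nblocks)]
    # Phase 1: binomial tree among the block leaders (recursive doubling).
    done = 1
    while done < nblocks:
        phases.append([(leaders[s], leaders[s + done])
                       for s in range(done) if s + done < nblocks])
        done *= 2
    # Phase 2: each leader sends sequentially to the rest of its block.
    for offset in range(1, blocksize):
        phases.append([(lead, lead + offset) for lead in leaders])
    # Phase 3: remainder nodes are served by ranks 0..m-1.
    first_rem = nblocks * blocksize
    phases.append([(i, first_rem + i) for i in range(k - first_rem)])
    return phases

for step, sends in enumerate(block_based_schedule(9, 3, 2), 1):
    print(f"step {step}: {sends}")
# step 1: [(0, 3)]                   leader-to-leader (binomial)
# step 2: [(0, 1), (3, 4)]           intra-block, sequential
# step 3: [(0, 2), (3, 5)]
# step 4: [(0, 6), (1, 7), (2, 8)]   remainder
```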


Figure 5 also shows the measured latency at each processor, recorded when the processor received the message. From the timing diagram, P8 is the last node to receive the message, which defines the multicast latency.

Figure 5. The measured latency of MPI_Bcast of the MPI-F library on the IBM/SP.

4.2 Comparison of Three Multicast Trees

We then compare the performance of the three multicast trees with respect to two different message sizes: 1 byte and 1K bytes. Since both the chain tree and the sequential tree show poor performance, we consider only the optimal tree presented in Section 3, the popular binomial tree used by many researchers, and the block-based binomial (MPI-F) tree implemented in MPI-F. Figure 6 illustrates the multicast latency (t_mcast) of the three trees. For the binomial tree, we were unable to measure the performance for group sizes greater than 24 due to the recent heavy use of the IBM/SP at Argonne (we will complete the measurements in the final paper). However, it is


clear that the performance of the binomial tree is the worst among the three trees. As the number of processors in a group increases, the multicast latency of all three trees increases. All three trees exhibit very close performance when the number of processors is small. In all cases, the optimal multicast tree proposed in this paper provides the best performance. When the message is 1 byte, the MPI-F tree has slightly higher latency than the optimal tree. When the message size is 1024 bytes, the performance improvement of the optimal multicast tree over the MPI-F tree is noticeable, which indicates that the proposed optimal multicast tree, even without considering the underlying network topology, performs well on the IBM/SP.


Figure 6. The multicast latency (t_mcast) of the three multicast trees with message sizes of 1 byte and 1K bytes, respectively.

5 Extension to the α-Port Communication Model

Our previous discussion was based on the popular 1-port communication architecture. In some parallel machines, such as the TMC CM-5 and the Intel/CMU iWARP, each processor has multiple communication ports. This section generalizes the construction of optimal multicast trees to the α-port communication model.

As defined earlier, t_hold is the time interval between two consecutive send operations through a single port. Let t_int denote the time interval after which a processor can initiate another send operation from a different port. Obviously, we require t_int < t_hold; otherwise, multiple ports provide no benefit.

Given t_int and t_hold, the maximum number of ports, α_max, must satisfy the following inequality:

    (α_max - 1)·t_int < t_hold ≤ α_max·t_int

The proposed α-port optimal multicast algorithm can be applied for any value of α ≤ α_max. Consider the construction of a multicast tree with k nodes (P0, P1, ..., Pk-1). The O(k)-time dynamic programming algorithm discussed in Section 3 for α = 1 will be generalized to an O(αk)-time algorithm for an arbitrary value of α ≤ α_max.

Let t[i] (for each i, 1 ≤ i ≤ k) denote the minimum latency required to multicast a message among i nodes. We have t[1] = 0 and t[2] = t_end. For i ≥ 3, t[i] can be defined recursively as follows. Consider a multicast tree with i nodes. The source node P0 sends the message to α nodes using α different ports. Here we assume that node P0 sends the message to node P_{q_r} for r = 1, 2, ..., α, in this order. That is, if node P0 transmits to node P_{q_1} at time T, then node P0 can transmit to node P_{q_r} (2 ≤ r ≤ α) after time T + (r - 1)·t_int. Thus, the multicast tree with i nodes has α + 1 subtrees, rooted at nodes P0, P_{q_1}, P_{q_2}, ..., P_{q_α}. Suppose the subtree rooted at node P_{q_r} has j_i^r nodes for 1 ≤ r ≤ α, and the subtree rooted at node P0 has j_i nodes. (Throughout this section, the subtree with j_i^r (or j_i) nodes will denote the subtree rooted at node P_{q_r} (respectively, P0) without any confusion.) This implies that

    Σ_{r=1}^{α} j_i^r = i - j_i                                    (2)

Figure 7 shows the multicast tree with i nodes partitioned into α + 1 subtrees.

Figure 7. An optimal multicast tree using α ports.
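The inequality pins α_max down to the ceiling of t_hold / t_int. A one-line check (ours), using the parameter values that appear later in the Table 4 example:

```python
import math

# From (alpha_max - 1)*t_int < t_hold <= alpha_max*t_int,
# alpha_max = ceil(t_hold / t_int).
def alpha_max(t_hold: float, t_int: float) -> int:
    return math.ceil(t_hold / t_int)

print(alpha_max(22, 10))   # -> 3, the port count used in the Table 4 example
```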


From the above discussion, we note that the multicast latency of the subtree rooted at node P_{q_r} (1 ≤ r ≤ α) is at least

    t[j_i^r] + t_end + (r - 1)·t_int                               (3)

and the multicast latency of the subtree rooted at node P0 is at least

    t[j_i] + t_hold                                                (4)

To ensure optimality, we must choose the values j_i^r for each r, 1 ≤ r ≤ α, such that the resulting multicast latency is minimal. Thus, we have the following recurrence for t[i]:

    t[i] = 0    if i = 1
    t[i] = min_{0 ≤ j_i^1, ..., j_i^α ≤ i-1} max( t[j_i] + t_hold,
               max_{1 ≤ r ≤ α} ( t[j_i^r] + t_end + (r - 1)·t_int ) )    if i > 1

Note that not all α ports need be used when computing t[i] for some values of i. For example, suppose the multicast tree with i nodes has α' + 1 subtrees (i.e., the root of the tree uses only α' ports) for some α' < α. In this case, j_i^r = 0 for α' < r ≤ α. The optimal multicast latency t[k] of a k-node tree can then be computed in O(α·k^{α+1}) time using the above recurrence. To verify this running time, observe that there are at most k^α possible values of j_i^1, ..., j_i^α, and for any fixed values of j_i^1, ..., j_i^α, t[i] can be computed using α comparisons. As 1 ≤ i ≤ k, it follows that t[k] can be computed in O(α·k^{α+1}) time.

Based on the following lemma, we improve the running time of our dynamic programming algorithm to O(αk) by limiting the choices of j_i^1, ..., j_i^α at each iteration.

Lemma 2. Let j_i, j_i^1, ..., j_i^α denote the values for which the recursive definition of t[i] achieves its minimum value (i.e., the best way to split a tree on i nodes is into subtrees of sizes j_i^1, ..., j_i^α and i - Σ_{r=1}^{α} j_i^r). Then j_2^1 = 1 and j_2^r = 0 for 2 ≤ r ≤ α; and for 2 ≤ i ≤ k - 1, (i) j_{i+1} = j_i or j_{i+1} = j_i + 1, and (ii) j_{i+1}^r = j_i^r or j_{i+1}^r = j_i^r + 1 for 1 ≤ r ≤ α.

Remark: For α = 1, the above lemma reduces to Lemma 1 in Section 3.

Proof. Suppose the lemma does not hold for some value of α, say α_0, with 1 ≤ α_0 ≤ α_max. Let i_0 be such that, for any optimal multicast tree with i_0 nodes, either j_{i_0} ≥ j_{i_0-1} + 2 or j_{i_0}^{r_1} ≥ j_{i_0-1}^{r_1} + 2 for some r_1, 1 ≤ r_1 ≤ α_0. Choose such an optimal multicast tree T0 with i_0 nodes (T0 has subtrees with j_{i_0}, j_{i_0}^1, ..., j_{i_0}^{α_0} nodes) with the minimum value of A(T0), where

    A(T0) = | j_{i_0} - j_{i_0-1} | + Σ_{r=1}^{α_0} | j_{i_0}^r - j_{i_0-1}^r |        (5)

Suppose j_{i_0} ≥ j_{i_0-1} + 2. Then there exists r_0 such that

    j_{i_0}^{r_0} ≤ j_{i_0-1}^{r_0} - 1                                      (6)

By deleting one node from the subtree with j_{i_0} nodes and adding one node to the subtree with j_{i_0}^{r_0} nodes, the multicast latency of the decreased subtree is

    t[j_{i_0} - 1] + t_hold ≤ t[j_{i_0}] + t_hold
                            ≤ t[i_0]                     (from the definition of t[i_0])

and the multicast latency of the increased subtree is

    t[j_{i_0}^{r_0} + 1] + t_end + (r_0 - 1)·t_int
        ≤ t[j_{i_0-1}^{r_0}] + t_end + (r_0 - 1)·t_int   (from Equation (6))
        ≤ t[i_0 - 1]                                     (from the definition of t[i_0 - 1])
        ≤ t[i_0]

The multicast latency of this new tree T1 is still optimal (i.e., t[i_0]) and A(T1) < A(T0). This contradicts the choice of T0.

Next, assume that j_{i_0}^{r_1} ≥ j_{i_0-1}^{r_1} + 2 for some r_1, 1 ≤ r_1 ≤ α_0. By deleting one node from the subtree with j_{i_0}^{r_1} nodes, the multicast latency of the resulting subtree is

    t[j_{i_0}^{r_1} - 1] + t_end + (r_1 - 1)·t_int
        ≤ t[j_{i_0}^{r_1}] + t_end + (r_1 - 1)·t_int
        ≤ t[i_0]                                         (from the definition of t[i_0])

We also note that, as j_{i_0}^{r_1} ≥ j_{i_0-1}^{r_1} + 2, either

    j_{i_0} ≤ j_{i_0-1} - 1                                                  (7)

or, for some r_0, 1 ≤ r_0 ≤ α_0,

    j_{i_0}^{r_0} ≤ j_{i_0-1}^{r_0} - 1                                      (8)

When Equation (7) holds, by adding the node deleted from the subtree with j_{i_0}^{r_1} nodes to the

subtree with j_{i_0} nodes, the multicast latency of the increased subtree is

    t[j_{i_0} + 1] + t_hold ≤ t[j_{i_0-1}] + t_hold      (from Equation (7))
                            ≤ t[i_0 - 1]                 (from the definition of t[i_0 - 1])
                            ≤ t[i_0]

Hence, the resulting multicast tree, say T2, is still optimal.

When Equation (8) holds, by adding the node deleted from the subtree with j_{i_0}^{r_1} nodes to the subtree with j_{i_0}^{r_0} nodes, the multicast latency of the increased subtree is

    t[j_{i_0}^{r_0} + 1] + t_end + (r_0 - 1)·t_int
        ≤ t[j_{i_0-1}^{r_0}] + t_end + (r_0 - 1)·t_int   (from Equation (8))
        ≤ t[i_0 - 1]                                     (from the definition of t[i_0 - 1])
        ≤ t[i_0]

Again, the resulting multicast tree, say T3, is optimal. Finally, we note that A(T2) < A(T0) in the former case and A(T3) < A(T0) in the latter case. Either case is a contradiction to the choice of T0. This completes the proof of the lemma.

Note that Lemma 2 implies t[i] = min(A, B) for i > 1, where

    A = max( t[i-1], t[j_{i-1} + 1] + t_hold )                               (9)

and

    B = min_{1 ≤ r ≤ α} max( t[i-1], t[j_{i-1}^r + 1] + t_end + (r - 1)·t_int )    (10)

Since Equation (10) implies

    B = max( t[i-1], min_{1 ≤ r ≤ α} ( t[j_{i-1}^r + 1] + t_end + (r - 1)·t_int ) ),

this, together with the fact that t[i] = min(A, B), gives

    t[i] = max( t[i-1], min( t[j_{i-1} + 1] + t_hold,
                min_{1 ≤ r ≤ α} ( t[j_{i-1}^r + 1] + t_end + (r - 1)·t_int ) ) )    (11)

We now revise our dynamic programming algorithm based on Equation (11); the revised algorithm runs in O(αk) time.
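For reference, the unrefined recurrence can be executed directly by brute force over all subtree-size vectors. The sketch below is ours and is exponential in α, unlike the O(αk) algorithm that follows; it is meant only to make the recurrence executable. With t_int = 10, t_hold = 22, t_end = 55, and α = 3, it yields, for example, t[3] = 65 and t[4] = 75, matching the corresponding entries in Table 4.

```python
from itertools import product

# Brute-force evaluation (ours, for illustration) of the alpha-port recurrence:
# t[i] = min over (j^1..j^alpha) of max( t[j] + t_hold,
#        max_r ( t[j^r] + t_end + (r-1)*t_int ) ), with j = i - sum(j^r) >= 1.

def alpha_port_latency(k, alpha, t_hold, t_end, t_int):
    t = [0] * (k + 1)
    for i in range(2, k + 1):
        best = None
        for js in product(range(i), repeat=alpha):
            j = i - sum(js)
            if j < 1 or all(jr == 0 for jr in js):
                continue   # root keeps >= 1 node; at least one port is used
            # enumerate() gives r = 0 for port 1, so r plays the role of (r-1).
            lat = max([t[j] + t_hold] +
                      [t[jr] + t_end + r * t_int
                       for r, jr in enumerate(js) if jr > 0])
            best = lat if best is None else min(best, lat)
        t[i] = best
    return t

t = alpha_port_latency(4, 3, t_hold=22, t_end=55, t_int=10)
print(t[2], t[3], t[4])   # -> 55 65 75
```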

    t[i] = 0          if i = 1
    t[i] = t_end      if i = 2
    t[i] = max( t[i-1], min( t[j_{i-1} + 1] + t_hold,
                min_{1 ≤ r ≤ α} ( t[j_{i-1}^r + 1] + t_end + (r - 1)·t_int ) ) )    if i ≥ 3

where j_2^1 = 1 and j_2^r = 0 for 2 ≤ r ≤ α.

Theorem 2. The optimal multicast latency of a k-node tree can be computed in O(αk) time when each node has α communication ports.

Table 4 shows an example for k = 12, t_int = 10, t_hold = 22, t_end = 55, and α = 3.

    i    j_i  j_i^1  j_i^2  j_i^3   t[j_{i-1}+1]   t[j_{i-1}^1+1]   t[j_{i-1}^2+1]    t[j_{i-1}^3+1]     t[i]
                                    + t_hold       + t_end          + t_end + t_int   + t_end + 2·t_int
    1    -    -      -      -       -              -                -                 -                  0
    2    1    1      0      0       -              -                -                 -                  55
    3    1    1      1      0       77             110              65                75                 65
    4    1    1      1      1       77             110              120               75                 75
    5    2    1      1      1       77             110              120               130                75
    6    3    1      1      1       87             110              120               130                87
    7    4    1      1      1       97             110              120               130                97
    8    5    1      1      1       97             110              120               130                97
    9    6    1      1      1       109            110              120               130                109
    10   6    2      1      1       119            110              120               130                110
    11   7    2      1      1       119            120              120               130                119
    12   8    2      1      1       119            120              120               130                119

Table 4. The optimal multicast tree of a 3-port communication architecture with k = 12, t_int = 10, t_hold = 22, and t_end = 55.

6 Conclusion

Designing a portable algorithm that achieves good performance on different parallel platforms is in high demand. Based on the proposed parameterized communication model, efficient methods to construct optimal multicast trees have been proposed for both 1-port and α-port communication architectures. Here the term "optimal" applies on the basis of the parameterized communication model, not on any specific parallel machine. The proposed communication model is more suitable for machines supporting cut-through switching and having a rich interconnection topology (i.e., the

network is able to support many simultaneous transmissions without much contention). For such parallel machines, we claim that the proposed architecture-independent multicast algorithm is near-optimal, as the performance of an algorithm can obviously be improved by considering machine-specific features, such as the network topology and routing algorithm, which cannot be captured as generic system parameters.

The proposed parameterized communication model is useful in the design of portable communication libraries. Many techniques used in this paper can be extended to implement other collective communication services, such as scatter. When a system changes or upgrades a critical component, such as new processors, a new host interface, or new communication protocols, it has to run benchmark programs to obtain new measurements of the system parameters. The corresponding communication library then has to be recompiled to take advantage of the new system parameters.

Acknowledgments

The authors wish to thank the Mathematics and Computer Science Division at Argonne National Laboratory for granting us the use of their 128-node IBM SP system.

References

[1] MPI Forum, MPI: A Message-Passing Interface Standard, Mar. 1994.

[2] X. Lin and L. M. Ni, "Multicast communication in multicomputer networks," IEEE Transactions on Parallel and Distributed Systems, pp. 1105-1117, Oct. 1993.

[3] H. Xu, Y. Gui, and L. M. Ni, "Optimal software multicast in wormhole-routed multistage networks," in Proceedings of Supercomputing '94, pp. 703-712, Nov. 1994.

[4] C.-T. Ho and M.-Y. Kao, "Optimal broadcast on hypercubes with wormhole and E-cube routings," in Proceedings of the 1993 International Conference on Parallel and Distributed Systems, pp. 694-697, 1993.

[5] D. K. Panda, S. Singal, and P. Prabhakaran, "Multidestination message passing mechanism conforming to base wormhole routing scheme," in Proc. of the First International Workshop on Parallel Computer Routing and Communication (PCRCW '94) (K. Bolding and L. Snyder, Eds.), pp. 131-145, Springer-Verlag, May 1994.

[6] L. M. Ni, "Should scalable parallel computers support efficient hardware multicast," in Proceedings of the 1995 ICPP Workshop on Challenges for Parallel Processing, Aug. 1995.

[7] B. Gropp, R. Lusk, T. Skjellum, and N. Doss, Portable MPI Model Implementation. Argonne National Laboratory, July 1994.

[8] P. K. McKinley, H. Xu, A. H. Esfahanian, and L. M. Ni, "Unicast-based multicast communication in wormhole-routed networks," IEEE Transactions on Parallel and Distributed Systems, vol. 5, pp. 1252-1265, Dec. 1994.

[9] W. Gropp and B. Smith, "Users manual for the Chameleon parallel programming tools," Tech. Rep. ANL-93/23, Argonne National Laboratory, June 1993.

[10] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, "Active messages: A mechanism for integrated communication and computation," in Proc. of the 19th Annual International Symposium on Computer Architecture, pp. 256-266, May 1992.

[11] D. Culler et al., "LogP: Towards a realistic model of parallel computation," in Proc. of the 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, May 1993.

[12] N. Nupairoj and L. M. Ni, "Benchmarking of multicast communication services," Tech. Rep. MSU-CPS-ACS-103, Department of Computer Science, Michigan State University, East Lansing, Michigan, Apr. 1995.

[13] W. Gropp, E. Lusk, and S. Pieper, "Users Guide for the ANL IBM SP-1, DRAFT," Tech. Rep. ANL/MCS-TM-00, Argonne National Laboratory, Feb. 1994.

[14] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W.-K. Su, "The architecture and programming of the Ametek Series 2010 multicomputer," in Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Volume I, (Pasadena, CA), pp. 33-36, Association for Computing Machinery, Jan. 1988.

[15] H. Sullivan and T. R. Bashkow, "A large scale, homogeneous, fully distributed parallel machine," in Proceedings of the 4th Annual Symposium on Computer Architecture, vol. 5, pp. 105-124, Mar. 1977.

[16] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, Massachusetts: The MIT Press, 1992.

[17] H. Franke, P. Hochschild, P. Pattnaik, and M. Snir, "MPI-F: An Efficient Implementation of MPI on IBM-SP1," in Proceedings of the 1994 International Conference on Parallel Processing, (St. Charles, IL), pp. 197-201, Aug. 1994.
