Unicast-based multicast communication in wormhole-routed networks


Unicast-Based Multicast Communication in Wormhole-Routed Networks

Philip K. McKinley, Hong Xu, Abdol-Hossein Esfahanian, and Lionel M. Ni

Technical Report MSU-CPS-ACS-57
January 1992 (Revised December 1993)

A short version of this report appeared in Proc. 1992 International Conference on Parallel Processing. A complete version has been accepted to appear in IEEE Transactions on Parallel and Distributed Systems.

Unicast-Based Multicast Communication in Wormhole-Routed Networks*†

Philip K. McKinley, Hong Xu, Abdol-Hossein Esfahanian, and Lionel M. Ni
Department of Computer Science
Michigan State University
East Lansing, Michigan 48824

January 1992 (Revised December 1993)

Abstract

Multicast communication, in which the same message is delivered from a source node to an arbitrary number of destination nodes, is increasingly in demand in parallel computing. System-supported multicast services can potentially offer improved performance, increased functionality, and simplified programming, and may in turn be used to support various higher-level operations for data movement and global process control.

This paper presents efficient algorithms to implement multicast communication in wormhole-routed direct networks, in the absence of hardware multicast support, by exploiting the properties of the switching technology. Minimum-time multicast algorithms are presented for n-dimensional meshes and hypercubes that use deterministic, dimension-ordered routing of unicast messages. Both algorithms can deliver a multicast message to m − 1 destinations in ⌈log₂ m⌉ message-passing steps, while avoiding contention among the constituent unicast messages. Performance results of implementations on a 64-node nCUBE-2 hypercube and a 168-node Symult 2010 2D mesh are given.

Keywords: Multicast communication, wormhole routing, massively parallel computer, direct network, hypercube, n-dimensional mesh, one-port architecture, dimension-ordered routing.

* This work was supported in part by the National Science Foundation under Grants CDA-9121641, MIP-9204066, CDA-9222901, and CCR-9209873, by the Department of Energy under Grant DE-FG02-93ER25167, and by an Ameritech Faculty Fellowship.
† A preliminary and concise version of this paper was presented at the 1992 International Conference on Parallel Processing.

1 Introduction

Scalable parallel architectures, designed to offer corresponding increases in performance as the number of processors is increased, are seen as a viable platform on which to solve the so-called grand-challenge problems. Such systems, also known as massively parallel computers (MPCs), are characterized by the distribution of memory among an ensemble of nodes. The nodes, each of which has its own processor, local memory, and other supporting devices, are often interconnected by a point-to-point, or direct, network. As the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system also increase.

Efficient communication among nodes is critical to the performance of MPCs. Regardless of how well the computation is distributed among the processors, communication overhead can severely limit speedup. Historically, such systems have supported only single-destination, or unicast, communication. A multicast communication service is one in which the same message is delivered from a source node to an arbitrary number of destination nodes. Broadcast is a special case of multicast in which the same message is delivered to all nodes in the network.

Multicast communication has several uses in large-scale multiprocessors. First, numerous parallel applications, including parallel search algorithms and parallel graph algorithms, have been shown to benefit from the use of multicast services [1, 2]. Second, multicast is useful in the SPMD (Single Program Multiple Data) mode of computation, in which the same program is executed on different processors with different data. Specifically, multicast is fundamental to several operations, such as replication [3] and barrier synchronization [4], that are supported in data parallel languages [5]. Third, if a distributed shared-memory paradigm is supported, then multicast services may be used to efficiently support shared-data invalidation and updating [6].

Most existing MPCs support only unicast communication in hardware. In these environments, multicast must be implemented in software by sending multiple unicast messages. Sending a separate copy of the message from the source to every destination may require excessive time due to a bottleneck at the source processor. An alternative approach is to use a multicast tree of unicast messages. In a multicast tree, the source node actually sends the message to only a subset of the destinations. Each recipient of the message forwards it to some subset of the destinations that have not yet received it. The process continues until all destinations have received the message.
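Under the one-port assumption, the step counts of the two software approaches can be compared with a short calculation. The sketch below is illustrative; the function names are not from the paper.

```python
from math import ceil, log2

def separate_addressing_steps(m: int) -> int:
    """A one-port source sends m - 1 sequential unicasts, one per step."""
    return m - 1

def multicast_tree_steps(m: int) -> int:
    """In a multicast tree the set of informed nodes can at most double
    each step, so m nodes (source plus m - 1 destinations) need at least
    ceil(log2(m)) steps."""
    return ceil(log2(m))

# For a 64-node system (source plus 63 destinations):
print(separate_addressing_steps(64))  # 63
print(multicast_tree_steps(64))       # 6
```

The gap between the two counts grows quickly with the number of destinations, which is why tree-based schemes are attractive despite their added coordination.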

This paper proposes and evaluates new multicast tree algorithms that are designed to take advantage of several architectural properties found in new-generation distributed-memory multiprocessors. Specifically, the systems are assumed to use wormhole-routed n-dimensional mesh networks in which unicast messages are routed deterministically, and in which there exists a single communication channel between each node and the network. By exploiting these features, significant performance improvement may be attained in the absence of hardware multicast support. A lower bound on the number of message-passing steps required to deliver the message from one source node to m − 1 destination nodes in such systems is ⌈log₂ m⌉. The algorithms presented in this paper achieve this bound, while avoiding contention for communication channels among the constituent unicast messages.

The remainder of the paper is organized as follows. Section 2 describes the system model under consideration, and Section 3 illustrates the issues and problems involved in supporting efficient multicast communication in such systems. Section 4 presents new theoretical results that provide the foundation for this work. Section 5 addresses the issue of supporting multicast in n-dimensional meshes that use dimension-ordered routing of unicast messages. Sections 6 and 7 describe methods particular to hypercubes and meshes, respectively. The proposed methods have been implemented on a 64-node nCUBE-2 [7] and a 168-node Symult 2010 [8]; performance measurements are given in each section that compare the algorithms with earlier approaches. Section 8 concludes the paper.

2 System Model

A metric used to evaluate a distributed-memory system is its communication latency, which is the sum of three values: start-up latency, network latency, and blocking time [9]. The start-up latency is the time required for the system to handle the message at both the source and destination nodes. The network latency equals the elapsed time from when the head of a message enters the network at the source until the tail of the message emerges from the network at the destination. The blocking time includes all possible delays encountered during the lifetime of a message, for example, delays due to channel contention in the network. Multicast latency, used to measure multicast performance, is the time interval from when the source processor begins to send the first copy of the message until the last destination processor has received the message. How to minimize multicast latency depends on the particular system architecture. This paper addresses efficient

multicast implementation in a class of architectures characterized by four important properties described below.

First, their topologies are n-dimensional meshes. Formally, an n-dimensional mesh has k_0 × k_1 × ⋯ × k_{n−2} × k_{n−1} nodes, with k_i nodes along each dimension i, where k_i ≥ 2 and 0 ≤ i ≤ n − 1. Each node x is identified by n coordinates, σ_{n−1}(x)σ_{n−2}(x)…σ_0(x), where 0 ≤ σ_i(x) ≤ k_i − 1 for 0 ≤ i ≤ n − 1. Two nodes x and y are neighbors if and only if σ_i(x) = σ_i(y) for all i, 0 ≤ i ≤ n − 1, except one, j, where |σ_j(y) − σ_j(x)| = 1. Several popular topologies for scalable architectures are special cases of the n-dimensional mesh, including the 2D mesh, 3D mesh, and hypercube (k_i = 2 for all i).

Second, in order to reduce network latency and minimize buffer requirements, these systems use the wormhole routing switching strategy [10]. In wormhole routing, a message is divided into a number of flits for transmission. The header flit of a message governs the route, and the remaining flits follow in a pipelined fashion. An important feature of wormhole routing is that the network latency is distance-insensitive when there is no channel contention [9].

Third, communication among nodes is handled by a separate router. As shown in Figure 1, several pairs of external channels connect the router to neighboring routers; the pattern in which the external channels are connected defines the network topology. The router can relay multiple messages simultaneously, provided that each incoming message requires a unique outgoing channel. In addition, two messages may be transmitted simultaneously in opposite directions between neighboring routers.

Each router is connected to its local processor/memory by one or more pairs of internal channels. One channel of each pair is for input, the other for output. The fourth distinguishing characteristic of the class of architectures considered in this paper is that each node possesses exactly one pair of internal channels, as shown in Figure 1, resulting in the so-called "one-port communication architecture" [11]. A major consequence of a one-port architecture is that the local processor must transmit (receive) messages sequentially. Although additional pairs of internal channels can be used to increase communication capacity, the one-port architecture is characteristic of many systems.

Wormhole-routed, n-dimensional mesh architectures include the Symult 2010 [8], Intel Touchstone DELTA [12], and Intel Paragon, which use a 2D mesh topology; the MIT J-machine [13] and Caltech MOSAIC, which use a 3D mesh; and the nCUBE-2 [7] and nCUBE-3 [14], which use a hypercube.
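The formal neighbor condition above translates directly into code. The sketch below (an illustrative helper, not part of the paper) checks whether two coordinate tuples differ in exactly one dimension, and there by exactly 1.

```python
from typing import Sequence

def are_neighbors(x: Sequence[int], y: Sequence[int]) -> bool:
    """Nodes x and y of an n-dimensional mesh are neighbors iff their
    coordinates agree in every dimension except one, where they differ
    by exactly 1."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return diffs.count(0) == len(diffs) - 1 and 1 in diffs

# In a 4 x 4 2D mesh:
print(are_neighbors((2, 1), (1, 1)))  # True  (differ by 1 in one dimension)
print(are_neighbors((2, 1), (1, 2)))  # False (differ in two dimensions)
```

The same predicate covers the hypercube case, since a hypercube is simply the mesh with k_i = 2 in every dimension.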

[Figure 1 depicts a local processor/memory connected to its router by internal input and output channels; external input and output channels connect the router to neighboring routers.]
Figure 1. A generic node architecture

3 The Problem

Although hardware implementations of multicast communication would intuitively offer better performance than software implementations, many such implementations either exhibit undesirable properties or are restricted in their use. For example, the nCUBE-2 [7], which supports broadcast and restricted multicast in hardware, uses a method that may deadlock in hypercubes and 2D meshes if two or more such messages are sent simultaneously [15]. Although deadlock-free routing algorithms for multicast and broadcast have recently been proposed [15] for various multicomputer topologies, such techniques are not yet supported in commercial systems.

In wormhole-routed networks that support only unicast communication, all communication operations must be implemented in software by sending one or more unicast messages. For instance, a multicast operation may be implemented by sending a separate copy of the message from the source to every destination. Depending on the number of destinations, such separate addressing may require excessive time, particularly in a one-port architecture in which a local processor may send only one message at a time. Performance may be improved by organizing the unicast messages as a multicast tree, whereby the source node sends the message directly to a subset of the destinations, each of which forwards the message to one or more other destinations. Eventually, all destinations will receive the message.

The potential advantage of tree-based communication is apparent from the observed performance of various broadcast methods. Figure 2 compares the measured performance of separate addressing and a multicast tree algorithm (specifically, the well-known spanning binomial tree [16]) for subcubes of different sizes on a 64-node nCUBE-2. The message length is fixed at 100 bytes. The tree approach offers substantial performance improvement over separate addressing.

[Figure 2 plots multicast latency (in microseconds, 0 to 7000) against the dimension of the subcube (1 to 6) for separate addressing and a software broadcast tree.]

Figure 2. Comparison of 100-byte broadcasts

The type of communication tree that should be used depends on the switching strategy and unicast routing algorithm. A good multicast tree involves no local processors other than the source and destination processors, exploits the distance-insensitivity of wormhole routing, and is of minimum height, specifically, height k = ⌈log₂ m⌉ for m − 1 destination nodes. Another key requirement is that there be no channel contention among the constituent messages of the multicast, that is, two unicast messages involved should not simultaneously require the same channel. Addressing the practical consideration of channel contention among the constituent messages of multicast operations, and the theory behind the resultant algorithms, distinguishes the approach presented in this paper from previous investigations. How to achieve these goals depends on the switching strategy and unicast routing algorithm of the MPC.

Unicast routing can be classified as deterministic or adaptive. In deterministic routing, the path is completely determined by the source and destination addresses. This method is also referred to as oblivious routing. A routing technique is adaptive if, for a given source and destination, the path taken by a particular packet depends on dynamic network conditions, such as the presence of faulty or congested channels.
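For reference, the spanning binomial tree used in this comparison can be expressed as a unicast schedule on a 2^n-node hypercube. The sketch below is illustrative, assuming node 0 as the source and ascending dimension order; in each step every informed node forwards across one new dimension, doubling the set of informed nodes.

```python
def binomial_broadcast_schedule(n: int, source: int = 0):
    """Return, per step, the list of (sender, receiver) unicasts of a
    spanning binomial tree on a 2^n-node hypercube: in step i, every
    node holding the message forwards it across dimension i - 1."""
    holders, schedule = [source], []
    for dim in range(n):
        sends = [(u, u ^ (1 << dim)) for u in holders]
        schedule.append(sends)
        holders += [v for _, v in sends]
    return schedule

# n = 3: 1, 2, then 4 concurrent unicasts; all 8 nodes reached in 3 steps.
for step, sends in enumerate(binomial_broadcast_schedule(3), start=1):
    print(step, sends)
```

For a source other than node 0, the same tree is obtained by relabeling every node with an XOR of the source address.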

A great deal of research has been conducted in the last few years on the subject of adaptive wormhole routing algorithms. Deadlock avoidance is the key issue in the design of such algorithms. A deadlock occurs when two or more messages are delayed forever due to a cyclic dependency among their requested resources. Because blocked wormhole-routed messages are not buffered and therefore cannot be removed from the network, cyclic dependencies with respect to channel usage must be avoided in order to avoid deadlock. Numerous deadlock-free adaptive unicast routing methods have been proposed [17, 18, 19, 20, 21, 22, 23]. One class of algorithms requires additional (virtual) channels [24] to support the adaptive routing [17, 18, 22]. In this method, the multicomputer network is partitioned into several disjoint acyclic subnetworks, with each subnetwork containing channels that form all of the shortest paths from one node to some other nodes. Recently, several groups have developed adaptive routing algorithms that use only a modest number of virtual channels per physical channel [19, 21, 23]. Another method for adaptive unicast wormhole routing is the turn model [20], which involves analysis of the cycles that can be formed when messages change direction. Cycles are avoided by prohibiting certain "turns," producing a partially adaptive routing algorithm that does not require virtual channels. For a survey of wormhole routing in direct networks, please refer to [9].

In spite of the potential benefits of adaptive wormhole routing, most commercial systems simply use a deterministic routing algorithm that avoids deadlock by imposing an order in which resources, such as channels and buffers, must be acquired by messages. For example, dimension-ordered routing [25] has been adopted in many wormhole-routed n-dimensional mesh systems. In this approach, freedom from deadlock is guaranteed by enforcing a strictly monotonic order on the dimensions of the network traversed by each message. Special cases of dimension-ordered routing include E-cube routing and XY routing for the hypercube and 2D mesh topologies, respectively [9]. Although the user has no control over the routing of individual messages, the designer of a unicast-based collective operation may be able to reduce or eliminate channel contention by accounting for the routing algorithm when defining the set of unicast messages and the order in which they are transmitted.

The following small-scale example illustrates the issues and difficulties involved in implementing efficient multicast communication in wormhole-routed networks that use dimension-ordered routing. Consider the 4 × 4 2D mesh in Figure 3, and suppose that a multicast message is to be sent from node (2,1) to seven destinations: (0,0), (1,0), (2,0), (1,1), (1,2), (3,2), and (1,3).
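Under XY routing, the channels used by a unicast are fully determined by its endpoints, so contention can be predicted in advance. The sketch below is illustrative (it assumes the first coordinate is corrected first, consistent with the paths in this example) and shows two unicasts among the example's nodes that cannot proceed concurrently.

```python
def xy_route(src, dst):
    """Channels traversed under XY (dimension-ordered) routing in a 2D
    mesh: correct the first coordinate fully, then the second."""
    path, cur = [], src
    while cur[0] != dst[0]:
        nxt = (cur[0] + (1 if dst[0] > cur[0] else -1), cur[1])
        path.append((cur, nxt))
        cur = nxt
    while cur[1] != dst[1]:
        nxt = (cur[0], cur[1] + (1 if dst[1] > cur[1] else -1))
        path.append((cur, nxt))
        cur = nxt
    return path

# The unicasts (1,0) -> (1,2) and (2,1) -> (1,3) share a channel:
a = xy_route((1, 0), (1, 2))
b = xy_route((2, 1), (1, 3))
print(set(a) & set(b))  # {((1, 1), (1, 2))}
```

Because both paths must acquire the (1,1)-to-(1,2) channel, one message blocks until the other has drained, which is exactly the kind of conflict a well-designed multicast tree must avoid.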

[Figure 3 shows a 4 × 4 mesh with the source node (2,1); the destination nodes (0,0), (1,0), (2,0), (1,1), (1,2), (3,2), and (1,3) are marked.]

Figure 3. An example of multicast in a 4 × 4 mesh

In early direct network systems that used store-and-forward switching, the procedure shown in Figure 4(a) could be used. At step 1, the source sends the message to node (1,1). At step 2, nodes (2,1) and (1,1) inform nodes (2,0) and (1,2), respectively. Continuing in this fashion, this implementation requires 4 steps to reach all destinations. Node (2,2) is required to relay the message, even though it is not a destination. Using the same routing strategy in a wormhole-routed network also requires 4 steps, as shown in Figure 4(b). In this case, however, only the router at node (2,2) is involved in forwarding the message. Hence, the message may be passed from (2,1) to (3,2) in one step, and no local processors other than the source and destinations are involved in sending the message.

In Figure 4(c), the branches of the tree are rearranged to take advantage of the distance insensitivity of wormhole routing. The local processor at each destination receives the message exactly once, although the routers at some other intermediate nodes are involved in the multicast. Using this method, the number of steps is apparently reduced to 3. However, closer inspection reveals that the messages sent from node (1,0) to node (1,2) and from node (2,1) to node (1,3) in step 3 use a common channel, namely, the (1,1)-to-(1,2) channel. Consequently, these two unicasts cannot take place during the same step. Since one of the messages will be blocked until the other completes, this tree again actually requires 4 steps.

This situation is rectified in Figure 4(d), where only 3 steps are required. No local processors other than the source and destinations are involved, and the messages sent within a particular step do not contend for common channels. In practice, however, the message-passing steps of the multicast operation may not be ideally synchronized, and contention may arise among messages

[Figure 4 shows five unicast-based software multicast trees for the example, with each unicast labeled [i] for the step i in which it is sent; panel (d) is annotated with message injection times such as t_S, t_S + t_N, 2t_S + t_R + t_N, and 3t_S:
(a) A multicast tree based on store-and-forward switching
(b) A multicast tree based on wormhole routing
(c) Collision occurs in step 3 at the channel between (1,1) and (1,2)
(d) Collision may occur at the channel between (1,1) and (1,2) if t_S = 2t_R = 2t_N
(e) Collision-free multicast tree]

Figure 4. Unicast-based software multicast trees

sent in different steps. As indicated in Section 2, startup latency includes system call time at both the source and destination nodes; these latencies are termed the sending latency, t_S, and receiving latency, t_R, respectively.

Depending on the values of these latencies and that of the network latency, t_N, messages can be sent concurrently, but in different steps. In Figure 4(d), for example, the message sent from

node (2,1) to node (1,0) in the first step leaves the source at time t_S, having incurred sending latency at that node, and enters node (1,0) at time t_S + t_N. The message incurs receiving latency at node (1,0), which then forwards a copy of the message in the second step to node (1,2). After another sending latency, the message leaves node (1,0) at time 2t_S + t_R + t_N. In the second step, the source node (2,1) incurs another sending latency when it sends to node (3,2). In the third step, when the source sends to node (1,3), that message begins entering the network at time 3t_S. Now consider a scenario in which t_S = 2t_R = 2t_N. Under these conditions, the message from node (1,0) to (1,2) and the message from node (2,1) to (1,3) enter the network at the same time, 3t_S, and contention will occur on the (1,1)-to-(1,2) channel. The multicast tree in Figure 4(e), which is based on the methods presented in the following sections, is contention-free regardless of message length or startup latency.

4 Theoretical Results

In order to formally describe unicast-based multicast services, the underlying topology of the network can be represented by a directed graph, or digraph, G(V, E) with node set V(G) and arc set E(G). A node u ∈ V(G) represents the processor u with its router, and an arc (u, v) ∈ E(G) represents the unidirectional link from router u to router v. Since the systems considered in this paper support the simultaneous transmission of messages between two adjacent nodes, the digraphs used to represent such systems are symmetric, that is, for every arc (u, v) ∈ E(G), the arc (v, u) ∈ E(G). An alternating sequence v_0 e_1 v_1 e_2 … v_{k−1} e_k v_k of nodes and distinct arcs, with (v_{i−1}, v_i) ∈ E, is called a directed trail from v_0 to v_k, also referred to as a (v_0, v_k)-ditrail. A directed path, or dipath, is a ditrail whose nodes are distinct. A digraph G is strongly connected if, for every ordered pair of nodes u and v in G, there exists a (u, v)-dipath in G. Any MPC topology, including the n-dimensional mesh, must be strongly connected in order to allow every node to send messages to any other node.

A unicast operation can be defined as an ordered quadruple (u, v, P(u, v), t), where u and v are the source and destination nodes respectively, P(u, v) is a given (u, v)-ditrail in G over which the message will be routed, and t is the message-passing step at which the unicast is to take place. In this paper, it is assumed that each unicast operation requires one unit of time whose duration is independent of the path length. This assumption is reasonable for multicasting in present wormhole-routed systems because (1) all copies of the multicast message are the same length; (2) the latency

of short messages is dominated by the startup latency; and (3) the latency of long messages is dominated by the network latency, which is insensitive to path length. As a result, a unicast that begins at step t should terminate at step t + 1.

In modeling a one-port architecture, where a node may send (receive) only one message at a time, two unicasts (u, v, P(u, v), t) and (x, y, P(x, y), τ), with t = τ, are called feasible if the nodes u, v, x, and y are all distinct. Feasibility implies that the two messages will not contend for the same internal channel, or port. A set of unicasts U_t = {(u_1, v_1, P(u_1, v_1), t), (u_2, v_2, P(u_2, v_2), t), …, (u_k, v_k, P(u_k, v_k), t)}, whose members are pairwise feasible, is called a feasible unicast set. A multicast request can be represented by a set M = {d_0, d_1, d_2, …, d_{m−1}}, where node d_0 represents the source processor of the multicast and nodes d_1, d_2, …, d_{m−1} represent its destination processors.

Definition 1. An implementation I(M) of a unicast-based multicast request M is a sequence of feasible unicast sets U_1, U_2, …, U_k satisfying the following conditions:

1. For each j, 1 ≤ j ≤ k, if (u, v, P(u, v), j) ∈ U_j, then both u and v belong to M.
2. The set U_1 = {(d_0, u, P(d_0, u), 1)}, where u = d_i for some i, 1 ≤ i ≤ m − 1.
3. For every unicast (u, v, P(u, v), t) ∈ U_t, 1 < t ≤ k, there must exist a set U_j with j < t which has (w, u, P(w, u), j) as a member for some node w.
4. For every destination d_i, 1 ≤ i ≤ m − 1, there exists one and only one integer j such that 1 ≤ j ≤ k and (w, d_i, P(w, d_i), j) appears in U_j for some node w.

The first condition in Definition 1 guarantees that only the source and destination processors of the given message are involved in the implementation; other local processors in the system are unaffected, although their routers may be required to relay the message. The second condition states that the first step of the implementation involves a single unicast from the source to one of the destinations. The third condition ensures that a destination processor has received the message before it may forward the message to another destination processor. Finally, the fourth condition guarantees that every destination processor receives the message exactly once.

As defined above, any implementation will require at least ⌈log₂ m⌉ steps to reach the m − 1 destinations. That is, using the above formulation, there does not exist an I(M) with k < ⌈log₂ m⌉. This can be seen by observing that the definition of a unicast set and Condition 3 above imply that the total number of destinations receiving the message cannot increase by more than a factor of two

during each step. An implementation requiring exactly ⌈log₂ m⌉ steps is referred to as a minimum-time implementation. A natural question at this stage is the following: given an arbitrary multicast request M, does there exist a minimum-time implementation I(M) such that channel contention among the constituent messages is avoided? This question is addressed by considering separately contention among messages sent in the same step and contention among messages sent in different steps, but which may be transmitted concurrently due to startup latency.

Two feasible unicast operations (u, v, P(u, v), t) and (x, y, P(x, y), τ), with t = τ, are called stepwise contention-free if the ditrails P(u, v) and P(x, y) are arc-disjoint. A multicast implementation is called stepwise contention-free if the elements in each unicast set U_i are pairwise stepwise contention-free. For example, the multicast trees in Figures 4(d) and 4(e) are stepwise contention-free.

Given a source and destination pair, there may exist many routes between them. If any of the routes may be taken, the system is said to use free routing. Otherwise, routing is said to be restricted. Clearly, contention is more easily avoided in a system that supports free routing than in one that supports restricted routing. In fact, if the system supports free routing, then a stepwise contention-free multicast implementation can be found for any source and destination set, regardless of topology.

Theorem 1. Let G(V, E) be a digraph representing an MPC network as defined above, and let M = {d_0, d_1, …, d_{m−1}} be a multicast request issued in G. Then there exists a minimum-time stepwise contention-free implementation for M, provided that G admits free unicast routing.

Proof (sketch): The proof is by construction. According to the MPC model described above, the digraph G(V, E) is both strongly connected and symmetric. A strongly connected digraph G is Eulerian if it contains an Euler ditrail, that is, a ditrail that visits every arc in the digraph exactly once. It is known that a strongly connected digraph is Eulerian if and only if the in-degree of each node equals the out-degree of that node, a property that holds for G since it is symmetric. As a consequence, there must exist a ditrail T in G with origin d_0 that traverses every arc in G exactly once, and therefore traverses every destination in M at least once. This ditrail T will be used for routing unicast messages; that is, every unicast message in the implementation will be sent along the route corresponding to T.

Figure 5 illustrates the first three steps of an implementation. Without loss of generality, assume that m = 2^k for some k, and consider the first occurrence of each destination along T. (In case 2^{k−1} < m < 2^k, add the appropriate number of "dummy" destinations to M, and assume that they are at the end of T.) In the first step of the multicast implementation, node d_0 sends a copy of the message along T to node d_{m/2}. In the second step, node d_0 sends a copy of the message to node d_{m/4}, while node d_{m/2} sends a copy to node d_{3m/4}, with both messages traveling along T. In the third step, node d_0 sends to node d_{m/8}, node d_{m/4} sends to node d_{3m/8}, node d_{m/2} sends to node d_{5m/8}, and node d_{3m/4} sends to node d_{7m/8}. This recursive doubling process continues until all destinations have received the message.

[Figure 5 illustrates the construction: (a) an Eulerian ditrail in G with origin d_0 passing through d_0, …, d_{m−1}; (b) one unicast in step 1, from d_0 to d_{m/2}; (c) two unicasts in step 2, reaching d_{m/4} and d_{3m/4}; (d) four unicasts in step 3, reaching d_{m/8}, d_{3m/8}, d_{5m/8}, and d_{7m/8}.]

Figure 5. Multicast implementation using Euler ditrail

It is not difficult to verify that the sequence of unicast messages is an implementation I(M) of the multicast request M, as it satisfies all the conditions given in Definition 1. Furthermore, since the implementation completes in k = log₂ m steps, it is a minimum-time implementation.

Now we consider the potential for contention among the unicast messages constituting the multicast implementation. Those messages transmitted in different steps are stepwise contention-free by definition. Those messages that are transmitted during the same step are arc-disjoint, since each message uses channels in a different segment of the ditrail T. Therefore, the implementation is stepwise contention-free. This completes the proof of the theorem. □

Theorem 1 applies to any wormhole-routed MPC whose topology can be modeled with a symmetric digraph. In other words, if a message is allowed to follow any route to its destination, then a multicast "tree" can be constructed such that the constituent messages are stepwise contention-free and such that, after ⌈log₂ m⌉ steps, all destinations will have received the message. Of course, practical wormhole-routed systems do not permit free routing, which would risk deadlock. However, Sections 5 through 7 will show how the same basic strategy can be used to construct minimum-time multicast implementations in some popular topologies, even though the underlying routing is restricted.

Besides restrictions on underlying routing algorithms, other practical considerations arise in implementations of multicast for actual parallel systems. As illustrated in Figure 4(d), even though an implementation may be stepwise contention-free, contention may still arise if the network latency is small and the receiving latency is large, causing unicasts in different steps to actually be sent simultaneously. This condition is possible in commercial systems, where the sending and receiving latencies may be much greater than the network latency of a message. In order to study contention
In order to study contentionbetween messages sent in di�erent steps, the de�nition of the the reachable set is needed.De�nition 2 Given a multicast implementation I(M) = fU1; U2; : : : ; Ukg, a node v is in the reach-able set of a node u, denoted Ru, if and only if v = u or there exists a j, 1 � j � k, such that(w; v; P (w; v); j)2 Uj for some node w 2 Ru.If the implementation I(M) is thought of as a directed tree of unicast messages rooted at thesource node d0, then the reachable set of a node u is the set of nodes in the subtree rooted atnode u. In Figure 4(e), for example, R(1;2) = f(1; 2); (1; 1); (1; 0); (0; 0)g. Using this de�nition, theproperties of an implementation necessary to avoid contention between messages sent in di�erentsteps can be characterized. A multicast implementation is said to be depth contention-free if,regardless of overlap in message passing steps caused by startup latency, the constituent messagesare contention-free. An implementation is stepwise contention-free if it is depth contention-free, butthe converse does not hold. The following theorem gives su�cient conditions for an implementationto be depth contention-free. 13

Theorem 2  Given a multicast implementation I(M), if at least one of the following four conditions holds for every pair of unicasts (u, v, P(u, v), t) and (x, y, P(x, y), τ) in I(M), where t ≤ τ, then I(M) is depth contention-free.

1. x ∈ R_v.
2. P(u, v) and P(x, y) are arc-disjoint.
3. x = u.
4. x ∈ R_w and (u, w, P(u, w), t + k) ∈ I(M), for some node w and positive integer k.

Proof: We need to show that contention does not arise between any pair of unicast messages in the implementation. Consider two arbitrary unicasts (u, v, P(u, v), t) and (x, y, P(x, y), τ), with t ≤ τ.

Condition 1. If x ∈ R_v, as shown in Figure 6(a), then the u-to-v unicast must be completed before the x-to-y unicast begins, so they are contention-free. (Note: u ∉ R_y since t ≤ τ.)

Condition 2. If the two paths of the messages, P(u, v) and P(x, y), are arc-disjoint, then the two unicasts are contention-free.

Condition 3. If x = u, then t < τ since u can send only one message at a time. Figure 6(b) illustrates the situation. Node u sends the message to v before sending it to y. Even if τ = t + 1 and the sending latency is 0, contention will not occur.

Condition 4. As shown in Figure 6(c), node u sends the message to v prior to sending it to node w, which is either an ancestor of x or perhaps x itself. Clearly, node v will have received the message prior to node x, preventing contention. □

Consider again the two potentially conflicting unicasts indicated in Figure 4(d). Using the notation above, the two unicasts are ((1,0), (1,2), P((1,0), (1,2)), 2) and ((2,1), (1,3), P((2,1), (1,3)), 3), where u = (1,0), v = (1,2), x = (2,1), y = (1,3), t = 2, and τ = 3. In the tree shown in Figure 4(d), node (2,1) is not in the subtree rooted at (1,2), that is, (2,1) ∉ R_(1,2). Obviously, the two paths P((1,0), (1,2)) and P((2,1), (1,3)) contain a common channel, so they are not arc-disjoint. Also, the two messages start at different sources since (2,1) ≠ (1,0). Finally, node (2,1) is not a descendant of node (1,0). Therefore, none of the four conditions in Theorem 2 is satisfied, leaving open the possibility for contention between the two messages.
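The case analysis above can be mechanized. The sketch below tests the four sufficient conditions of Theorem 2 for a pair of unicasts; the XY paths for the Figure 4(d) pair are reconstructed by hand under one plausible orientation of the mesh, and the minimal reachable-set table is an assumption, so this illustrates the logic rather than reproducing the figure.

```python
def arcs(path):
    """Directed channels (arcs) used by a path, as consecutive node pairs."""
    return set(zip(path, path[1:]))

def depth_contention_safe(m1, m2, impl, reach):
    """True if the unicast pair (m1 at step t, m2 at step tau >= t)
    satisfies one of the four sufficient conditions of Theorem 2.
    `reach` maps a node to its reachable set R_node."""
    u, v, p_uv, t = m1
    x, y, p_xy, tau = m2
    assert t <= tau
    if x in reach.get(v, {v}):                      # Condition 1: x in R_v
        return True
    if not (arcs(p_uv) & arcs(p_xy)):               # Condition 2: arc-disjoint
        return True
    if x == u:                                      # Condition 3: same sender
        return True
    for a, w, _p, step in impl:                     # Condition 4: u later sends
        if a == u and step > t and x in reach.get(w, {w}):
            return True                             #   toward an ancestor of x
    return False

# The two unicasts discussed for Figure 4(d); XY paths reconstructed by hand.
m1 = ((1, 0), (1, 2), [(1, 0), (1, 1), (1, 2)], 2)
m2 = ((2, 1), (1, 3), [(2, 1), (1, 1), (1, 2), (1, 3)], 3)
```

With these reconstructed paths the two routes share the channel from (1,1) to (1,2), and no condition holds, so the checker reports the pair as unsafe, consistent with the discussion above.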

Figure 6. Illustrations of Conditions 1, 3, and 4 of Theorem 2

In order to attain the practical design goals discussed in Section 3, an implementation must satisfy two requirements: it must be completed in minimum time, that is, it must require only k = ⌈log_2 m⌉ message-passing steps, and it must be depth contention-free. The tree in Figure 4(e) is depth contention-free.

5 Dimension-Ordered Chains

Developing an algorithm that produces minimum-time, contention-free multicast implementations for a specific system requires a detailed understanding of potential conflicts among messages, which in turn depend on the routing algorithm used. Instead of free routing, most commercial machines offer only restricted unicast routing, particularly wormhole-routed systems, where deadlock would result if free routing were used [9]. This section formulates a method to avoid contention among unicast messages under the most common form of restricted routing for wormhole-routed n-dimensional mesh systems, namely, dimension-ordered routing.

A few preliminaries are in order. A node address x in a finite n-dimensional mesh is represented by σ_{n-1}(x)σ_{n-2}(x)...σ_0(x). The distance, or number of hops, between two nodes x and y is defined as δ(x, y) = Σ_{i=0}^{n-1} |σ_i(y) − σ_i(x)|. Under a minimal deterministic routing algorithm, all messages transmitted from a node x to a node y will follow a unique shortest path between the two nodes. Let us represent such a path as P(x, y) = (x, z_1, z_2, ..., z_k, y), where the z_i's are the sequence of intermediate routers visited by the message. In order to simplify later proofs, let z_0 = x and z_{k+1} = y. Dimension-ordered routing is a minimal deterministic routing algorithm in which every message traverses dimensions of the network in a strict monotonic order. Under dimension-ordered

routing, each routing step brings the message one hop closer to the destination, along the lowest dimension in which the current node and the destination node differ.

Definition 3  Given nodes x and y in an n-dimensional mesh, let i be the lowest dimension such that σ_i(x) ≠ σ_i(y). Under dimension-ordered routing, a message sent from x to y will be routed first along dimension i to intermediate node z = σ_{n-1}(x)σ_{n-2}(x)...σ_{i+1}(x)σ_i(z)σ_{i-1}(y)...σ_0(y), where |σ_i(z) − σ_i(x)| = 1 and |σ_i(y) − σ_i(z)| = |σ_i(y) − σ_i(x)| − 1. At node z, the same routing algorithm is invoked to determine the next intermediate node.

In this paper, it is assumed that dimension-ordered routing "resolves" dimensions from lowest to highest. For example, a message sent from node 101101 to node 001010 in a 6-cube will traverse the path P(101101, 001010) = (101101, 101100, 101110, 101010, 001010). Of course, dimension-ordered routing can also be defined to select the highest dimension in which the source and destination differ. The multicast algorithms described subsequently can be used with either underlying routing technique.

In order to characterize contention among messages transmitted under dimension-ordered routing, an ordering on nodes in an n-dimensional mesh is needed. The multicast algorithms described herein are based on lexicographic ordering of the source and destination nodes according to their address components. Actually, two such orderings are possible: one in which the subscripts of address components increase from right to left, and another in which the subscripts are reversed. Which ordering is appropriate for multicasting in a given system depends on whether addresses are resolved, under dimension-ordered routing, in a top-to-bottom or bottom-to-top manner.
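A lowest-dimension-first route per Definition 3 can be sketched as follows; addresses are taken as digit strings written σ_{n-1}...σ_0 (one digit per dimension), which is an assumption of this sketch:

```python
def dimension_ordered_route(x, y):
    """Hops of a minimal dimension-ordered route, lowest dimension first.
    Dimension 0 is the rightmost character, as in the paper's notation."""
    cur, dst = [int(d) for d in x], [int(d) for d in y]
    path = [x]
    for pos in range(len(cur) - 1, -1, -1):   # rightmost position = dim 0
        while cur[pos] != dst[pos]:           # one hop per mesh step
            cur[pos] += 1 if dst[pos] > cur[pos] else -1
            path.append("".join(map(str, cur)))
    return path
```

For the 6-cube example in the text, `dimension_ordered_route("101101", "001010")` reproduces the path (101101, 101100, 101110, 101010, 001010).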
We will refer to the ordering relation as dimension order.

Definition 4  The binary relation "dimension order," denoted <_d, is defined between two nodes x and y as follows: x <_d y if and only if either x = y or there exists an integer j such that σ_j(x) < σ_j(y) and σ_i(x) = σ_i(y) for all i, 0 ≤ i ≤ j − 1.

In a 5-cube, for example, 00000 <_d 10100 <_d 10010. Since <_d is simply lexicographic ordering, it is a total ordering on the nodes in an n-dimensional mesh. Therefore, it is reflexive, antisymmetric, and transitive. Given a set of node addresses, they can be arranged in a unique, ordered sequence according to the <_d relation.

Definition 5  A sequence of nodes x_1, x_2, x_3, ..., x_m is a dimension-ordered chain if and only if all the elements are distinct and either (1) x_i <_d x_{i+1} for 1 ≤ i < m or (2) x_i <_d x_{i-1} for 1 < i ≤ m.
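Because <_d compares digits from dimension 0 (the rightmost position) upward, it reduces to ordinary lexicographic order on the reversed address string. A sketch, again assuming one digit per dimension:

```python
def dless(x, y):
    """x <_d y per Definition 4; note the relation also holds when x == y."""
    return x[::-1] <= y[::-1]

def dimension_ordered_chain(nodes):
    """Arrange distinct addresses into their unique increasing chain."""
    return sorted(nodes, key=lambda a: a[::-1])
```

This recovers the 5-cube example above: 00000 <_d 10100 <_d 10010, even though 10100 > 10010 as ordinary binary numbers.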

The following lemmas address contention among messages sent between nodes whose addresses are arranged as a dimension-ordered chain.

Lemma 1  If u <_d v <_d x <_d y, then dimension-ordered routes P(u, v) and P(x, y) are arc-disjoint.

Proof: The proof is by contradiction. Assume that there exists a common arc (r, s) in P(u, v) and P(x, y). Let h be the dimension in which (r, s) travels. Since the arc (r, s) is in the path P(u, v), then according to dimension-ordered routing, σ_i(r) = σ_i(v) for 0 ≤ i < h. Similarly, σ_i(r) = σ_i(y) for 0 ≤ i < h.

Thus, σ_i(v) = σ_i(y) for 0 ≤ i < h. Since v <_d x <_d y, then by the definition of dimension order, σ_i(v) = σ_i(x) = σ_i(y) for 0 ≤ i < h. Moreover, σ_h(v) ≤ σ_h(x) < σ_h(y) by the definition of dimension order and the presence of dimension-h arc (r, s) in P(x, y). Since σ_h(x) < σ_h(y), the consecutive dimension-h arcs in P(x, y) will resolve σ_h(x) to σ_h(y) in increasing order. So σ_h(s) = σ_h(r) + 1, implying that σ_h(s) > σ_h(x).

By similar reasoning, since σ_h(r) < σ_h(s) and (r, s) lies on P(u, v), it follows that σ_h(s) ≤ σ_h(v). But since σ_h(v) ≤ σ_h(x), by transitivity σ_h(s) ≤ σ_h(x), which contradicts the result that σ_h(s) > σ_h(x). Therefore, the assumption that there exists a common arc (r, s) in P(u, v) and P(x, y) does not hold, and the lemma is proved. □

Lemma 2  If u <_d v <_d x <_d y, then P(y, x) and P(v, u) are arc-disjoint.

Proof: The proof is nearly identical to that used to prove Lemma 1. □

Lemmas 1 and 2 are critical to the development of efficient multicast algorithms because they indicate how channel contention may be avoided. The chain algorithm is a distributed algorithm that can be used to multicast a message from a source node to one or more destinations. The algorithm applies to situations in which the address of the source node is either less than or greater than those of all the destinations, according to the <_d relation. Figure 7 gives the chain algorithm executed at each node.
The source address and the destination addresses are arranged as a dimension-ordered chain in either increasing or decreasing order, with the source node occupying the position at the left end of the chain. Similar to the multicast implementation approach described in Section 4, the source node sends first to the destination node halfway across the chain, then to the destination node one quarter of the way across the chain, and so on. Each destination receives a copy of the message from its parent in the tree and may be responsible for forwarding the message to other destinations. The message carries the addresses of those nodes to be in the subtree rooted at the receiving node. (Alternatively, a compiler or run-time software could determine the subtrees for intermediate recipients of the message, allowing multicast routing tables to be constructed when the application begins execution [3].) Unlike the construction in the proof of Theorem 1, which assumed that free routing was possible, the chain algorithm is designed to produce minimum-time multicast implementations atop dimension-ordered unicast routing.

Algorithm 1: The Chain Algorithm
Input: Dimension-ordered chain {d_left, d_{left+1}, ..., d_right}, where d_left is the local address.
Output: Send ⌊log_2(right − left + 1)⌋ messages.
Procedure:
  while left < right do
    center = left + ⌈(right − left)/2⌉;
    D = {d_center, d_{center+1}, ..., d_right};
    Send a message to node d_center with the address field D;
    right = center − 1
  endwhile

Figure 7. The chain algorithm for multicast

Figure 8 shows a multicast implementation resulting from the chain algorithm in a 4 × 4 2D mesh that uses XY routing. The set of nodes involved is the same as in Figure 3; however, node (0,0) is the source rather than node (2,1). The source node and the seven destinations have been arranged as a dimension-ordered chain. In this case, the X dimension is considered the low-order dimension, and the Y dimension is considered the high-order dimension. Although some messages are passed through multiple routers before reaching their destinations, it turns out that channel contention will not occur among the messages, regardless of message length or startup latency.

Theorem 3  The multicast implementation resulting from the chain algorithm is a minimum-time, depth contention-free implementation.

Proof: From construction of the algorithm, each destination node receives the message exactly once.
For m − 1 destinations, the source node sends at most ⌈log_2 m⌉ messages sequentially. A destination node receiving the message that was sent in the ith step will send at most ⌈log_2 m⌉ − i messages. Therefore, the height of the multicast tree, that is, the number of steps required to reach all destinations, is ⌈log_2 m⌉, which is minimum for m − 1 destinations.

In order to prove that the algorithm is depth contention-free, it must be shown that every pair of messages in the implementation satisfies at least one of the four conditions in Theorem 2. Here we consider only the case where the address of the source is <_d those of the destinations; the proof for the other case is similar.

Consider any two constituent unicast messages transmitted in the multicast operation, (u, v, P(u, v), t) and (x, y, P(x, y), τ). Without loss of generality, assume that t ≤ τ. If x ∈ R_v, then Condition 1 of Theorem 2 is satisfied; the two messages cannot simultaneously require the same channel. If x ∉ R_v, then two cases must be considered, namely, t = τ and t < τ. If t = τ, then either u <_d v <_d x <_d y or x <_d y <_d u <_d v by construction of the algorithm; by Lemma 1, P(u, v) and P(x, y) are arc-disjoint, satisfying Condition 2 of Theorem 2.

If t < τ and x = u, then Condition 3 of Theorem 2 is satisfied. If t < τ and x ∈ R_u but x ≠ u, then two cases are permitted by construction of the chain algorithm. If u <_d v <_d x <_d y, then the two messages are arc-disjoint by Lemma 1, and Condition 2 of Theorem 2 is satisfied. If u <_d x <_d y <_d v, then Condition 4 of Theorem 2 is satisfied since u must send to x or an ancestor of x after sending to v.

Finally, if t < τ and x ∉ R_u, then either u <_d v <_d x <_d y or x <_d y <_d u <_d v, because all destinations between (according to <_d) u and v are in R_u; by Lemma 1, P(u, v) and P(x, y) are arc-disjoint and, again, Condition 2 of Theorem 2 is satisfied. Thus, the chain algorithm produces a depth contention-free multicast implementation. □

6 Multicast in Hypercubes

As presented above, the chain algorithm is only applicable to those cases in which the source address is less than or greater than (according to <_d) all the destination addresses. Clearly, this situation does not hold in general. For a hypercube network in which E-cube routing is used, it is straightforward to construct a depth contention-free multicast algorithm using the chain algorithm. Specifically, the symmetry of the hypercube topology effectively allows the source node to play the role of the first node in a dimension-ordered chain. The exclusive-or operation, denoted ⊕, is used to carry out this task.
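A sketch of this construction (the helper names are ours): XOR each destination with the source address d_0, sort by dimension order, and let the source transmit according to the chain algorithm of Figure 7.

```python
def ucube_chain(source, dests):
    """Sort destinations into the d_0-relative dimension-ordered chain,
    prefixed by the source.  Addresses are binary strings; <_d compares
    the rightmost bit first, hence the reversed-string sort key."""
    n, s = len(source), int(source, 2)
    key = lambda d: format(int(d, 2) ^ s, "0{}b".format(n))[::-1]
    return [source] + sorted(dests, key=key)

def chain_sends(chain):
    """Unicasts issued by chain[0] under the chain algorithm: first to the
    node halfway across the chain, then a quarter of the way, and so on."""
    left, right, sends = 0, len(chain) - 1, []
    while left < right:
        center = left + (right - left + 1) // 2  # left + ceil((right-left)/2)
        sends.append((chain[center], chain[center:right + 1]))
        right = center - 1
    return sends
```

For the 5-cube example discussed with Figure 9 (source 11010), `ucube_chain` recovers the chain 11010, 01110, 01000, 11100, 11011, 00001, 01101.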

Figure 8. Multicast chain example in a 4 × 4 mesh

Definition 6  A sequence d_1, d_2, ..., d_{m-1} of hypercube addresses is called a d_0-relative dimension-ordered chain if and only if d_0 ⊕ d_1, d_0 ⊕ d_2, ..., d_0 ⊕ d_{m-1} is a dimension-ordered chain.

Let d_0 be the address of the source of a multicast with m − 1 destinations. The source can easily sort the m − 1 destinations into a d_0-relative dimension-ordered chain, Φ = d_1, d_2, ..., d_{m-1}. The source may then execute the chain algorithm using Φ instead of the original addresses. The multicast tree resulting from this method is called a Unicast-cube, or U-cube, tree [4]. An interesting and useful property of the U-cube tree involves broadcast: the well-known spanning binomial tree [11] is a special case of the U-cube tree when the source node and all destinations form a subcube.

Figure 9 gives an example of the U-cube algorithm in a 5-cube. The source node 11010 is sending to a set of six destinations {00001, 01000, 01101, 01110, 11011, 11100}. Taking the exclusive-or of each destination address with 11010 and sorting the results produces the (11010)-relative dimension-ordered chain Φ = 11010, 01110, 01000, 11100, 11011, 00001, 01101. The corresponding U-cube tree is shown in Figure 9. It takes 3 steps for all destination processors to receive the message.

Theorem 4  The implementation constituting a U-cube tree is a minimum-time, depth contention-free implementation.

Proof: Consider the multicast tree T′ resulting from relabeling each node i in the hypercube with i ⊕ d_0, where the source node is 0 and the destination nodes are d_i ⊕ d_0 for i = 1, ..., m − 1. By

Figure 9. Multicast chain example in a 5-cube

Theorem 3, this tree is a depth contention-free, minimum-time implementation. By symmetry of the hypercube topology, there exists a homomorphism between the nodes and channels comprising T′ and the U-cube tree constructed for the original source and destination addresses constituting the d_0-relative dimension-ordered chain. Therefore, the U-cube tree is also a depth contention-free, minimum-time implementation. □

Figures 10 and 11 compare three implementations of the multicast operation in a 64-node nCUBE-2: separate addressing, LEN tree, and U-cube tree. A LEN tree [26] is constructed using a distributed greedy algorithm that was originally designed for hardware implementation in networks with virtual cut-through switching [27]. These factors affect its suitability as a software tree in wormhole-routed direct networks. The U-cube tree takes advantage of the distance-insensitivity of wormhole routing and avoids the use of processors that are neither the source nor one of the destinations. Although the nCUBE-2 actually provides a multiple-port communication architecture [7], both the LEN and U-cube algorithms will often take advantage of multiple ports inadvertently by sending subsequent messages in different dimensions.

The destinations were randomly chosen in these experiments. The plotted results represent the average over a large number of sample destination sets. Figure 10 compares the LEN tree and the U-cube tree in terms of the number of local processors involved in the multicast operation. While the U-cube tree is optimal in this regard, the LEN tree requires additional local processors to forward messages. In this particular set of tests, the difference between the two algorithms is approximately constant over the range of destination set sizes.

Figure 10. Multicast tree in a 64-node nCUBE-2

Figure 11 plots the average multicast latency for a message length of 10 bytes. The multicast latency of separate addressing increases linearly with the number of destinations. The slope of the curve is approximately 95. In separate addressing, the receiving latency of every destination except the last one is overlapped with the sending latency for the next destination. Therefore, this result is consistent with our measured sending latency of 95 microseconds. The U-cube performance follows a curve approximately equal to 150 log_2(m), where m − 1 is the number of destinations. Since the measured receiving latency is 75 µsec, one would expect the formula to be 170 log_2(m) on a one-port architecture. However, since the nCUBE-2 offers multiple ports, overlap of message transmissions effectively reduces the height of the tree, resulting in lower multicast latency. The advantage of the U-cube tree over the LEN tree varies from approximately 300 µsec for 9 destinations to 150 µsec for 29 destinations. The advantage in terms of additional processors (Figure 10) is approximately 5 processors in all cases. For smaller trees, the inclusion of these extra nodes has a greater effect on tree height than in larger trees, resulting in the multicast latency curves being closer for larger numbers of destinations.

Figure 11. Multicast latency in a 64-node nCUBE-2

7 Multicast in Meshes

Unlike hypercubes, general n-dimensional meshes are not symmetric. The source address may lie in the middle of a dimension-ordered chain of destination addresses, but the exclusive-or operation is not applicable in the implementation of depth contention-free communication. However, another relatively simple method may be used, again based on the chain algorithm, to address this problem.

The U-mesh algorithm is given in Figure 12. The source and destination addresses are sorted into a dimension-ordered chain, denoted Φ, at the time when multicast is initiated by calling the U-mesh algorithm. The source node successively divides Φ in half. If the source is in the lower half, then it sends a copy of the message to the smallest node (with respect to <_d) in the upper half. That node will be responsible for delivering the message to the other nodes in the upper half, using the same U-mesh algorithm. If the source is in the upper half, then it sends a copy of the message to the largest node in the lower half. In addition to the data, each message carries the addresses of the destinations for which the receiving node is responsible. At each step, the source deletes from Φ the receiving node and those nodes in the half not containing the source. The source continues this procedure until Φ contains only its own address. Note that if the source happens to lie at the beginning or end of Φ, then the U-mesh algorithm degenerates to the chain algorithm.

In addition, when executed at an intermediate node in the tree, the U-mesh algorithm is, again, simply the chain algorithm.

Algorithm 2: The U-mesh Algorithm
Inputs: Φ: dimension-ordered chain {D_left, D_{left+1}, ..., D_right} for source and destinations
        D_source: the address of the source node
Output: Send ⌊log_2(right − left + 1)⌋ messages.
Procedure:
  while left < right do
    if source < (left + right)/2 then  /* send right */
      center = left + ⌈(right − left)/2⌉;
      D = {D_center, D_{center+1}, ..., D_right};
      right = center − 1;
    else if source > (left + right)/2 then  /* send left */
      center = left + ⌊(right − left)/2⌋;
      D = {D_left, ..., D_{center-1}, D_center};
      left = center + 1;
    else  /* send left */
      center = source − 1;
      D = {D_left, ..., D_{center-1}, D_center};
      left = source;
    endif
    Send a message to node D_center with the address field D;
  endwhile

Figure 12. The U-mesh algorithm

Figure 13 depicts a multicast in a 6 × 6 2D mesh. Node (3,3) is the source of a multicast message destined for the 16 shaded nodes. Figure 14 shows the result of the U-mesh algorithm for this example; intermediate routers are not shown. The source begins with a dimension-ordered chain Φ = (0,1), (0,2), (0,4), (1,0), (1,3), (1,5), (2,0), (2,2), (2,3), (2,5), (3,0), (3,2), (3,3), (3,4), (3,5), (4,1), (5,2). As shown in Figure 14, the source (3,3) first sends to node (2,3), the node with the highest address in the lower half of Φ. The lower half is deleted from Φ, and therefore the nodes remaining in Φ are (2,5), (3,0), (3,2), (3,3), (3,4), (3,5), (4,1), (5,2). Node (3,3) next sends to node (3,4), the node with the lowest address in the upper half. The new sequence Φ becomes (2,5), (3,0), (3,2), (3,3). The next recipient is the node with the highest address in the lower half of Φ, namely (3,0). Finally, node (3,3) sends to node (3,2). Each of the receiving nodes is likewise responsible for delivering the message to the nodes in its subtree using the chain algorithm. As shown in Figure 14, the multicast operation requires 5 steps.
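The source's send sequence in Figure 12 can be sketched directly. Comparing 2·source against left + right stands in for the real-valued midpoint test in the pseudocode; that reading of the index arithmetic is our assumption.

```python
def umesh_sends(chain, s):
    """Unicasts issued by the source, located at index s of the sorted
    chain of source-plus-destination addresses, under the U-mesh algorithm.
    Returns (recipient, sub-chain the recipient must cover) pairs."""
    left, right, sends = 0, len(chain) - 1, []
    while left < right:
        if 2 * s < left + right:                     # source in lower half
            center = left + (right - left + 1) // 2  # send right
            sends.append((chain[center], chain[center:right + 1]))
            right = center - 1
        elif 2 * s > left + right:                   # source in upper half
            center = left + (right - left) // 2      # send left
            sends.append((chain[center], chain[left:center + 1]))
            left = center + 1
        else:                                        # source exactly in middle
            center = s - 1
            sends.append((chain[center], chain[left:center + 1]))
            left = s
    return sends
```

On the Figure 13 chain with source (3,3), the computed recipients are (2,3), (3,4), (3,0), (3,2), matching the walkthrough above.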

Figure 13. U-mesh regions for 16 destinations in a 2D mesh

Inspection of Figures 13 and 14 shows that if the constituent unicast messages follow XY routing, then no contention is possible among them. Two regions, low and high, are defined on either side (with respect to <_d) of the source node. By the construction of the U-mesh algorithm, any message sent by a node i in the high region will be destined for another node j, i <_d j, in the high region. Similarly, any message sent by a node i in the low region will be destined for another node j, j <_d i, in the low region. Stated in other terms, any reachable set includes nodes in either the low region or the high region, but not both. This property can be used to prove depth contention-free message transmission within each region and, furthermore, that no channel contention can exist on the boundary between the two regions. For the latter situation, we require the following lemma.

Lemma 3  If u <_d v <_d x <_d y, then dimension-ordered routes P(v, u) and P(x, y) are arc-disjoint.

Proof: The proof is by contradiction and similar to that of Lemma 1. Assume that there exists a common arc (r, s) in P(v, u) and P(x, y). Let h be the dimension in which (r, s) travels. Since the arc (r, s) is in the path P(v, u), then according to dimension-ordered routing, σ_i(r) = σ_i(u) for 0 ≤ i < h. Similarly, σ_i(r) = σ_i(y) for 0 ≤ i < h. Thus, σ_i(u) = σ_i(y) for 0 ≤ i < h.

Since u <_d v <_d x <_d y, then by the definition of dimension order, σ_i(u) = σ_i(v) = σ_i(x) = σ_i(y) for 0 ≤ i < h. By the definition of dimension order, since u <_d v and (r, s) lies on P(v, u), we have

Figure 14. U-mesh tree for 16 destinations in a 2D mesh

that σ_h(s) = σ_h(r) − 1. On the other hand, since x <_d y and (r, s) lies on P(x, y), we have that σ_h(s) = σ_h(r) + 1, which is a contradiction. Therefore, the lemma is proved. □

The reader will notice that Lemmas 1-3 cover three of the four possible cases for the sending of two unicast messages among four dimension-ordered nodes, u <_d v <_d x <_d y. The fourth case, in which u sends to v and y sends to x, is not guaranteed to be contention-free because P(u, v) and P(y, x) are not arc-disjoint. As an example, consider again the 2D mesh in Figure 13. Although (1,1) <_d (2,4) <_d (2,5) <_d (4,2), the XY paths P((1,1), (2,4)) and P((4,2), (2,5)) overlap between nodes (2,2) and (2,4).

Theorem 5  The implementation constituting a U-mesh tree is a minimum-time, depth contention-free implementation.

Proof: As in the chain algorithm, the source node sends at most ⌈log_2 m⌉ messages sequentially. A destination node receiving the message that was sent in the ith step will send at most ⌈log_2 m⌉ − i messages. Therefore, the height of the multicast tree, that is, the number of steps required to reach all destinations, is ⌈log_2 m⌉, which is minimum for m − 1 destinations.

In order to show that the U-mesh algorithm is depth contention-free, two cases must be considered: pairs of messages that are both in the same region (low or high) and pairs of messages in different regions. It is observed that, within either the high or low region, the U-mesh algorithm executes exactly the same as the chain algorithm, which is depth contention-free by Theorem 3. For the other case, let (u, v, P(u, v), t) and (x, y, P(x, y), τ) be two U-mesh unicasts, where u and v are in the low region and x and y are in the high region. By construction of the algorithm, it is true that v <_d u <_d x <_d y. By Lemma 3, the two messages are arc-disjoint, satisfying Condition 2 of Theorem 2. Therefore, the U-mesh implementation is depth contention-free. □

The U-mesh algorithm has been implemented on a 168-node Symult 2010 multicomputer, a 12 × 14 2D mesh system located at Caltech. A set of tests was conducted to compare the U-mesh algorithm with separate addressing and the Symult 2010 system-provided multidestination service Xmsend. In separate addressing, the source node sends an individual copy of the message to every destination. As described in [28], the Xmsend function was implemented to exploit whatever efficient hardware mechanisms may exist in a given system to accomplish multiple-destination sends. If such mechanisms are not present, then Xmsend is implemented as a library function that performs the necessary copying and multiple unicast calls. Figure 15 plots the multicast latency values for implementations of the three methods. A large sample of destination sets was randomly chosen in these experiments. The plotted multicast latency is the average multicast latency over the samples. The message length was 200 bytes. The multicast latency with separate addressing is approximately 273(m − 1); with the U-mesh algorithm, the latency is approximately 270 log_2 m. The experimental results demonstrate the superiority of the U-mesh multicast algorithm.

Figure 15. Multicast comparison in Symult 2010

8 Conclusions and Future Directions

This paper has presented efficient implementations of multicast communication services for scalable multiprocessors. Multicast communication may be used directly by parallel applications, in support of higher-level SPMD operations, or in the implementation of distributed shared memory. The system model, a wormhole-routed n-dimensional mesh, encompasses several commercial machines. The U-cube and U-mesh multicast algorithms produce depth contention-free, minimum-time multicast trees. Furthermore, the constituent unicast messages do not contend for the same channels, regardless of message length or startup latency. In both algorithms, the number of message passing steps required to multicast data to m − 1 destinations is ⌈log_2 m⌉, which is optimal for one-port architectures. Contention among constituent messages is avoided in spite of the use of deterministic, dimension-ordered routing of unicast messages. The proposed methods have been implemented on a 64-node nCUBE-2 and a 168-node Symult 2010; performance measurements demonstrate their advantage over other approaches.

Several related areas are open to further study. The algorithms presented in this paper are designed explicitly for one-port architectures. Some multicomputers [29] support a multiple-port architecture, in which a processor may send more than one message at the same time, significantly changing the design parameters of a multicast tree. Although the proposed algorithms may inadvertently use multiple ports in parallel on such platforms, algorithms that are designed explicitly for multiple-port architectures can achieve better performance; unicast-based multicast algorithms for all-port hypercubes are described in [30].

Additional areas to be addressed include adaptive and fault-tolerant multicast communication in wormhole-routed networks.
Finally, the U-mesh algorithm is applicable to any distance-insensitive switching strategy, including circuit switching, so the algorithm may be implemented and tested on circuit-switched systems.

Acknowledgements

The authors gratefully acknowledge the contributions of David F. Robinson, who offered suggestions to simplify the proofs of several lemmas and theorems in the paper. The authors would also like to thank Dr. C.-T. Ho and the anonymous reviewers for their many useful suggestions regarding this work. This work was supported in part by NSF grants CDA-9121641, MIP-9204066, CDA-9222901, and CCR-9209873, by the Department of Energy under grant DE-FG02-93ER25167, and by an Ameritech Faculty Fellowship. The authors thank the Computer Science Departments at Purdue University and Caltech for the use of their nCUBE-2 and Symult 2010, respectively. In addition, this work was made possible in part by the Scalable Computing Laboratory, which is funded by Iowa State University and the Ames Laboratory - USDOE, Contract W-7405-ENG-82.

References

[1] R. F. DeMara and D. I. Moldovan, "Performance indices for parallel marker-propagation," in Proceedings of the 1991 International Conference on Parallel Processing, pp. 658-659, St. Charles, Illinois, Aug. 12-17, 1991.

[2] V. Kumar and V. Singh, "Scalability of parallel algorithms for the all-pairs shortest path problem," Tech. Rep. ACT-OODS-058-90, Rev. 1, MCC, Jan. 1991.

[3] P. K. McKinley, H. Xu, E. Kalns, and L. M. Ni, "ComPaSS: Efficient communication services for scalable architectures," in Proceedings of Supercomputing '92, pp. 478-487, Nov. 1992.

[4] H. Xu, P. K. McKinley, and L. M. Ni, "Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers," Journal of Parallel and Distributed Computing, vol. 16, pp. 172-184, 1992.

[5] M. Metcalf and J. Reid, Fortran 90 Explained. Oxford University Press, 1990.

[6] K. Li and R. Schaefer, "A hypercube shared virtual memory," in Proceedings of the 1989 International Conference on Parallel Processing, vol. I, pp. 125-132, Aug. 1989.

[7] NCUBE Company, NCUBE 6400 Processor Manual, 1990.

[8] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W.-K. Su, "The architecture and programming of the Ametek Series 2010 multicomputer," in Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Volume I, (Pasadena, CA), pp. 33-36, ACM, Jan. 1988.

[9] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, pp. 62-76, Feb. 1993.

[10] W. J. Dally and C.
L. Seitz, \The torus routing chip," Journal of Distributed Computing,vol. 1, no. 3, pp. 187{196, 1986.[11] S. L. Johnsson and C.-T. Ho, \Optimum broadcasting and personalized communication inhypercubes," IEEE Transactions on Computers, vol. C-38, pp. 1249{1268, Sept. 1989.[12] Intel Corporation, A Touchstone DELTA System Description, 1991.[13] W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, R. E.Davison, and G. A. Fyler, \The message-driven processor: A multicomputer processing nodewith e�cient mechanisms," IEEE Micro, pp. 23{39, Apr. 1992.29

[14] R. Duzett and R. Buck, "An overview of the nCUBE-3 supercomputer," in Proceedings of Frontiers '92: The 5th Symposium on the Frontiers of Massively Parallel Computation, pp. 458-464, Oct. 1992.

[15] X. Lin, P. K. McKinley, and L. M. Ni, "Deadlock-free multicast wormhole routing in 2D mesh multicomputers," accepted to appear in IEEE Transactions on Parallel and Distributed Systems.

[16] C.-T. Ho and S. L. Johnsson, "Distributed routing algorithms for broadcasting and personalized communication in hypercubes," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 640-648, Aug. 1986.

[17] C. R. Jesshope, P. R. Miller, and J. T. Yantchev, "High performance communications in processor networks," in Proceedings of the 16th Annual International Symposium on Computer Architecture, pp. 150-157, 1989.

[18] D. H. Linder and J. C. Harden, "An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes," IEEE Transactions on Computers, vol. 40, pp. 2-12, Jan. 1991.

[19] J. Duato, "On the design of deadlock-free adaptive routing algorithms for multicomputers: Design methodologies," in Proceedings of the 1991 Parallel Architectures and Languages Europe Conference (PARLE '91), 1991.

[20] C. J. Glass and L. M. Ni, "The turn model for adaptive routing," in Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 278-287, May 1992.

[21] P. Berman, L. Gravano, J. Sanz, and G. Pifarre, "Adaptive deadlock- and livelock-free routing with all minimal paths in torus networks," in Proceedings of the 4th ACM Symposium on Parallel Algorithms and Architectures, pp. 3-12, June 1992.

[22] W. J. Dally and H. Aoki, "Deadlock-free adaptive routing in multicomputer networks using virtual channels," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 466-475, Apr. 1993.

[23] X. Lin, P. K. McKinley, and L. M. Ni, "The message flow model for routing in wormhole-routed networks," in Proceedings of the 1993 International Conference on Parallel Processing, vol. I, pp. 294-297, 1993.

[24] W. J. Dally, "Virtual-channel flow control," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 194-205, Mar. 1992.

[25] W. J. Dally and C. L. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Transactions on Computers, vol. C-36, pp. 547-553, May 1987.

[26] Y. Lan, A.-H. Esfahanian, and L. M. Ni, "Multicast in hypercube multiprocessors," Journal of Parallel and Distributed Computing, pp. 30-41, Jan. 1990.

[27] Y. Lan, L. M. Ni, and A.-H. Esfahanian, "A VLSI router design for hypercube multiprocessors," Integration: The VLSI Journal, vol. 7, pp. 103-125, 1989.

[28] Ametek Computer Research Division, Arcadia, California, Ametek System 14, Mars System Software User's Guide, Version 1.0, 1987.

[29] S. B. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb, "iWarp: An integrated solution to high-speed parallel computing," in Proceedings of Supercomputing '88, pp. 330-339, Nov. 1988.

[30] D. F. Robinson, D. Judd, P. K. McKinley, and B. H. C. Cheng, "Efficient collective data distribution in all-port wormhole-routed hypercubes," in Proceedings of Supercomputing '93, pp. 792-803, Nov. 1993.