Hot-potato algorithms for permutation routing

1168 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 6, NO. 11, NOVEMBER 1995

Ilan Newman and Assaf Schuster

Abstract: We develop a methodology for the design of hot-potato algorithms for routing permutations. The basic idea is to convert existing store-and-forward routing algorithms to hot-potato algorithms. Using it, we obtain the following complexity bounds for permutation routing:

• n x n Mesh: 7n + o(n) steps.
• $2^n$ hypercube: $O(n^2)$ steps.
• n x n Torus: 4n + o(n) steps.

The algorithm for the two-dimensional grid is the first to be both deterministic and asymptotically optimal. The algorithm for the $2^n$-node Boolean cube is the first deterministic algorithm that achieves a complexity of $o(2^n)$ steps.

Index Terms: Deflection routing, packet routing, parallel algorithms.

I. INTRODUCTION

This work studies the problem of permutation routing in synchronous networks of processors in which at most one packet can traverse any directed link in each time step. We consider a class of algorithms known as hot-potato or deflection routing algorithms [1], [8], [9], [11], [16], [17], [24], [25]. The important characteristic of these algorithms is that they use no buffer space for storing delayed packets. Each packet must leave the processor at the step following its arrival, unless it has arrived at its destination. Packets may arrive at a processor from all its neighbors and have to be redirected, each on a different outgoing link. This may cause some packets to be “deflected” away from their preferred direction. Such an unfortunate situation cannot happen in the traditional “store-and-forward” routing, in which a packet can be stored at a processor until it can be transmitted in its preferred direction.

Variants of hot-potato routing are used by parallel machines such as the HEP multiprocessor [22] and the Connection Machine [12] and by high-speed communication networks [17]. In particular, hot-potato routing is very important in fine-grained massively-parallel computers, such as the Caltech Mosaic C [21]. For such machines the addition of even a small sized storage buffer at each processor will cause a substantial increase in the cost of the machine. Another domain in which deflection-type routing is highly desirable is optical networks [1], [8], [24], [25]. The reason is that storage must take electronic form, which implies the need to convert from (and back to) the optical form.

Manuscript received July 29, 1993; revised Nov. 4, 1994.
I. Newman is with the Department of Mathematics and Computer Science, Haifa University, Haifa, Israel; e-mail: [email protected].
A. Schuster is with the Department of Computer Science, Technion, Haifa, Israel 32000; e-mail: [email protected].
To order reprints of this article, e-mail: transactions@computer.org, and reference IEEECS Log Number D95053.

The first hot-potato algorithm was proposed by Baran [2]. Borodin and Hopcroft, in a landmark paper, suggested a hot-potato algorithm for the hypercube [4]. Prager [20] showed that the Borodin-Hopcroft algorithm terminates in log N steps on the N-node hypercube for a special class of permutations. Numerous experimental results on hot-potato routing have been published [1], [8], [16], [17]. Feige and Raghavan [7] presented algorithms for the two-dimensional torus and for the hypercube. Their algorithm for the n x n two-dimensional torus has good performance for random instances. It also has good performance on any instance if some random delays are incorporated. Kaklamanis et al. [13] gave randomized and average-case algorithms for d-dimensional grids with improved constants. For these algorithms, however, the worst case complexity is not proven to be better than $n^d$, i.e., the size of the network. Recently Bar-Noy et al. [6] gave a deterministic algorithm of complexity $n 2^{O(\sqrt{\log n \log\log n})}$ for permutation routing on the two-dimensional mesh.

Up to this work, we are not aware of any deterministic asymptotically optimal algorithm for meshes. However, the works that are mentioned above focus on the simplicity of the algorithms. Recently, using our methods, the result for the mesh was further improved in [14].

A. This Work

We restrict ourselves to routing of permutations (or in fact any 1-1 partial mapping). We develop a simple methodology for the design of hot-potato algorithms for routing on networks. The method uses an extension of the fact that on many networks routing can be “reduced” to sorting as follows: First, sort the packets according to their destinations (if necessary, let empty sources create dummy packets destined to processor $\infty$). After sorting, either each packet has already reached its destination or the routing can be completed without congestion [15]. The general approach of transforming such an algorithm to a hot-potato algorithm is mentioned also in [7] for the torus; however, its application raises several inherent difficulties which we solve here. Furthermore, we extend the method to many other networks.

The main difficulty with sorting is that the networks we are dealing with are bipartite (meshes and the Boolean hypercube). Thus, a hot-potato algorithm that moves every packet at each step cannot sort in general, since two packets at odd distance can never meet. To solve this problem we divide the network in question into the two color classes (or sometimes into a more refined partition). We sort each of the classes simultaneously and then complete the routing by a correction phase. The other difficulty is that some packets have to wait at some locations during intermediate steps, which violates the hot-potato paradigm. The solution is to “vibrate” such packets back and forth around their supposed locations.
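The parity obstacle described above can be illustrated with a small sketch. It assumes an unbounded grid and hypothetical helper names (`colour`, `hop`, `walk`); it is only meant to show why two packets that start on different colour classes, and that both move at every step, can never occupy the same node at the same time.

```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def colour(site):
    """0 for black (even coordinate sum), 1 for white (odd sum)."""
    col, row = site
    return (col + row) % 2

def hop(site, direction):
    """Move along one link; a hot-potato packet must do this at every step."""
    dc, dr = MOVES[direction]
    return (site[0] + dc, site[1] + dr)

def walk(site, directions):
    """Follow a sequence of hops and return every site visited, in order."""
    trail = [site]
    for d in directions:
        trail.append(hop(trail[-1], d))
    return trail

# Two packets that start on different colours. Every hop flips the colour of
# the site a packet sits on, so their colours differ at every time step.
p_trail = walk((0, 0), ["right", "up", "left", "up", "right", "right"])
q_trail = walk((1, 0), ["left", "up", "right", "up", "left", "up"])
never_meet = all(colour(p) != colour(q) for p, q in zip(p_trail, q_trail))
```

Because the colours differ at every step, the two trails never coincide at the same time, which is why each colour class has to be sorted separately.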

Using this basic idea we were able to obtain hot-potato algorithms with the following time bounds:


• n x n Mesh: 7n + o(n) steps (Section II).
• $2^n$ hypercube: $O(n^2)$ steps (Section III).
• n x n Torus: 4n + o(n) steps (Section IV).

The method yields additional deterministic results for other networks as well, such as $O(d^2 n \log^2 n)$ for the d-dimensional grid, and $O(n^3)$ for the $2^n$-node shuffle-exchange with odd n. These results will not be described here.

B. The Model

Following [7] we define deflection networks as follows. A network is a directed graph whose nodes are processors and whose edges are unidirectional links between processors. We think of them as undirected graphs in which every edge represents two directional links in opposite directions.

The network is synchronous. At each step, a processor may receive up to one packet from each incoming edge and submit up to one packet along each outgoing edge. After submitting and receiving of packets, some prespecified, standard operations are performed by the processor on the headers of the incoming packets. More specifically, our algorithms use the ability to read and compare the destinations of two packets. Each node has as many outgoing edges as incoming ones. Note that this property ensures that each packet entering a processor at time step t will have a free outgoing link available to leave the processor at time step t + 1 (even though it may be in the "wrong direction"). Nodes have no buffers, so packets are never stored at intermediate nodes. When a packet reaches its destination it is absorbed there and disappears from the system.
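The model above can be sketched for a single node and a single step. This is a minimal illustration, not the paper's routing rule: the function name `assign_outgoing`, the preference lists, and the tie-breaking by list order are all assumptions made here. It only shows that, with as many outgoing links as incoming ones, every packet gets some link, possibly a deflecting one.

```python
def assign_outgoing(packets, links):
    """Give every packet at a node a distinct outgoing link.

    packets: list of (packet_id, preferred_links), earlier packets win ties
             (an assumed priority rule, chosen only for this sketch).
    links:   list of outgoing link names; len(packets) <= len(links).
    Returns a dict packet_id -> link.
    """
    free = list(links)
    assignment = {}
    for packet_id, preferred in packets:
        # Take the first preferred link that is still free ...
        choice = next((link for link in preferred if link in free), None)
        if choice is None:
            # ... otherwise the packet is deflected onto any free link.
            choice = free[0]
        free.remove(choice)
        assignment[packet_id] = choice
    return assignment

example = assign_outgoing(
    [("p1", ["right"]), ("p2", ["right", "up"]), ("p3", ["right"])],
    ["up", "down", "left", "right"],
)
```

In the example, `p1` takes the contested link, `p2` falls back to its second preference, and `p3` is deflected.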

II. ROUTING ON A TWO-DIMENSIONAL MESH

We denote by $M_n$ the n x n 2-dimensional mesh, having n columns and n rows and a node at each intersection point. We assume that the columns and rows are numbered 0, ..., n - 1, and that the numbering is in a consecutive order, say, left-to-right for columns, bottom-up for rows. A processor is identified by a pair (column, row) where $0 \le$ column, row $\le n - 1$. The column major order on the points of a rectangular mesh is the lexicographic order by the pair (row, col). In other words it increases along the columns. The row major order is defined analogously. A snake-like order along columns (rows) increases at every even column (row) and decreases at every odd column (row). For the sake of clarity of algorithm description we assume that $n/2$ and $n^{1/4}$ are positive integers.

THEOREM 2.1. There is a hot-potato algorithm that routes any (partial) permutation on $M_n$ in at most 7n + o(n) steps.

We follow the general idea described in Section I.A. Before we present the proof we need some definitions.

We color the points of $M_n$ by two colors according to the parity of the sum of their coordinates. We call black the points with an even sum and white the others. We attribute the color black to the packets originating at black points and white to the packets originating at white points. The columns are divided into adjacent pairs where the left column is even numbered and the right is odd. We refer to such a pair as a column pair. Similarly, an adjacent even-odd pair of rows is called a row pair. We view all the black points in a column pair as a single, black column, called scolumn (for snake column). It starts from the upper-left corner of the column pair, goes to the highest black point on the right column, then to the second one on the left, etc. A white scolumn is similar, except that it starts from the upper-right point of the column pair. The ith black or white scolumn is the scolumn contained in the ith column pair. An 8 x 8 example is depicted in Fig. 1. With this in mind we think of the black (white) points of $M_n$ as being arranged in an $n \times \frac{n}{2}$ submesh that is defined by the $\frac{n}{2}$ black (white) scolumns in an obvious way. The actual distance between neighboring points on this mesh is two (two neighbors in the same scolumn may choose arbitrarily one of the two paths of length two between them).

Fig. 1. Column pairs and scolumns in $M_8$. Adjacent points of the black scolumns 0 and 1 are connected by diagonal lines. Original mesh edges are horizontal and vertical.
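The scolumn construction can be made concrete with a short sketch. The function name `scolumn` is hypothetical, and the sketch assumes that row 0 is the top row of the column pair, so that the upper-left corner has an even coordinate sum and is black, as the description of the black scolumn requires.

```python
def scolumn(pair, n, colour="black"):
    """List the sites (column, row) of one scolumn, from the top downwards.

    pair:   index of the column pair (columns 2*pair and 2*pair + 1).
    n:      side of the mesh (assumed even).
    Assumption of this sketch: row 0 is the topmost row.
    """
    offset = 0 if colour == "black" else 1
    # At row r the point of the requested colour lies in the left column when
    # (r + offset) is even and in the right column otherwise.
    return [(2 * pair + (r + offset) % 2, r) for r in range(n)]

black0 = scolumn(0, 8)
white0 = scolumn(0, 8, "white")
```

Each scolumn holds n points and consecutive points are at mesh distance two, which is what lets the n/2 scolumns of one colour be treated as a virtual submesh.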

A general notion that we will often use is the vibration of a packet. When we say that a packet vibrates up (down, right, or left) we mean that it goes a step upwards (down, right, or left) and then a step back.

The algorithm is composed of two main phases as will follow from the following theorems.

THEOREM 2.2. Assume an arbitrary total order on the packets. Then there is a hot-potato algorithm that, simultaneously for each of the two color classes, sorts each class of packets into column major order along the scolumns. The algorithm terminates in 5n + o(n) steps on $M_n$.

THEOREM 2.3. Assume that $M_n$ contains packets of distinct destinations such that packets of each color are sorted in column major order along the scolumns according to their destinations by the lexicographic order of (column, row). Then the packets can be routed by a hot-potato algorithm in at most 2n + 2 steps.

We start by proving Theorem 2.3.

PROOF (OF THEOREM 2.3). Consider any given row of $M_n$. It contains packets of two colors. Let us concentrate on the black packets. They are sorted by their destination lexicographically by (column, row). Since there are n black points in each black scolumn, all black packets in this row are destined to distinct columns. Otherwise, this would imply that there are more than n black packets that have the same destination column. This however would contradict the assumption that the destinations are distinct. The same reasoning applies to the white color, leading to the following algorithm which “unpacks” the packets to their destinations.

ALGORITHM MESH-UNPACK

We describe the algorithm only for the black packets. The actions on the white packets are the same (in a dual way, where scolumns are concerned). Since a white packet and a black packet can never meet there cannot be any conflicts between these simultaneous actions. Thus one only has to ensure the consistency for the black packets.

PHASE I. For n steps every packet moves horizontally towards its destination column. Once it reaches that place it starts vibrating for the rest of the n steps. The vibration is done as follows. If the destination column is reached at an even step (in other words if the point reached at the destination column is black), then the packet vibrates vertically in its row pair. If the destination column is reached at an odd step, then the packet vibrates horizontally in its column pair.

Observe that since for every row at most one packet of a certain color can have the same destination column, there would be no conflicts in the vibration. What remains to be checked is the possibility of a conflict between a packet p that vibrates horizontally at a location (c, r) and a packet q that is on its way to its destination column and passes through (c, r). Assume then that p and q are black packets and that q arrives at (c, r) from the left. Since q wants to pass (c, r) it follows that its destination column $c_q > c$, but then q must have been to the right of p by the assumption that each color is sorted. Thus it cannot be that q arrives at (c, r) after p. A symmetric argument applies to the case when q arrives from the right. By the above discussion we conclude that there are no conflicts between any two packets. Hence after n steps (an even number) each packet is in its destination column pair. By parity reasons, at the end of these n steps, packets that reached their destination column on odd steps are a place right or left from their destination column. Let us call such packets exile packets. Note also that at this point of time all black packets are at black sites, up to two packets at a site: one that is on its proper column and the other that is an exile packet.
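The parity bookkeeping of Phase I can be sketched as follows. The function name `phase1_final_site` is hypothetical, conflicts are ignored (the argument above shows there are none), and the sketch assumes that a vibrating packet alternates between its site and the partner site in its row pair or column pair, obtained by flipping the lowest bit of the coordinate.

```python
def phase1_final_site(col, row, dest_col, n):
    """Site of a packet after the n steps of Phase I, and whether it is exile.

    The packet first walks |dest_col - col| steps horizontally, then vibrates
    for the remaining steps. n is assumed even, as in the text.
    """
    assert n % 2 == 0
    distance = abs(dest_col - col)
    remaining = n - distance
    if distance % 2 == 0:
        # Arrived at an even step: vertical vibration inside the row pair.
        # 'remaining' is even, so the packet is back on its own row.
        final_row = row if remaining % 2 == 0 else row ^ 1
        return (dest_col, final_row), False
    # Arrived at an odd step: horizontal vibration inside the column pair.
    # 'remaining' is odd, so the packet ends in the partner column.
    final_col = dest_col ^ 1 if remaining % 2 == 1 else dest_col
    return (final_col, row), True

site, exile = phase1_final_site(col=1, row=2, dest_col=4, n=8)
```

In both branches the final site has the same colour as the starting site, in agreement with the remark that every black packet ends Phase I on a black site.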

PHASE II. For a step each exile packet moves to its destination column (i.e., horizontally) while nonexile packets move vertically in their row pair. After this step all black packets are on white sites, at most two at a site and all in their destination columns. Due to the assumption that the black packets are initially sorted in column major order, the following property now holds:

Sorting Property. The packets on each column are partitioned into at most two sets that correspond to the different scolumns of their origin (note that indeed all black packets that are destined to any given column originally reside on at most two scolumns by the sorting assumption). One of the sets is “above” the other, that is, all its packets are not below the highest packet of the other set (here “higher” and “lower” correspond to the natural order on the row coordinates). Each packet in the higher set is destined to a lower place than its current location and each packet in the lower set is destined to a higher place than its current location. Moreover, in each of the sets the packets are sorted. That is, if a packet p is higher than a packet q (of the same set) then the destination of p is higher than the destination of q.

PHASE III. For the next t steps each packet attempts to go towards its destination, that is, vertically in the column. In case of a conflict, the packet having the farthest destination is sent first, while the other one vibrates horizontally (two steps) inside the column pair. Once a packet reaches its destination it is absorbed there. Initially there are at most two packets at each site. At every other step one or two new packets may reach a given node from the vertical direction and at least one of its packets leaves it. If two new packets enter the node at the same step from the vertical directions then two packets leave it vertically in opposite directions in the next step (unless one is absorbed). Thus, in each step a node needs to vibrate at most one packet and there are no horizontal conflicts.

There are also no conflicts between the packets that are going down and those that are going up, since they use edges in different directions. Thus we may consider the black packets going up and the black packets going down as independent problems. In the following claim we bound the number of steps required for the black packets going up to reach their destinations. A similar discussion applies to the black packets going down.

CLAIM 2.1. Each packet that is destined to a node which is higher than its location at the beginning of Phase III arrives at its destination in at most n + 1 steps.

The important movement of the packets during Phase III (besides vibrating) is along the vertical dimension, inside the destination column. Hence we use the terms “destination” of a packet and its “destination row” interchangeably. Similarly, we refer to “higher” or “lower” nodes corresponding to their higher or lower row coordinates.

PROOF. Let p be a packet whose destination is $d_p$. Let $h_1$ be any node at row $d_p$ or lower. Assume p arrives at $h_1$ at the $T$th step of Phase III, after moving a vertical distance $x$. Let $B$ be the set of all (black, going up) packets that initially reside between p and $h_1$, i.e., strictly higher than the initial location of p and strictly lower than $h_1$. Note that these packets will be destined to rows higher than $d_p$. Let $b = |B|$. Let $\varepsilon = 1$ if there is a packet $p'$ that resides initially at the same site as p and with higher destination than p, and $\varepsilon = 0$ otherwise.

CLAIM 2.2. $T \le x + b + 2\varepsilon$.

PROOF. The proof is by induction on $T$. The base cases for $T = 0, 1, 2, 3, 4$ may be verified by exhaustive scan of all possibilities for $x$, $b$, and $\varepsilon$ giving a certain value of $T$. Let $h_0$ be the initial site of p. For the inductive step assume $T > 4$, so $x > 2$. We denote by $T_q$ the time for a (black, going up) packet q to reach $h_1$ (so $T = T_p$). There are four basic initial cases to consider.

1) $\varepsilon = 0$ and there are no packets at site $h_0 + 2$. Then, after two steps, p resides at $h_0 + 2$, which is a new situation for p with $b' \le b$, $x' = x - 2$, $\varepsilon' = \varepsilon = 0$. Denote by $T'_p$ the time it takes for p to get to $h_1$, given the new situation. Since $T'_p = T_p - 2$, the induction hypothesis holds for the new situation, and so

$$T_p = 2 + T'_p \le 2 + x' + b' + 2\varepsilon' \le x + b + 2\varepsilon.$$

2) $\varepsilon = 0$ and there is a single packet $p'$ at row $h_0 + 2$. For $p'$: $x' = x - 2$ (to get to $h_1$), $b' = b - 1$ ($b'$ is the number of (black, going up) packets between $h_0 + 2$ and $h_1$, and there are no such packets between $h_0$ and $h_0 + 2$), and $\varepsilon' = 0$. p will get to $h_1$ two steps after $p'$, hence

$$T_p = 2 + T_{p'} \le 2 + x' + b' + 2\varepsilon' = x + b - 1 < x + b + 2\varepsilon.$$

3) $\varepsilon = 0$ and there are two packets at row $h_0 + 2$, and let $p'$ denote the one with a lower destination row. For $p'$: $x' = x - 2$, $b' = b - 2$, and $\varepsilon' = 1$. p will get to row $h_1$ two steps after $p'$, hence

$$T_p = 2 + T_{p'} \le x + b = x + b + 2\varepsilon.$$

4) $\varepsilon = 1$; denote by $p'$ the other packet that resides with p at $h_0$. For $p'$: $\varepsilon' = 0$, $x' = x$, and $b' = b$. p will reach $h_1$ two steps after $p'$, hence

$$T_p = 2 + T_{p'} \le 2 + x + b = x + b + 2\varepsilon.$$

Since all cases are included in the above four, the proof of Claim 2.2 is complete. □

The proof of Claim 2.1 now follows by applying Claim 2.2 for $p_0$, the last packet to arrive at its destination. $p_0$ starts Phase III at node $h_0$, and arrives at its destination $d_{p_0}$ at time $T_{p_0}$, after traveling vertically a distance $x_0$. Let $B_0$ denote the set of (black, going up) packets between (not including) $h_0$ and $d_{p_0}$ at the beginning of Phase III. Let $b_0 = |B_0|$. Each packet in $B_0$ is destined to a row higher than $d_{p_0}$, hence $b_0 \le (n - 1) - d_{p_0}$. Putting these into the inequality from Claim 2.2 (and using $\varepsilon \le 1$) we get

$$T_{p_0} \le x_0 + b_0 + 2 = d_{p_0} - h_0 + b_0 + 2 \le d_{p_0} - h_0 + (n - 1) - d_{p_0} + 2 = n - h_0 + 1 \le n + 1.$$ □

We have shown that Phase I takes n steps, Phase II takes a single step, and Phase III takes at most n + 1 steps, hence the whole Algorithm MESH-UNPACK takes at most 2n + 2 steps. This completes the proof of Theorem 2.3. □

PROOF (OF THEOREM 2.2). We implement a known algorithm that sorts an m x m array in a snake-like order in 3m + o(m) steps [23] (see also [15]). Here, we use it to simultaneously sort each of the two “virtual” $n \times \frac{n}{2}$ submeshes of $M_n$. A phase of the algorithm of [23] is typically to sort a row or a column. We need the following lemma.

LEMMA 2.1. There is a hot-potato algorithm that sorts simultaneously the even places and the odd places of each row in $M_n$ in n steps.

PROOF. We implement in each row of $M_n$ an odd-even transposition sort [10], see also [15]. The abstract odd-even transposition sort algorithm assumes the ability to compare and exchange two adjacent elements of the same row in a single step. It takes N steps on a linear array of N places, as follows:

1) At odd steps each element at an odd place is compared with its successor and the two elements are exchanged if their relative order contradicts the order of the sort.

2) At even steps the same is applied for packets that are at even places.

As shown in [15], after N steps the linear array is sorted in linear order.
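The abstract algorithm just described can be written down directly. This is a plain sequential sketch with a hypothetical function name; places are numbered from 1 as in the text, so the odd places correspond to the 0-based indices 0, 2, 4, and so on.

```python
def odd_even_transposition_sort(values):
    """Sort a list with N rounds of odd-even transposition (ascending order)."""
    a = list(values)
    n = len(a)
    for step in range(1, n + 1):
        # Odd steps pair places (1,2), (3,4), ...; even steps pair (2,3), (4,5), ...
        # With 0-based indices these pairs start at index 0 and 1 respectively.
        start = 0 if step % 2 == 1 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

result = odd_even_transposition_sort([5, 1, 4, 2, 3])
```

All comparisons within one step touch disjoint pairs, which is what allows a linear array to perform each step in parallel.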

Recall that each row of $M_n$ contains packets of two colors in alternating places. The hot-potato algorithm simulates the above algorithm simultaneously for both colors by repeating the following four steps n/4 times.

1) Every odd packet of each color moves a step to the right. Every even packet of each color moves a step to the left.

2) Packets are compared (at sites that contain two packets); the bigger moves a step to the right and the smaller a step to the left.

3) The same as in step 1 where even replaces odd and vice versa.

4) The same as in step 2.

Steps 1 and 2 together simulate step 1) in the odd-even transposition sort for both colors simultaneously. Steps 3 and 4 simulate step 2). Note also that every packet except for the first (last) of each color moves at each step of the above algorithm. The first (last) packets of each color vibrate horizontally for two steps if they are supposed to wait (which causes no conflicts). Hence, the algorithm conforms to the hot-potato paradigm. Since each color class in a row contains n/2 packets, and since the algorithm above simulates an odd-even transposition sort for n/2 steps for each color, it ends with each color class being sorted. □
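The four-step loop can be simulated on a single row to check that no directed link is ever used twice in one step. The sketch below uses a hypothetical function `hot_potato_row_sort`, ranks packets from 0 within each colour, and assumes that a waiting first or last packet vibrates towards the neighbouring site inside its column pair.

```python
def hot_potato_row_sort(keys):
    """Sort both colour classes of one row; returns (black, white, steps).

    keys[i] is the key of the packet that starts at site i; len(keys) must be
    divisible by 4. Raises AssertionError if two packets share a directed link.
    """
    n = len(keys)
    assert n % 4 == 0
    m = n // 2
    # cls[c][j]: key of the colour-c packet of rank j, whose home site is 2*j + c.
    cls = [list(keys[c::2]) for c in (0, 1)]
    steps = 0
    for rnd in range(m):
        start = rnd % 2            # which ranks are paired with their successor
        converge, separate = [], []  # directed links used in the two steps
        for c in (0, 1):
            paired = set()
            for j in range(start, m - 1, 2):
                s = 2 * j + c
                converge += [(s, s + 1), (s + 2, s + 1)]   # meet at site s + 1
                separate += [(s + 1, s), (s + 1, s + 2)]   # smaller left, bigger right
                if cls[c][j] > cls[c][j + 1]:
                    cls[c][j], cls[c][j + 1] = cls[c][j + 1], cls[c][j]
                paired.update((j, j + 1))
            for j in range(m):
                if j not in paired:
                    s = 2 * j + c
                    converge.append((s, s ^ 1))   # vibrate inside the column pair
                    separate.append((s ^ 1, s))
        for links in (converge, separate):
            assert len(links) == n                  # every packet moved
            assert len(set(links)) == n, "two packets on one directed link"
        steps += 2
    return cls[0], cls[1], steps

black, white, steps = hot_potato_row_sort([7, 3, 6, 2, 5, 1, 4, 0])
```

The run takes n steps in total, matching the bound of Lemma 2.1, and every packet moves at every step.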

COROLLARY 2.1. There is a hot-potato algorithm that sorts simultaneously the white scolumns and the black scolumns in $M_n$ in 2n steps.

PROOF. The algorithm and the proof are the same as in Lemma 2.1, except that routing is done inside a column pair along a scolumn. As the number of elements of the same color in a column pair is n, the time complexity doubles. □

We proceed with the proof of Theorem 2.2. The following is a “compare and exchange” sorting algorithm for a $k \times \frac{k}{2}$ mesh $M_k$. The output is sorted in a snake-like order along columns [15] (the description in [15] gives a snake-like order along the rows).


PHASE 1. Divide $M_k$ into $k^{1/2}$ blocks of size $k^{3/4} \times \frac{k^{3/4}}{2}$ and simultaneously sort each block in a snake-like order along the columns.

PHASE 2. Perform a $k^{1/4}$-way shuffle of the rows. In particular, permute the rows so that the $k^{3/4}$ rows in each block are distributed evenly among the $k^{1/4}$ horizontal slices (where a horizontal slice is simply a row of blocks).

PHASE 3. Sort each block into snake-like order along the columns.

PHASE 4. Sort each row in linear order.

PHASE 5. Collectively sort blocks 1 and 2, blocks 3 and 4, etc., of each horizontal slice into snake-like order along the columns.

PHASE 6. Collectively sort blocks 2 and 3, blocks 4 and 5, etc., of each horizontal slice into snake-like order along the columns.

PHASE 7. Sort each column in linear order according to the direction of the overall $\frac{k^2}{2}$-cell snake.

PHASE 8. Perform $2k^{3/4}$ steps of odd-even transposition sort on the overall snake-like (column-wise) $M_k$ mesh.

In order to apply this as a hot-potato algorithm on $M_n$ we take k = n and sort simultaneously each of the black and the white $n \times \frac{n}{2}$ submeshes that are defined by the black and the white scolumns. The basic operations (those that are not recursive calls) in the algorithm above are performed as follows. Sorting the rows (in parallel), which is done in Phase 4, uses the algorithm from Lemma 2.1, and takes n steps. Sorting the scolumns (in parallel, Phase 7) uses the algorithm from Corollary 2.1, and takes 2n steps. To perform the shuffle in Phase 2, we observe that each packet has to move only vertically, thus there are no conflicts, and Phase 2 takes n steps. The packets that have to wait at a given location vibrate horizontally inside the column pair. To perform Phase 8 we note that in $M_n$ consecutive elements in the snake-like order of each color are at distance two apart. Thus every step of the odd-even transposition sort can be implemented by two steps as in Lemma 2.1. This 4n + o(n) sorting algorithm gives us a snake-like order along scolumns for the two $n \times \frac{n}{2}$ interleaved submeshes. In order to get the desired scolumn major order for each of the submeshes, the order along the odd scolumns needs to be reversed. This requires that the two packets at the ith position of a column pair should both be at the (n - 1 - i)th position: either on the same column for each, or both of them have to exchange columns, depending on the parity of i. We can route this “reverse scolumns” permutation vertically along columns, and exchange columns (if necessary) when at the destination row. This process is conflict free and takes n steps. Packets arriving earlier vibrate horizontally. Thus the whole scolumn major order sorting takes 5n + o(n).

This completes the proof of Theorem 2.2. □

We can deduce now Theorem 2.1.

PROOF (OF THEOREM 2.1). The theorem follows directly from Theorem 2.2 and Theorem 2.3. There is still a slight difficulty: It is assumed that whenever a packet reaches its destination it is absorbed there. However, this assumption disturbs the sorting process. We can overcome this problem by either preventing the packets from being absorbed during the sorting phase of the algorithm, or by creating “dummy” packets that carry on the header information at processors where the packets are absorbed. These “dummy” packets may be removed from the system in the second phase. □
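As a worked check, the step counts stated in the proofs of Theorems 2.2 and 2.3 add up as follows. The symbols $T_{\text{sort}}$, $T_{\text{unpack}}$ and $T_{\text{total}}$ are introduced here only for this summary, and the $o(n)$ term is read as collecting the phases for which no explicit linear cost is given in the text.

```latex
\begin{align*}
T_{\text{sort}}   &= \underbrace{n}_{\text{Phase 2}} + \underbrace{n}_{\text{Phase 4}}
                     + \underbrace{2n}_{\text{Phase 7}} + o(n)
                     + \underbrace{n}_{\text{scolumn reversal}} = 5n + o(n), \\
T_{\text{unpack}} &= \underbrace{n}_{\text{Phase I}} + \underbrace{1}_{\text{Phase II}}
                     + \underbrace{(n + 1)}_{\text{Phase III}} = 2n + 2, \\
T_{\text{total}}  &= T_{\text{sort}} + T_{\text{unpack}} = 7n + o(n).
\end{align*}
```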

III. ROUTING ON THE HYPERCUBE

We denote by $C_n$ the n-dimensional Boolean hypercube with $2^n$ nodes. Each node is identified as an element of $\{0, 1\}^n$. Two nodes are connected if their identifiers differ in exactly one position. $C_n$ is a bipartite graph with two color classes: black points, the points of even parity (i.e., even number of 1s in the id), and white points, the points with odd parity. The distance between any two points of the same color is even.

In the sequel we refer to subcubes of $C_n$. They will always be of the form $C_k^w = \{uw \mid u \in \{0, 1\}^k\}$ for a fixed $(n - k)$-bit string $w$.

For all $2 \le k \le n$ and $w \in \{0, 1\}^{n-k}$, let $L_k^w$ be an order of the black points on the subcube $C_k^w$ that is defined as follows: The ith black element ($i \ge 0$) in this order is $\delta a w$, where $a$ is the binary representation of i, and $\delta \in \{0, 1\}$ is such that the parity of the whole string is even. For example, the order $L_4^{111}$ on the black elements (increasing from left to right) is

1000111; 0001111; 0010111; 1011111; 0100111; 1101111; 1110111; 0111111

We denote by $L_n$ this order on $C_n$, i.e., $L_n = L_n^{\Lambda}$ where $\Lambda$ is the empty string. A similar order is defined on the white elements (in which $\delta$ ensures that the parity of each string is odd).
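The order on the black points can be generated mechanically. The sketch below uses a hypothetical function `black_order`, represents nodes as bit strings, and writes the binary representation of i with k - 1 bits, most significant bit first, which is the reading that reproduces the example above.

```python
def black_order(k, w=""):
    """Black points of the subcube with fixed suffix w, in the order described.

    The i-th element is delta + a + w, where a is the (k-1)-bit binary
    representation of i and delta makes the number of 1s even. Requires k >= 2.
    """
    elements = []
    for i in range(2 ** (k - 1)):
        a = format(i, "b").zfill(k - 1)
        delta = str((a + w).count("1") % 2)
        elements.append(delta + a + w)
    return elements

example = black_order(4, "111")
```

Restricting `example` to the strings whose fourth bit is 0 yields `black_order(3, "0111")`, so the order on a subcube is inherited by the smaller subcubes it contains.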

A useful property of the order $L_k^w$:

FACT. For every $w \in \{0, 1\}^{n-k}$, the order that $L_k^w$ induces on the subcubes $C_{k-1}^{0w}$ and $C_{k-1}^{1w}$ is $L_{k-1}^{0w}$ and $L_{k-1}^{1w}$, respectively. Thus, in particular, $L_n$ induces on each subcube $C_k^w$ the order $L_k^w$.

THEOREM 3.1. Let $\mathcal{F}$ be some (partial) permutation on $C_n$. Assume that the black (white) packets of $C_n$ are sorted in the order $L_n$ by the lexicographic order of their destination addresses according to $\mathcal{F}$, where the nth dimension is the least significant bit. Then there is a hot-potato routing algorithm that completes the routing of $\mathcal{F}$ in n steps.

PROOF. The proof uses the following lemma.

LEMMA 3.1. Assume that the black (white) packets are sorted in $C_n$ as assumed in Theorem 3.1. Then for every $2 \le k \le n$ and $w \in \{0, 1\}^{n-k}$, the black (white) packets in $C_k^w$ have destinations with distinct first k bits.

PROOF. Assume to the contrary that for a certain $C_k^w$ there are two black packets $p_1$ and $p_2$ in $C_k^w$ with the same first k-bit string $u$ in their destinations. That is, $p_1$ has destination $uz_1$ and $p_2$ has destination $uz_2$. We assume w.l.o.g. that $uz_1 < uz_2$ with respect to the lexicographic order. Assume moreover that there are r black packets in $C_n$ whose destinations are bigger than $uz_1$ and lower than or equal to $uz_2$. By the assumption that no two packets in $C_n$ have the same destination it follows that $r \le o(z_2) - o(z_1) \le 2^{n-k} - 1$, where $o(z)$ is the rank of $z$ in the lexicographic order. However, since the packets are sorted in the order $L_n$ in $C_n$, these r packets (of which the last is $p_2$) must reside after $p_1$ at the first r consecutive locations according to $L_n$. By the definition of $L_n$ the successor of any packet that is in $C_k^w$ is in $C_k^{w(1)}$, where $w(1) = w + 1 \bmod 2^{n-k}$ and '+' here is on $w$ as an integer represented in binary. Thus, in general, the jth consecutive element to the packet $p_1$ (that resides in $C_k^w$) resides in $C_k^{w(j)}$, where $w(j) = w + j \bmod 2^{n-k}$. Thus $p_2$ should reside in $C_k^{w(r)}$, and since $r \le 2^{n-k} - 1$ it follows that $w(r) \ne w$, which is a contradiction. □

The same applies to white packets.

The lemma suggests the following algorithm:

ALGORITHM CUBE-UNPACK

Each packet goes greedily to its destination while correcting its “mismatched” bits from left to right. More formally: suppose at some step a packet is at location $u\bar{x}_k w$ for some $(k - 1)$-bit string $u$, and is destined to location $u x_k w'$ in the hypercube; then at the next step, the packet “corrects its kth bit” by moving to location $u x_k w$.
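The bit-correcting rule can be simulated to check empirically that a sorted start produces no link conflicts. In this sketch the names `order` and `cube_unpack` are hypothetical, nodes and destinations are bit strings with the first dimension leftmost, packet colours are taken from their origin nodes, and the sorted placement is computed directly instead of being produced by a sorting phase. It assumes n is at least 2.

```python
def order(n, parity):
    """Nodes of one colour class (0 = black, 1 = white) of the n-cube,
    listed in the order defined in the text."""
    nodes = []
    for i in range(2 ** (n - 1)):
        a = format(i, "b").zfill(n - 1)
        delta = (a.count("1") + parity) % 2
        nodes.append(str(delta) + a)
    return nodes

def cube_unpack(n, dest):
    """Route a permutation from the sorted placement; returns the step count.

    dest maps every origin node to its destination. Raises AssertionError if
    two packets at one node want the same outgoing link in the same step.
    """
    packets = []
    for parity in (0, 1):
        nodes = order(n, parity)
        # Assumed outcome of the sorting phase: the i-th node of the order
        # holds the i-th smallest destination of that colour class.
        keys = sorted(dest[v] for v in nodes)
        packets += [(v, d) for v, d in zip(nodes, keys) if v != d]
    steps = 0
    while packets:
        used, still_moving = set(), []
        for v, d in packets:
            k = next(i for i in range(n) if v[i] != d[i])  # leftmost mismatch
            assert (v, k) not in used, "link conflict"
            used.add((v, k))
            v = v[:k] + d[k] + v[k + 1:]
            if v != d:                                      # else: absorbed
                still_moving.append((v, d))
        packets = still_moving
        steps += 1
    return steps

affine = {format(v, "04b"): format((5 * v + 3) % 16, "04b") for v in range(16)}
steps_needed = cube_unpack(4, affine)
```

Since every step corrects one bit of every live packet, the run cannot exceed n steps once conflicts are ruled out.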

Clearly, if there is no congestion, the above algorithm terminates in n steps. Assume, to the contrary, that at some step t there are two packets p and q that reside at the same node $u\bar{x}_k w'$, and are both trying to correct the kth bit. By the algorithm, both p and q have the same first k-bit string, namely $u x_k$, as their first k destination bits. Observe that at the beginning of step t both packets reside in $C_{k-1}^{\bar{x}_k w'}$. Since the algorithm changes location bits left to right, and by the assumption of Theorem 3.1, we conclude that both p and q were sorted into $C_{k-1}^{\bar{x}_k w'}$ previous to the beginning of Algorithm CUBE-UNPACK, and have never left this subcube previous to time t. Thus, by Lemma 3.1, p and q cannot have the same k - 1 first bits in their destination addresses, a contradiction. This completes the proof of Theorem 3.1 since, after correcting all “mismatched” bits, packets are absorbed at their final destination. □

In order to arrive at the situation assumed by Theorem 3.1 we need some preliminary tools. The following is the well known odd-even merge sort algorithm by Batcher [3]. The algorithm uses a recursive algorithm for merging two sorted lists. We first describe the merge algorithm.

The input is composed of two sorted lists A and B of size L ($L \ge 2$). The ranks of the elements in the lists are in the range 0, ..., L - 1, i.e., the first element in a list is ranked zero.

1) Let ODD(A) (EVEN(A)) be the sorted list that contains the elements of A of odd (even) ranks. Similarly ODD(B) (EVEN(B)) is defined. Apply recursively MERGE(ODD(A), EVEN(B)) and MERGE(EVEN(A), ODD(B)). This results in two ordered lists X and Y.

2) Compare the ith element of X with the ith element of Y. Make the bigger one the ith element in Y, and make the smaller one the ith element in X.

The result is a sorted list in which the smallest element is the first element of X . The successor of the ith element of X is the ith element of Y. The successor of the ith element of Y is the ( i + 1)th element of X.

The sorting algorithm:

1) Split the input list arbitrarily into two equal sized lists A and B. Sort recursively A and B.

2) Merge A and B by the merge algorithm.

We use the Batcher algorithm to sort separately and in parallel the black packets and the white packets according to the order $\le_n$, while using the packets' destinations as the keys. We describe the algorithms with respect to the black packets. Everything is done simultaneously in a similar way for the white packets. We remark that, as in the case of the mesh, we do not allow a packet to be absorbed in the sorting process even if it reaches its destination. This is somewhat unnatural but is essential in retaining the desired structure. Alternatively, if a packet reaches its destination it may be absorbed if the processor creates a "dummy" packet that lives for the duration of the sorting phase.

In the following description when we refer to "odd" ("even") elements in a sorted list, we mean the packets that are at nodes of odd (even) rank with respect to the order.
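The abstract merge and sorting algorithms described above (before any hot-potato simulation) can be sketched in a few lines. This is an illustrative sketch only: the function names `odd_even_merge` and `batcher_sort` are ours, list lengths are assumed to be powers of two so that the recursion always splits evenly, and a base case for lists of size one (a single comparison) is added as an assumption, since the text states the merge only for L ≥ 2.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length (abstract Batcher merge)."""
    if len(a) == 1:
        # Assumed base case: one comparison sorts two single elements.
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Ranks start at zero, so a[1::2] is ODD(A) and a[0::2] is EVEN(A).
    x = odd_even_merge(a[1::2], b[0::2])  # MERGE(ODD(A), EVEN(B))
    y = odd_even_merge(a[0::2], b[1::2])  # MERGE(EVEN(A), ODD(B))
    merged = []
    for xi, yi in zip(x, y):
        # Step 2: the smaller of the pair stays in X, the bigger goes to Y;
        # the output interleaves X and Y.
        merged.append(min(xi, yi))
        merged.append(max(xi, yi))
    return merged


def batcher_sort(items):
    """Sort a list whose length is a power of two by recursive odd-even merging."""
    if len(items) <= 1:
        return list(items)
    half = len(items) // 2
    return odd_even_merge(batcher_sort(items[:half]), batcher_sort(items[half:]))
```

For instance, `odd_even_merge([1, 4, 6, 9], [2, 3, 7, 8])` first produces X = [2, 4, 7, 9] and Y = [1, 3, 6, 8], and the final pairwise comparison interleaves them into a sorted list.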

THE MERGE ALGORITHM

We assume that the sorted lists are in C:-l and CkT1 and they are sorted according to and &, respectively. We write C:-] = C2:z U CE2, E E {0, l}. Observe that since each of the lists C,”-, , CA-] is sorted according to in the corresponding subcube, the odd elements of CA-] are in C:!z and the even elements are in C!!2. Moreover, each of these sub- lists is sorted according to &. Similarly for Cnovl, its odd elements are in and the even ones are at C:, .

1) For every n - 3-bit string a, each pair of elements lo- cated at nodes of the form 6 0 1 (an even element of Ck-]) and &OO (an even element of Cf-]) switch places by moving via &XI0 in two steps. Other elements vibrate in the n - 1 dimension (causing no conflicts). Note that this places the odd and even elements as required by step 1 of the abstract merge algorithm. That is, C:-, contains ODD(C:-,) sorted according to LE2 in and EVEN(Ck-,) sorted according to e-2 in C z 2 . A dual situation is in c:-~.

2)Apply the merge simultaneously on each of the n - 1- dimensional subcubes C:-l and CA-] to merge the sorted black lists in subcubesCFz, C;?, and the lists in subcubes Cz!2, C;!2, respectively. This simulates step 1 of the abstract merge algorithm described above.


3) Step 2 of the abstract algorithm is simulated in two steps. Each pair of elements 6al and 6aO are compared by node 6010 and switched (if necessary).

This algorithm is a hot-potato merge algorithm for the black and white packets simultaneously. Its complexity t(n) behaves according to t(n) = 4 + t(n - 1) which gives t(n) = 4n.

THE SORTING ALGORITHM

1) Sort simultaneously the black packets in $C^{\varepsilon}_{n-1}$, $\varepsilon \in \{0, 1\}$.

2) Merge the sorted lists in the two (n - 1)-subcubes using the merge algorithm.

THEOREM 3.2. The algorithm above is a hot-potato algorithm that sorts simultaneously the white packets and the black packets on $C_n$ in $O(n^2)$ steps according to the order $\le_n$.

PROOF. The algorithm indeed simulates the abstract Batcher algorithm in a hot-potato manner relative to the order $\le_n$. Its complexity s(n) behaves according to s(n) = s(n - 1) + t(n) = s(n - 1) + 4n, which gives $s(n) = O(n^2)$. □

Together, Theorems 3.1 and 3.2 directly imply the following theorem:

THEOREM 3.3. There is a hot-potato permutation routing algorithm for the $2^n$-node hypercube which completes the routing in $O(n^2)$ steps.
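For reference, the two recurrences used for the hypercube sort unroll as follows. The base values t(0) = 0 and s(0) are assumptions made here for illustration; the text does not state them explicitly.

```latex
\begin{align*}
t(n) &= 4 + t(n-1) = 4 + 4 + t(n-2) = \dots = 4n + t(0) = 4n
  && \text{(assuming } t(0) = 0\text{)} \\
s(n) &= s(n-1) + t(n) = s(n-1) + 4n \\
     &= s(0) + \sum_{i=1}^{n} 4i = s(0) + 2n(n+1) = O(n^2)
\end{align*}
```

Adding the n steps of the unpacking phase of Theorem 3.1 does not change the order of growth, so the overall routing bound remains quadratic in n.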

IV. ROUTING ON THE TWO-DIMENSIONAL TORUS

Our goal here is to develop a routing algorithm with a smaller leading constant for the two dimensional torus. Two advantages are achieved by utilizing the torus connections.

1) On the torus, the packets are colored by four colors rather than by two as in the mesh algorithm, thus there are four independent routing problems. This is not possible on the mesh because not all processors have four neighbors, which is crucial for vibrating the packets.

2) The torus has faster sorting algorithms than the mesh. These algorithms use the wrap around edges.

Denote by $T_n$ the n x n torus. In the rest of this section it is assumed that n is divisible by 4. We fix an arbitrary point as the origin and number the rows and the columns consecutively from 0 to n - 1. Thus $T_n$ may be viewed as $M_n$ with additional wrap around links.

Our result is:

THEOREM 4.1. There is a hot potato algorithm that can route any (partial) permutation on $T_n$ in 4n + o(n) steps.

For the proof we need the following coloring of $T_n$. We color each point (c, r) by (c mod 2, r mod 2). That is, we partition the points of $T_n$ into four classes by the parity of the coordinates. Observe that this refines the natural partition of the vertices (as $T_n$ with even n is bipartite). We refer to the standard partition of $T_n$ into the two color classes as coloring by black and white. Our coloring further partitions each class into two which we refer to as black (white) type I and black (white) type II. We note also that each row and each column contains only two color classes, one which is black and the other white. As we know that white and black packets never meet, it is sufficient to describe the whole algorithm for black packets only while keeping in mind that white packets do the same simultaneously. Thus we have type I packets with both coordinates even and type II packets with both coordinates odd (these are all black packets). We note that each of the color classes can be viewed as isomorphic to $T_{n/2}$ with distance 2 between adjacent neighbors.

A column-major (row-major) order on $T_n$ is a column-major (row-major) order on the mesh $M_n$ that is obtained from $T_n$ by disregarding the wrap around edges.

Theorem 4.1 directly follows from the two theorems below.

THEOREM 4.2. Assume a total order on the packets of $T_n$. Then there is a hot potato algorithm that can sort simultaneously all the four color classes of $T_n$, the type I packets in column-major order and the type II packets in row-major order, in at most 2n + o(n) steps.

Assume a 1-1 mapping (partial permutation of the points) is defined on the packets of $T_n$. Let $\le_{I}$ be the total order on the packets of type I that is defined by the lexicographic order (row, column) of their destination. Let $\le_{II}$ be the total order on the packets of type II that is defined by the lexicographic order (column, row) of their destination.

THEOREM 4.3. Let P be a (partial) permutation. Assume that the packets of each of the four color classes of $T_n$ are sorted, the type I packets in row-major, and the type II packets in column-major order according to the order $\le_{I}$ and $\le_{II}$, respectively, in the corresponding subtorus $T_{n/2}$. Then a hot potato routing of P can be completed in at most 2n + 2 steps.

PROOF (OF THEOREM 4.2). We adopt a "compare and exchange" sorting algorithm for $T_m$, m = n/2, to simultaneously sort the packets of the four classes in a hot-potato manner. We first describe the "compare and exchange" algorithm. The algorithm is hinted at in [15]. As we don't know of an explicit reference we describe it in detail.
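Since the sorting described next runs on each of the four color classes separately, it may help to make the coloring of the torus concrete first. The sketch below is illustrative only: the function names are ours, black points are taken to be those whose two coordinates have equal parity (consistent with type I being both even and type II both odd), and no type label is assigned to white points because the text does not fix one.

```python
def color(c, r):
    """Color class of point (c, r): the pair of coordinate parities."""
    return (c % 2, r % 2)


def is_black(c, r):
    # Both-even and both-odd points are the black ones, so "black" is taken
    # here to mean that the two coordinates have equal parity.
    return c % 2 == r % 2


def black_type(c, r):
    """Return 'I' (both even), 'II' (both odd), or None for a white point."""
    if not is_black(c, r):
        return None
    return "I" if c % 2 == 0 else "II"


def torus_neighbors(c, r, n):
    """The four neighbors of (c, r) on the n x n torus, wrap-around included."""
    return [((c + 1) % n, r), ((c - 1) % n, r),
            (c, (r + 1) % n), (c, (r - 1) % n)]


def subtorus_coords(c, r):
    """Position of (c, r) inside its own color class, viewed as a torus of side n/2."""
    return (c // 2, r // 2)
```

With n = 8, every row meets exactly two color classes, one black and one white, and two steps along a row of the full torus correspond to one step in the smaller torus formed by a single color class.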

Let k be a natural number that divides N. We start by describing an abstract merge algorithm that merges k sorted lists of size N/k each into a sorted list of size N. The algorithm is a generalization of Batcher's algorithm [3].

1) Concatenate (in an arbitrary order) the k sorted lists into a single list L, preserving the order within each list. Unshuffle L into k new sets of equal size: the rth set (for each $0 \le r < k$) is $\{a_s \mid s = kl + r,\ l = 0, 1, \ldots, N/k - 1\}$, where $a_s$ is the sth element of the combined list L.

2) Sort each of the k new sets into a sorted list.

3) Shuffle the k lists into a single list L' by reversing the unshuffle operation of Step 1 above. Formally, the tth element $a_t$, $0 \le t \le N - 1$, in L' is the lth element of the rth list, where t = lk + r.

4) Perform $k^2$ steps of odd-even transposition sort on the list L'.


CLAIM 4.1. The algorithm above outputs a sorted list.

The proof of the claim can be obtained easily by using the 0-1 principle. We omit further details here.
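The four steps of the abstract k-way merge can be tried out directly. The following is a sequential sketch under our own assumptions, not the hot-potato implementation: the names `odd_even_transposition` and `kway_merge` are ours, Step 2 uses Python's built-in `sorted` in place of a recursive or mesh sort, and the input lists are assumed to have equal length.

```python
def odd_even_transposition(values, rounds):
    """Run the given number of rounds of odd-even transposition sort."""
    a = list(values)
    for t in range(rounds):
        # Even rounds compare pairs (0,1), (2,3), ...; odd rounds (1,2), (3,4), ...
        for i in range(t % 2, len(a) - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


def kway_merge(sorted_lists):
    """Merge k equal-length sorted lists following Steps 1-4 of the abstract algorithm."""
    k = len(sorted_lists)
    # Step 1: concatenate, then unshuffle; set r holds the elements a_s with s = k*l + r.
    combined = [x for lst in sorted_lists for x in lst]
    sets = [combined[r::k] for r in range(k)]
    # Step 2: sort each of the k sets (any sorting routine would do here).
    sets = [sorted(s) for s in sets]
    # Step 3: shuffle back; element l of list r goes to position t = l*k + r.
    shuffled = [None] * len(combined)
    for r, s in enumerate(sets):
        shuffled[r::k] = s
    # Step 4: k^2 rounds of odd-even transposition sort clean up the remainder.
    return odd_even_transposition(shuffled, k * k)
```

The intuition behind Claim 4.1 shows up in this form: for a 0-1 input, the k sets differ by at most k in their number of zeros, so after the shuffle only a window of at most $k^2$ positions can be out of order, and $k^2$ transposition rounds suffice to sort it.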

The abstract algorithm that is described above can be applied on $T_m$ to sort it in column-major order in the following way (sorting in row-major order can be done analogously, by switching columns to rows and vice versa). Choose $k = m^{2/5}$ (we assume w.l.o.g. that $m^{2/5}$ is a natural number). Divide $T_m$ into k submeshes of sides $m^{4/5} \times m^{4/5} = k^2 \times k^2$ by dividing each side of $T_m$ into $\sqrt{k}$ equal slices. Each of the submeshes will serve as a list in the abstract algorithm. Each of the submeshes is sorted in column-major order by any linear time algorithm for sorting on meshes in $\Theta(m^{4/5})$ steps. By the definition of Step 1, each submesh has to distribute its $k^2$ rows evenly among the k submeshes, k rows per submesh. This can be done by rearranging the rows of the submeshes in each vertical slice, and then sending them to the right submesh by routing horizontally. We use here the wrap around edges, hence this can be done in m steps. Step 3 of the merge is done as follows. Observe that each submesh is the only source of precisely $k\sqrt{k}$ rows, k rows per slice. Thus by rearranging elements in each submesh we bring all the elements that are destined to the k rows of a certain slice into $k\sqrt{k}$ rows. Then the shuffling is completed by unshuffling the rows in each vertical slice and then shuffling horizontally inside each horizontal slice. This brings each submesh row to its place or vertically shifted in its submesh. In the latter case each packet moves vertically inside its submesh to its place. This whole operation is done in $m + O(k^2)$ steps. Step 4 of the merge is done by odd-even transposition sort in $2k^2$ steps using the wrap around edges. Thus the whole algorithm takes $2m + O(k^2) = 2m + o(m)$ steps.
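The step count above can be checked term by term. The accounting below reads the choice of k as $k = m^{2/5}$, which is what the submesh side $m^{4/5} = k^2$ implies; grouping the costs under braces is our own presentation, not the authors'.

```latex
\begin{align*}
k &= m^{2/5}, \qquad k^2 = m^{4/5}, \qquad \sqrt{k} = m^{1/5}, \\
\#\text{submeshes} &= \left(\frac{m}{k^2}\right)^{2} = \left(m^{1/5}\right)^{2} = m^{2/5} = k, \\
T_{\text{sort}}(m) &= \underbrace{\Theta(m^{4/5})}_{\text{sorting submeshes}}
  + \underbrace{m}_{\text{Step 1}}
  + \underbrace{m + O(k^2)}_{\text{Step 3}}
  + \underbrace{2k^2}_{\text{Step 4}}
  = 2m + O(m^{4/5}) = 2m + o(m).
\end{align*}
```

The two linear terms come from the unshuffle and the shuffle; everything else is of order $m^{4/5}$ and is absorbed in the o(m) term.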

Returning to the hot potato routing algorithm, we now apply the above algorithm on $T_n$ to sort simultaneously each of the four color classes (in the hot-potato manner) in the correct orders as required. As was remarked before, each of the color classes can be viewed as being arranged on $T_m$ with m = n/2 and with distance two between adjacent neighbors. We choose k as before and sort the four color classes of each block simultaneously as required, in parallel, in $\Theta(m^{4/5})$ steps in a hot-potato manner. This can be done for example by the algorithm described in Section II, which takes linear time relative to the sides of the mesh. One only needs to note that this sorting can be done simultaneously on each of the classes and along different directions: along columns for type I and along rows for type II. We leave these details for the reader. The unshuffling can be done simultaneously for each type: A type I packet that reaches its row vibrates horizontally until the end of the n/2 steps. The type II packets do the analog along rows. As the two types (of black packets) are on disjoint sets of columns and disjoint sets of rows there is no possible conflict. Then a type I packet travels horizontally to its column and vibrates vertically till the end of the next n/2 steps (likewise for the type II packets). The shuffle is done similarly (with an additional phase of $\Theta(m^{4/5})$ steps for rearranging within blocks and for vertical corrections in the end) and takes altogether n + o(n) steps. Thus the whole algorithm takes 2n + o(n) steps. □

PROOF (OF THEOREM 4.3). As in the mesh case it is easy to see, that for each row at most two (black) packets of type I have the same destination column. Similarly at most two packets of type I1 on any column have the same destination row. Furthermore, after sorting there is at most one packet at every point in the torus. This motivates the following three stage “unpacking” algorithm:

1) For n steps each packet of type I travels horizontally to its column destination. This is done without using the wrap around edges (similar to Phase I of Algorithm MESH-UNPACK in the mesh). Once a packet reaches its column destination it vibrates in the following way: If it reaches that point after an even number of steps it vibrates vertically (the first of the two possible packets upwards and the other downwards). If it reaches that point after an odd number of steps it vibrates horizontally. Note that the vibration does not obstruct other packets from passing through, for the same reason as in the mesh case. An analogous thing is done for the type II packets along columns. Note also that type I and type II packets do not interfere with each other. Observe that after n steps (an even number), each type I point may contain up to four different type I packets: up to two that are on the right column, and up to two that are on adjacent columns and are in the middle of their vibration. Such a point cannot contain any type II packets. An analogous statement is true for type II points and packets.

2) Now consider the (black) points (i, j) (ith row, jth column) of type I and (i + 1 mod n, j + 1 mod n) of type II. Node (i, j) may contain up to one packet p that is destined to column j + 1 mod n and (i + 1 mod n, j + 1 mod n) may contain up to one packet q that is destined to row i. Assume w.l.o.g. that they both exist. Since p vibrates horizontally and q vibrates vertically, they meet at the middle of their vibration at (i, j + 1 mod n). Node (i, j + 1 mod n) switches p and q at step n + 1, so that at the end of that step q reaches (i, j) and p reaches (i + 1 mod n, j + 1 mod n). Moreover, q now vibrates horizontally in its destination row, while p vibrates vertically in its destination column. The process described above can be done for all $0 \le i, j \le n - 1$ and for all "directions" simultaneously. Hence at step n + 2 all packets are either in their destination columns or their destination rows. Furthermore, at most two packets at each node are not in their destination row and at most two are not in their destination column.

3) For the next n steps each packet moves horizontally to its destination if it is on its destination row, or vertically in the other case. Initially a processor that contains two packets that are on their destination rows (columns), sends the two packets horizontally (vertically) in the two opposite directions. These directions are now fixed, each packet travels along this initial direction until it finally reaches its destination.

ALGORITHM TORUS-UNPACK.


Algorithm TORUS-UNPACK routes all packets to their destinations in at most 2n + 2 steps. This completes the proof of Theorem 4.3, and the proof of Theorem 4.1.
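The step budget behind these bounds is simple arithmetic. The breakdown below assumes that stage 2 occupies exactly steps n + 1 and n + 2, which is our reading of the description of that stage.

```latex
\begin{align*}
\underbrace{n}_{\text{stage 1}}
  + \underbrace{2}_{\text{stage 2 (steps } n+1,\ n+2\text{)}}
  + \underbrace{n}_{\text{stage 3}} &= 2n + 2, \\
\underbrace{2n + o(n)}_{\text{sorting, Theorem 4.2}}
  + \underbrace{2n + 2}_{\text{unpacking, Theorem 4.3}} &= 4n + o(n).
\end{align*}
```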

V. CONCLUSIONS AND OPEN PROBLEMS

We gave the first deterministic hot potato routing algorithms that are asymptotically optimal for the two-dimensional grid and torus. Moreover, the constants achieved come close to the best known store-and-forward algorithms. For the n-dimensional Boolean cube we gave an $O(n^2)$ hot potato algorithm. This is better than was known before; however, it is not asymptotically optimal due to the relatively inefficient sorting algorithm. It might be possible to adopt the faster sorting algorithm of Plaxton [19] in order to improve our result.

Our methods generalize also to the d-dimensional mesh with $n^d$ nodes (with recursive "Batcher-like" sorting). This gives an $O(d^2 n \log^2 n)$ hot potato routing algorithm which will not be described here.

It is still open to improve the constants in the two-dimensional grid case, and the asymptotics in the higher dimensional cases. It is a major goal to devise an algorithm with performance related to the maximal routing distance rather than to the diameter of the network. Related results concerning many-to-one routing problems were given on the hypercube [11] and on d-dimensional grids [5].

ACKNOWLEDGMENTS

We wish to thank Shai Halevi for helping us with enlightening remarks, and Marc Newman and Amir Ben-Dor, who worked with us on hot-potato routing.

This work was supported in part by the French-Israeli grant for cooperation in computer science, and by a grant from the Israeli Ministry of Science.

REFERENCES

[1] A.S. Acampora and S.I.A. Shah, "Multihop lightwave networks: A comparison of store-and-forward and hot-potato routing," IEEE INFOCOM, pp. 10-19, 1991.

[2] P. Baran, "On distributed communications networks," IEEE Trans. Comm., pp. 1-9, 1964.

[3] K. Batcher, "Sorting networks and their applications," Proc. AFIPS Spring Joint Computing Conf., vol. 32, pp. 307-314, 1968.

[4] A. Borodin and J.E. Hopcroft, "Routing, merging, and sorting on parallel models of computation," J. Computer and System Sciences, vol. 31, pp. 130-145, 1985.

[5] A. Ben-Dor, S. Halevi, and A. Schuster, "On greedy hot-potato routing," Technical Report LPCR #9204, CS Dept., Technion, Jan. 1993.

[6] A. Bar-Noy, P. Raghavan, B. Schieber, and H. Tamaki, "Fast deflection routing for packets and worms," Proc. 12th ACM Symp. Principles Distributed Computing, 1993.

[7] U. Feige and P. Raghavan, "Exact analysis of hot-potato routing," Proc. IEEE Symp. Foundations of Computer Science, Nov. 1992.

[8] A.G. Greenberg and J. Goodman, "Sharp approximate models of adaptive routing in mesh networks," Teletraffic Analysis and Computer Performance Evaluation, O.J. Boxma, J.W. Cohen, and H.C. Tijms, eds. Amsterdam: Elsevier, 1986, revised 1988.

[9] A.G. Greenberg and B. Hajek, "Deflection routing in hypercube networks," IEEE Trans. Comm., June 1992.

[10] N. Haberman, "Parallel neighbor-sort (or the glory of the induction principle)," Technical Report AD-759248, Nat'l Technical Information Service, U.S. Dept. of Commerce, 1972.

[11] B. Hajek, "Bounds on evacuation time for deflection routing," Distributed Computing, vol. 5, pp. 1-6, 1991.

[12] W.D. Hillis, The Connection Machine. Cambridge, Mass.: MIT Press, 1985.

[13] C. Kaklamanis, D. Krizanc, and S. Rao, "Hot-potato routing on processor arrays," Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 273-282, 1993.

[14] M. Kaufmann, H. Lauer, and H. Schroder, "Fast deterministic hot-potato routing on processor arrays," ISAAC, 1994.

[15] F.T. Leighton, Introduction to Parallel Algorithms and Architectures. San Mateo, Calif.: Morgan Kaufmann, 1991.

[16] D.H. Lawrie and D.A. Padua, "Analysis of message switching with shuffle-exchanges in multi-processors," Interconnection Networks, 1984.

[17] N.F. Maxemchuk, "Comparison of deflection and store and forward techniques in the Manhattan street and shuffle exchange networks," IEEE INFOCOM, pp. 800-809, 1989.

[18] I. Newman and A. Schuster, "Hot-potato algorithms for permutation routing," Technical Report LPCR #9201, CS Dept., Technion, Nov. 1992.

[19] G. Plaxton, "Load balancing, selection and sorting on the hypercube," Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 64-73, June 1989.

[20] R. Prager, "An algorithm for routing in hypercube networks," PhD thesis, Univ. of Toronto, 1986.

[21] C.L. Seitz, "The Caltech Mosaic C: An experimental, fine-grain multicomputer," Fourth Ann. ACM Symp. Parallel Algorithms and Architectures, keynote speech, San Diego, June 1992.

[22] B. Smith, "Architecture and applications of the HEP multiprocessor computer system," Proc. (SPIE) Real Time Signal Processing IV, pp. 241-248, 1981.

[23] C. Schnorr and A. Shamir, "An optimal sorting algorithm for mesh connected computers," Proc. 18th Symp. Theory of Computing, pp. 255-263, 1986.

[24] T. Szymanski, "An analysis of hot potato routing in a fiber optic packet switched hypercube," Proc. IEEE INFOCOM, pp. 918-925, 1990.

[25] Z. Zhang and A.S. Acampora, "Performance analysis of multihop lightwave networks with hot potato routing and distance age priorities," Proc. IEEE INFOCOM, pp. 1,012-1,021, 1991.

Ilan Newman received his PhD in computer science from the Hebrew University of Jerusalem in 1991. He is currently a lecturer at Haifa University, Israel. His main interests are computational complexity, combinatorial and routing algorithms, optical computation and communication, and circuit complexity.

Assaf Schuster received his BA, MA, and PhD degrees in computer science from the Hebrew University of Jerusalem, the latter one in 1991. He is currently a lecturer at the Technion (Israel Institute of Technology). His main interests include networks and routing algorithms, parallel and distributed computation, optical computation and communication, and dynamically reconfiguring networks.
