Scheduling Pipelined Communication in Distributed Memory Multiprocessors for Real-time Applications

Shridhar B. Shukla

Code EC/Sh, Dept. of Elect. & Computer Engineering

Naval Postgraduate School

Monterey, CA 93943-5000

and

Dharma P. Agrawal

Computer Systems Laboratory

Dept. of Elect. & Computer Engineering

North Carolina State University

Raleigh, NC 27695-7911

Abstract: This paper investigates communication in distributed memory multiprocessors to support task-level parallelism for real-time applications. It is shown that wormhole routing, used in second generation multicomputers, does not support task-level pipelining because its oblivious contention resolution leads to output inconsistency, in which a constant throughput is not guaranteed. We propose scheduled routing, which guarantees constant throughput by integrating task specifications with flow-control. In this routing technique, communication processors provide explicit flow-control by independently executing switching schedules computed at compile-time. It is deadlock-free, contention-free, does not load the intermediate node memory, and makes use of the multiple equivalent paths between non-adjacent nodes. The resource allocation and scheduling problems resulting from such routing are formulated and related implementation issues are analyzed. A comparison with wormhole routing for various generalized hypercubes and tori shows that scheduled routing is effective in providing a constant throughput when wormhole routing does not and enables pipelining at higher input arrival rates.

1 Introduction

Mapping an application on multicomputers involves partitioning, task allocation, node scheduling, and message routing. These steps are interdependent. For example, the locations of the sources and destinations of messages, which determine the possible paths, are fixed by task allocation.

Shridhar Shukla is supported in part by the Research Initiation Program of NPS and Dharma Agrawal is supported in part by NSF grant No. MIP-8912767.


Similarly, node scheduling affects the instants of message injection, and thus affects channel contention. In spite of such interdependence, which determines the final system performance, researchers have typically investigated these steps separately. For example, routing algorithms are analyzed to optimize the network latency and not the system-level performance [Dal87, Nga89]. The traffic generated by a node and its frequency of communication with others are assumed identical for all nodes. While many scheduling and allocation strategies make use of deterministic models [Lo88], routing algorithms are studied for statistically generated traffic patterns. Although such separate considerations may be justified to make the algorithms amenable to analysis, they are unjustified in real-time situations where system-level performance is of primary importance.

When input data arrives periodically and a sequence of processing steps must be carried out on each input set, as in artificial vision, task-level pipelining has been shown to be effective [Agr86]. It refers to periodic invocation of a set of tasks. For its successful implementation, the mapping procedure must ensure that the multicomputer supports a processing rate equal to the input arrival rate. However, wormhole routing (WR), used in second generation multicomputers such as the Intel iPSC/2 and Symult Series 2010, is oblivious to the application requirements and hence cannot guarantee a constant throughput for task-level pipelining.

We propose scheduled routing (SR), which integrates task specifications with flow-control by executing node switching schedules computed at compile-time and guarantees constant throughput. The resource allocation and scheduling problems resulting from such routing are formulated and related implementation issues are analyzed. A comparison with WR for various generalized hypercubes (GHCs) [Agr86] and tori shows that SR is effective in providing a constant throughput when WR does not and enables pipelining at higher input arrival rates. The organization of this paper is as follows. A graphical model of computation for task-level pipelining is described in Section 2. The effect of WR on throughput and the conditions that lead to output inconsistency are described in Section 3. Communication requirements for pipelining a set of tasks and an introduction to SR are presented in Section 4. Computation of an explicit flow-control schedule to implement SR is presented in Section 5. Performance results for 64 node GHCs and tori are presented in Section 6. The paper concludes with remarks on SR and suggestions for further research.

2 Task-level Pipelining

A task-flow graph (TFG) is a directed acyclic graph whose vertices represent tasks and whose directed edges represent messages between tasks. Each task is a set of operations executed sequentially. A message consists of data or information transferred between tasks. A TFG is specified as a 2-tuple of sets {S_T, S_M}. S_T = {T_1, T_2, ..., T_Nt} is the set of N_t tasks. Each T_i is associated with C_i, the number of operations to be performed to execute T_i. S_M = {M_1, M_2, ..., M_Nm} is the set of N_m messages between tasks. Message M_i contains m_i bytes and is sent from task T_is to task T_id. T_id cannot be started before M_i reaches it. S_M defines a precedence relationship over S_T such that T_j < T_k if there is a path from T_j to T_k. This model treats identical messages from a task as distinct at the application level if they are destined for different tasks. In task-level pipelining, the TFG is executed periodically and a task sends messages to other tasks at the end of, and not during, its execution. TFG execution begins when the tasks with no preceding tasks, referred to as input tasks, begin upon receipt of an external signal such as input arrival. It is completed when all tasks that do not precede any other, called output tasks, complete. The start of a TFG execution by input arrival is called an invocation. Depending upon the interval between the starts of successive invocations, the task sizes, and the precedence relationships among tasks, different tasks execute in different invocations at the same time.
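The model above maps directly onto a small data structure. The following is a minimal Python sketch, not taken from the paper; the class and field names (Task, Message, TaskFlowGraph) are illustrative.

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    ops: int                  # C_i: number of operations executed sequentially

@dataclass
class Message:
    src: str                  # sending task T_is
    dst: str                  # receiving task T_id; it cannot start before this message arrives
    size_bytes: int           # m_i

@dataclass
class TaskFlowGraph:          # the 2-tuple {S_T, S_M}
    tasks: dict = field(default_factory=dict)     # name -> Task  (S_T)
    messages: list = field(default_factory=list)  # list of Message (S_M)

    def input_tasks(self):
        # tasks with no preceding task; they start on the external input signal
        dsts = {m.dst for m in self.messages}
        return [t for t in self.tasks if t not in dsts]

    def output_tasks(self):
        # tasks that precede no other task; an invocation completes when they all finish
        srcs = {m.src for m in self.messages}
        return [t for t in self.tasks if t not in srcs]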

Assume a link bandwidth B and a processing speed s_i for task T_i. Let τ_c be the processing time for the longest task and τ_m be the transmission time for the longest message. In general, partitioning techniques attempt to minimize the communication overhead [AJ88]. In the large-grain parallelism represented by TFGs, it is reasonable to assume that the longest task is longer than the longest message. Maximum throughput is obtained when the input arrival period τ_in = τ_c. The minimum latency for an invocation can be computed in terms of the critical path length. Consider a sequence S_CP = {T_i1, M_j1, T_i2, M_j2, ..., T_iL} of tasks and messages such that T_i1 is an input task, T_iL is an output task, and T_il < T_il+1 for 1 ≤ l ≤ (L-1). The sum, Λ, of the task execution and message transfer times in S_CP is called the critical path length of the TFG if it is maximum among all such sequences, and S_CP is called the critical path. Note that there may be several critical paths in a TFG. The minimum time for completion of an invocation is Λ. Let t_f^j(.) denote the time at completion, in the j-th TFG invocation, of the task or message transmission that appears in the parentheses. Similarly, let t_s^j(.) denote the start of such events in the j-th invocation. Since all input tasks start executing simultaneously, the latency for invocation j, λ_j, is computed as max(t_f^j(T_i)) - t_s^j(T_k), where T_k is any input task, T_i is any output task, and the maximum is taken over all output tasks. If δ_out^j is the interval between completions of the j-th and (j-1)-th invocations, then

δ_out^j = τ_in + λ_j - λ_{j-1}.

If the invocation period is τ_in, the TFG is pipelined successfully if

δ_out^j = τ_in for all j = 1, 2, 3, ...   (1)

Obviously, τ_in ≥ τ_c; otherwise, there will be an infinite accumulation at the input of the slowest task. Eq. (1) states that the interval between the arrival of two inputs must be the same as that between the generation of two outputs for any pair of successive invocations. If this condition is violated, the TFG pipelining is defined to have output inconsistency (OI).
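Condition (1) is easy to check once per-invocation latencies are available. A minimal Python sketch, assuming the latencies λ_j are obtained elsewhere (e.g., from a simulation trace); the function names and the tolerance are illustrative.

def output_intervals(tau_in, latencies):
    # delta_out^j = tau_in + lambda_j - lambda_{j-1}, j = 1, 2, ...
    return [tau_in + latencies[j] - latencies[j - 1]
            for j in range(1, len(latencies))]

def is_output_consistent(tau_in, latencies, tol=1e-9):
    # Eq. (1): every output interval must equal the input period
    return all(abs(d - tau_in) <= tol for d in output_intervals(tau_in, latencies))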

We have considered an application from the domain of artificial vision, since task-level pipelining is effective in obtaining real-time vision [Agr86]. As a specific example, the DARPA vision benchmark (DVB) for evaluating parallel architectures, which performs model-based object recognition of a hypothetical object, has been used [WRHR88]. If the image data arrives periodically, a recognition result must be computed periodically. A TFG for n object-models is shown in Fig. 1 [Shu90]. (The number of operations is estimated from the data supplied with the sequential implementation and the data transferred is estimated from the size of the data structures.)

3 Inconsistency with Wormhole Routing

Wormhole routing is a blocking variant of cut-through routing and imposes deterministic path selection via its routing function to guarantee freedom from deadlocks. When two messages require a link simultaneously, its flow-control hardware resolves contention using a first-come-first-served (FCFS) policy [Dal87]. For task-level pipelining, the latency of each invocation is determined by the delays encountered by messages in that invocation. Given a task allocation, the predetermined routing function determines the shared links and the associated messages. Since messages of different invocations co-exist when task-level pipelining is implemented, this service policy results in delays that are unequal over successive invocations, causing OI. OI results if the conditions given below are true with respect to the routing function, the TFG messages, and the invocation period.

Claim: Let M_1 from T_1s to T_1d and M_2 from T_2s to T_2d be two messages such that T_1d < T_2s and all four tasks are in the critical path. If the paths assigned to M_1 and M_2 have a link in common and τ_in is such that t_s^0(M_2) < t_s^0(M_1) + τ_in < t_f^0(M_2), then output inconsistency is present.

Proof: Since M_2 of invocation 0 is available for transmission before M_1 of invocation 1, it gains control of the shared link. Transmission of M_1 in invocation 1 is blocked until M_2 relinquishes the shared link. The delay in starting M_1 in invocation 1 is δ_s^1 = t_f^0(M_2) - t_s^0(M_1) - τ_in. (An implicit assumption here is that M_2 continues to use all its links until it is received at the destination. This is justified since the transmission time is insensitive to the distance after path setup and messages in the large-grain parallelism represented by TFGs are sufficiently long for the transmission delay to dominate the propagation delay.) Thus, if the delay δ_s^1 in the arrival of M_1 at T_1d postpones the start of T_1d in invocation 1, it will also affect the start of T_2s in invocation 1 since T_1d < T_2s. Therefore, t_s^1(M_2) = τ_in + t_s^0(M_2) + δ_s^1. Thus, the output of invocation 1 takes δ_s^1 longer to appear than the output of invocation 0. If δ_s^1 is such that t_s^1(M_2) > 2τ_in + t_s^0(M_1), then message M_1 of invocation 2 is available for transmission before M_2 of invocation 1. Thus, t_s^2(M_1) < t_s^1(M_2) and M_1 is not delayed. Therefore, the start of T_1d is not delayed in invocation 2. This cycle repeats and, in alternate invocations, the start of T_1d is delayed. As a result, successive outputs are generated after unequal intervals. If δ_s^1 is such that t_s^1(M_2) < 2τ_in + t_s^0(M_1), then t_s^j(M_1) becomes less than t_s^{j-1}(M_2) only after more than one invocation. Eventually, when it does, M_1 is not delayed and the cycle repeats. This shows that, depending upon the TFG and multicomputer parameters, the FCFS flow-control in WR leads to output inconsistency when used for task-level pipelining. ■
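The alternating delays in the proof can be reproduced with a toy FCFS model of two messages sharing one link. This is only an illustrative Python sketch under the stated assumptions (a single shared link, rel1 <= rel2, and delay propagation from M_1's arrival to M_2's release via T_1d < T_2s); it is not the simulator used in Section 6.

def simulate_pair(tau_in, rel1, rel2, len1, len2, invocations=10):
    """Start delays of M_1 in each invocation when M_1 and M_2 share one link
    under FCFS; M_2's release in invocation j is pushed back by the delay M_1
    suffers in that invocation, because T_1d precedes T_2s."""
    link_free = 0.0
    pending_m2 = []                 # release times of M_2 instances not yet sent
    delays = []
    for j in range(invocations):
        r1 = rel1 + j * tau_in
        # serve any M_2 released before this M_1 (first come, first served)
        while pending_m2 and pending_m2[0] <= r1:
            link_free = max(pending_m2.pop(0), link_free) + len2
        s1 = max(r1, link_free)     # M_1 waits if the link is still held
        link_free = s1 + len1
        delays.append(s1 - r1)
        pending_m2.append(rel2 + j * tau_in + (s1 - r1))   # delay reaches T_2s
    return delays

# Example with hypothetical numbers: the delays alternate between zero and
# non-zero, so successive outputs are generated after unequal intervals.
# simulate_pair(tau_in=10, rel1=0, rel2=7, len1=2, len2=7)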

Although we have considered only a pair of interacting messages above, when a complex TFG is mapped on a large network, several messages may interact to cause OI. Even when path selection is sensitive to the network load and makes use of the multiple equivalent paths in the network, as in adaptive cut-through routing [Nga89], OI may result. For example, consider three messages M_1 from T_1s to T_1d, M_2 from T_2s (= T_1s) to T_2d, and M_3 from T_3s to T_3d such that T_2d < T_3s, and T_2d, T_3s, T_3d are in the critical path. Assume that task allocation places messages M_1 and M_3 between adjacent nodes and M_2 between non-adjacent nodes. If message M_1 blocks the first link in one of the paths of M_2 and the adaptive flow-control commits it to a path that contains a link used by M_3, an argument similar to the one above can be given to show that OI results.

4 Scheduled Routing

In WR, messages which are less important to the completion of the current invocation may receive priority over others. SR prevents this by exercising explicit flow-control, which guarantees constant throughput. In this technique, path selection and flow-control are based on the TFG communication requirements derived using Eq. (1). For maximum throughput (τ_in = τ_c), each task is executed once every τ_c time units and it must receive messages from its preceding tasks at the same rate. Since each task begins as soon as the messages from all preceding tasks are received, in the 0-th invocation M_i is available for transmission at r_i^0 = t_f^0(T_is) and must complete transmission by d_i^0 = r_i^0 + τ_c. Pipelining of periodically arriving messages is possible if each message M_i is assigned a release-time r_i = t_f^0(T_is) and a deadline d_i = r_i + τ_c, both taken modulo τ_in. M_i must be transmitted in the interval [r_i, d_i] if d_i > r_i, or in [0, d_i] and [r_i, τ_in] when d_i < r_i, to guarantee its deadline when the TFG is pipelined with period τ_in ≥ τ_c. By allowing each message transmission to be as long as the longest task, latency may increase, but the maximum possible throughput remains the same. Since all messages are generated with the same frequency, these time bounds enable consideration of all successively generated messages between any pair of communicating processors by observing only a single time frame [0, τ_in]. Therefore, multiple messages in transit between a pair of processors need not be taken into account explicitly. These time bounds lay the foundation for SR and are used with path selection and flow-control by computing communication schedules to be executed independently by the communication processors (CPs) at the multicomputer nodes.
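Under the reading above, the time bounds follow directly from the finish times of the sending tasks in invocation 0. A minimal Python sketch; the modulo-τ_in wrap mirrors the split interval [0, d_i] and [r_i, τ_in] described in the text, and the function names are illustrative.

def message_time_bounds(src_finish_time, tau_c, tau_in):
    # release-time and deadline of a message whose sending task finishes at
    # src_finish_time in invocation 0, both reduced modulo tau_in
    r = src_finish_time % tau_in
    d = (src_finish_time + tau_c) % tau_in
    return r, d

def active_windows(r, d, tau_in):
    # sub-intervals of [0, tau_in] in which the message may be transmitted
    if d > r:
        return [(r, d)]
    return [(0.0, d), (r, tau_in)]      # wrapped case d <= r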

4.1 Node Switching Schedules

SR transmits a message only when there is a clear path between its source and destination. Since each message has an unobstructed path, the run-time flow-control overhead evaporates and deadlocks become a non-issue. Such transmission is achieved by compile-time determination of a path for M_i in terms of its source and destination addresses and its time bounds. Consider a communication schedule, Ω, that specifies a node switching schedule, ω_i, for each node N_i of the multicomputer. ω_i specifies the connections between input and output channels to be set up in the CP at node N_i over [0, τ_in]. The node switching schedules are such that execution of ω_i at N_i results in a clear path for each message for the required duration within its timing constraints.

Fig. 2 shows a functional block diagram of the CP. The controller sets up the crossbar and multiplexer according to the switching schedule. If the topology has degree n, each CP has n + 1 input and n + 1 output connections, n of which are to and from adjacent nodes. It also connects to the input/output buffers of the local application processor (AP). It has separate buffers for each input and output channel, enabling it to simultaneously transmit and receive messages on separate channels for its AP. A message coming from a neighboring CP, however, can be sent on only one outgoing channel. By simultaneously executing these schedules at all CPs, clear paths are established for delivering each message within its timing constraints.

Computation and execution of the ω_i's assumes the basic time unit to be the time for a single packet transmission and ignores the propagation delay in comparison with the transmission delay. Since we are exploiting task-level parallelism, the message size is assumed to be large enough for the transmission delay to dominate. The switching time of each CP is assumed to be negligible. The network links support bidirectional, half-duplex traffic and the time taken to reverse the direction of traffic on a link is also assumed negligible. Every packet of a message travels along the same path.

SR exploits the multiple equivalent paths between non-adjacent nodes so that none of the network links is overloaded in any part of [0, τ_in]. Path assignment to messages should be such that the traffic is spread out evenly over the network and over time. When the time available for transmission of a message is greater than its length, it has a slack which can be used in computing the ω_i's. Thus, a message is allocated to different intervals of [0, τ_in] based upon its links and timing constraints. We refer to this problem of partitioning a message among sub-intervals of [0, τ_in] as the message-interval allocation problem. In addition, the simultaneous availability of all the links for a message must be ensured. Therefore, all the message allocations to a single interval are scheduled explicitly for transmission within the interval. We refer to this problem as the interval scheduling problem. Thus, the process of computing Ω must be carried out using the steps shown in Fig. 3. The possibility of feedback in this process is discussed in the last section.

5 Flow-control Scheduling

5.1 Path Assignment

An optimal path assignment is one that leads to a feasible Ω if one exists. An Ω is feasible if it does not require any link of the network to exceed its capacity and ensures that no link is required by two messages at the same time. To find such an assignment requires solution of the message-interval allocation and interval scheduling problems for each assignment. This is prohibitively expensive for any reasonable size of the network. It is possible, however, to eliminate certain path assignments by checking if they satisfy the necessary conditions for subsequent computation of a feasible Ω. Thus, the net utilization of each link over the invocation period and its intervals gives a criterion to accept or reject a path assignment.

Given the message time bounds, two different utilizations may be defined. The distinct r_i and d_i values of all messages divide [0, τ_in] into non-overlapping intervals. Let t_0 = 0 < t_1 < t_2 < ... < t_K = τ_in be the K distinct endpoints of the intervals in which messages are available for transmission. Interval [t_k-1, t_k] is denoted by Δ_k and a message is active in Δ_k if it is available for transmission during [t_k-1, t_k]. Let A = [a_ik] be an N_m x K matrix, called the message activity matrix, such that a_ik = 1 if M_i is active in Δ_k and 0 otherwise. For each message,

m_i/B ≤ Σ_k a_ik (t_k - t_k-1).   (2)

An equality in (2) indicates that M_i has no slack. A no-slack message utilizes the links in its path fully in its active intervals and no other message can use them during these intervals. On the other hand, if (2) is an inequality, the links used by M_i have room for carrying other messages during its active intervals. The net utilization over [0, τ_in] of the links used by messages with slack can be determined, but their utilization profile cannot be determined unless message-interval allocation is done. If there are N_l links in the topology, define an N_m x N_l path assignment matrix B = [b_ij] such that b_ij = 1 if message M_i uses link L_j and b_ij = 0 otherwise.

Definition 5.1 Link utilization, U_j^l, for link L_j is defined as the ratio of the sum of the transmission times of all messages carried by L_j to the sum of the lengths of all intervals in which at least one message is active on L_j. Thus, if Q_j is the set of values of the index k such that there exists a message M_i with a_ik b_ij ≠ 0,

U_j^l = (Σ_i b_ij m_i/B) / (Σ_{k in Q_j} (t_k - t_k-1)). ■

U_j^l ≤ 1 implies that the total transmission requirement of all the messages using link L_j does not exceed the total time in which it is used. It does not, however, imply the absence of hot-spots, characterized by a combination of no-slack messages using a link in the same interval. A path assignment must also eliminate such hot-spots in the network. Therefore, a spot utilization may be defined as below.

Definition 5.2 Spot utilization, U_jk^s, is defined as the number of no-slack messages using L_j in the interval [t_k-1, t_k]. Thus, U_jk^s = Σ_i a_ik b_ij taken over all M_i for which (2) is an equality. ■

The path assignment must be such that U_jk^s ≤ 1 for all spots. To increase the probability of finding a feasible Ω, path assignment minimizes

U = max_{j,k} (U_j^l, U_jk^s),

where the maximum is taken over all possible values of j and k. If U ≤ 1, SR can be attempted; otherwise, the TFG communication requirements exceed the link capacity.
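Given A, B, the message sizes, and the interval endpoints, both utilizations and the objective U can be computed directly. A Python sketch of Definitions 5.1 and 5.2 as reconstructed above; plain lists are used and the helper name is illustrative.

def utilizations(A, Bm, sizes, bandwidth, t):
    """A[i][k] = 1 if M_i is active in interval k; Bm[i][j] = 1 if M_i uses link j;
    sizes[i] = m_i in bytes; t = [t_0, ..., t_K] interval endpoints.
    Returns (link_util, spot_util, U)."""
    Nm, K, Nl = len(A), len(t) - 1, len(Bm[0])
    length = [t[k + 1] - t[k] for k in range(K)]
    dur = [sizes[i] / bandwidth for i in range(Nm)]           # transmission times
    no_slack = [abs(dur[i] - sum(A[i][k] * length[k] for k in range(K))) < 1e-9
                for i in range(Nm)]                            # equality in (2)

    link_util = []
    for j in range(Nl):
        Qj = [k for k in range(K) if any(A[i][k] and Bm[i][j] for i in range(Nm))]
        busy = sum(length[k] for k in Qj)
        demand = sum(dur[i] for i in range(Nm) if Bm[i][j])
        link_util.append(demand / busy if busy else 0.0)

    spot_util = [[sum(A[i][k] * Bm[i][j] for i in range(Nm) if no_slack[i])
                  for k in range(K)] for j in range(Nl)]

    U = max(link_util + [spot_util[j][k] for j in range(Nl) for k in range(K)])
    return link_util, spot_util, U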

Optimal Path Assignments

Since the number of alternative paths for a message increases very rapidly with the number of hops, a large number of distinct path assignments must be inspected to find the one with the lowest U. For example, if there are z messages with at least two alternative paths each, the total number of distinct path assignments exceeds 2^z. Therefore, we provide a heuristic for path assignment that minimizes the peak utilization.

The algorithm AssignPaths of Fig. 4 performs iterative improvement to reduce U. An assignment is improved in the following manner. If U is due to a hot-spot, say (L_j, Δ_k), all the messages that are active in Δ_k and use L_j are identified. When U is due to a link, all the messages that use it over any of the K intervals are identified. For each multi-hop message among these, all the alternative paths are considered and the one that causes the largest drop in U is identified. Then, the message-path combination (called a reroute in Fig. 4) that causes the largest drop among all multi-hop messages is selected and the message is reassigned to its new path. It may not be possible to change U by any reroute except to make the same U value appear on a different link and/or spot. In this case, the reroute that causes this change is made so that the algorithm moves to a different point in the link-interval space. This helps in reducing U when there are multiple appearances of the same value in different links and/or spots. The random assignment of paths at the end of each iteration helps the algorithm slide out of any local minima.
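The inner step of AssignPaths, trying every alternative path of a candidate message and keeping the one that lowers the peak most, can be sketched in Python on top of the utilizations() helper above. The alternative-path lists and matrix layout are assumptions made for illustration, not part of the paper.

def best_reroute(candidates, alt_paths, A, Bm, sizes, bandwidth, t):
    """candidates: indices of messages on the peak link/spot;
    alt_paths[i]: list of sets of link indices for M_i.
    Returns (message index, path, resulting peak U) with the lowest peak."""
    best = None
    for i in candidates:
        for path in alt_paths[i]:
            trial = [row[:] for row in Bm]            # copy the assignment
            trial[i] = [1 if j in path else 0 for j in range(len(Bm[0]))]
            _, _, U = utilizations(A, trial, sizes, bandwidth, t)
            if best is None or U < best[2]:
                best = (i, path, U)
    return best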

Performance of AssignPaths for the DARPA Vision Benchmark (DVB) TFG on 64 node GHCs and tori is shown in Figs. 5 and 6. A common method of assigning paths to messages between two nodes, which also yields deadlock-free routing in GHCs, is to proceed along the path realized by changing the source node address, starting from the lowest dimension or the least significant digit (LSD) and proceeding towards the highest one or the most significant digit (MSD), until it becomes the same as the destination address. Clearly, the utilization of links is uneven in LSD-to-MSD path selection and balanced in that given by algorithm AssignPaths. Note that the minimum possible peak utilization changes as the input period, τ_in, changes because the message time-bounds change. We have, therefore, plotted the improvement in peak utilization for various values of the network load, which is defined as τ_c/τ_in. Thus, when the TFG is pipelined with the maximum input arrival rate, the network load is maximum. We observe that the utilization achieved by AssignPaths is at least as low as that of the LSD-to-MSD routing.

5.2 Message-interval Allocation

Message-interval allocation can be performed on disjoint subsets of the TFG messages that are identified by the following definitions.

Definition 5.3 Two messages M_p and M_q are related if and only if either there exist a link L_i and an interval Δ_k such that b_pi = b_qi = 1 and a_pk = a_qk = 1, or there exists a message M_r which is related to both M_p and M_q. ■

Definition 5.4 A set of messages, S_R, is maximal if and only if, for all M_i, M_j in S_R, M_i and M_j are related, and there is no M_k outside S_R such that M_k and M_i are related. ■

The transitivity and closure of the relation between messages defined above partitions S_M into disjoint subsets. Let S_R^i denote the i-th maximal subset. The maximality of each subset ensures that S_R^i ∩ S_R^j = ∅ for i ≠ j. Such subsets of S_M can be obtained by performing row operations on an identity matrix and the matrices A and B.

Consider a maximal subset S_R with Q_n = {n_1, n_2, ...} as the indices of the messages in it. Also, let Q_l = {l_1, l_2, ...} and Q_k = {k_1, k_2, ...} be the indices of the links and intervals used by messages in S_R. Let x_hj denote the time for which M_nh is transmitted in Δ_kj, i.e., between [t_kj-1, t_kj]. A feasible allocation of each message to the intervals in which it is active is obtained by assigning values to x_hj (≥ 0) such that

Σ_{kj in Q_k} a_{nh,kj} x_hj = m_nh/B   for all n_h in Q_n,   (3)

Σ_{nh in Q_n} a_{nh,kj} b_{nh,li} x_hj ≤ t_kj - t_kj-1   for all l_i in Q_l, k_j in Q_k.   (4)

Constraints (3) require that the total allocation for each message equals its duration at the given bandwidth. Constraints (4) require that the total allocation of all messages to a link in each interval does not exceed the interval length. This allocation problem is analogous to scheduling periodic tasks on multiple processors [LM81]. The similarity between the two formulations is that the messages constitute periodic tasks and the links constitute processors. The difference is that a message may simultaneously require several links, but a task must be processed by exactly one processor at any time. If the set of constraints (3) and (4) is feasible, a suitable message-interval allocation can be obtained. The feasibility of such systems can be checked by standard integer programming techniques. The values of x_hj given by a feasible solution are then used to initialize an N_m x K message-interval allocation matrix, P = [p_ik], where p_ik gives the time for which M_i is transmitted in Δ_k.
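Feasibility of (3) and (4) can be checked with an off-the-shelf solver. The Python sketch below uses scipy.optimize.linprog on the continuous relaxation (the paper refers to standard integer programming techniques); the variable layout and argument names are illustrative.

import numpy as np
from scipy.optimize import linprog

def allocate_intervals(msgs, links_of, active, length, dur):
    """msgs: message indices of one maximal subset; links_of[h]: links used by M_h;
    active[h]: intervals in which M_h is active; length[k]: interval lengths;
    dur[h]: transmission time m_h/B.  Returns {(h, k): allocated time} or None."""
    var = [(h, k) for h in msgs for k in active[h]]            # one x_hj per active pair
    idx = {v: i for i, v in enumerate(var)}

    A_eq = np.zeros((len(msgs), len(var)))                     # constraints (3)
    b_eq = np.array([dur[h] for h in msgs], dtype=float)
    for row, h in enumerate(msgs):
        for k in active[h]:
            A_eq[row, idx[(h, k)]] = 1.0

    rows, rhs = [], []                                         # constraints (4)
    for (l, k) in sorted({(l, k) for h in msgs for l in links_of[h] for k in active[h]}):
        coeff = np.zeros(len(var))
        for h in msgs:
            if l in links_of[h] and k in active[h]:
                coeff[idx[(h, k)]] = 1.0
        rows.append(coeff)
        rhs.append(length[k])

    res = linprog(c=np.zeros(len(var)),
                  A_ub=np.array(rows) if rows else None,
                  b_ub=np.array(rhs) if rhs else None,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return {v: res.x[idx[v]] for v in var} if res.success else None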

5.3 Interval Scheduling

The outcome of message-interval allocation is information about how much of each message is transmitted in an interval. Explicit flow-control requires all the links assigned to a message to be available simultaneously so that a clear path from the source to the destination is established. Therefore, interval scheduling must be performed for each interval used by each maximal subset to construct Ω. This problem is one of preemptive scheduling of tasks on a set of identical processors where a task may require the simultaneous use of more than one processor. The set of links used in an interval can be considered as the set of processors and the messages with non-zero allocations can be considered as the tasks. An integer programming formulation for this scheduling problem has been described in [BDW86] and is adapted to this application as described below.

Let Q_n^k = {n_1, n_2, ...} be the indices of the messages in a maximal subset that have non-zero allocations in Δ_k. Let Q_l^k = {i_1, i_2, ...} be the indices of the links used by messages in Q_n^k. We first define a link feasible set of messages.

Definition 5.5 A set of messages with indices Q_lfs = {n_1, n_2, ...} is link feasible if there are no n_i, n_j in Q_lfs such that M_ni and M_nj use the same link. ■

Clearly, messages in a link feasible set can be transmitted simultaneously. Let N_lfs be the number of link feasible sets possible for Q_n^k. Associate a variable y_j with the j-th link feasible set, representing the time for which all the messages in it are transmitted simultaneously. Let Q_i be the set of indices of the link feasible sets of messages in Q_n^k which contain message M_i. Consider the integer programming problem of minimizing

Σ_{j=1}^{N_lfs} y_j   subject to   Σ_{j in Q_i} y_j = p_ik   for all i in Q_n^k.

If Σ_{j=1}^{N_lfs} y_j ≤ (t_k - t_k-1), then the interval is schedulable; otherwise, the messages with nonzero allocations in Δ_k require more time than the interval length. Once again, this problem may be solved as a standard integer programming problem. The number of constraints is proportional to the number of messages in the set, but the number of variables is O(N_lfs). Depending upon the particular path assignment for the messages in Q_n^k, the number of variables may be O(2^Nk), where N_k is the number of messages in Q_n^k.
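For a small Q_n^k the link feasible sets can be enumerated outright and the minimization solved as a relaxed linear program, mirroring the formulation above. A Python sketch; as the text notes, the number of sets can grow exponentially, so this is practical only for small message counts.

from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def link_feasible_sets(msgs, links_of):
    # all non-empty subsets of msgs in which no two messages share a link
    sets = []
    for r in range(1, len(msgs) + 1):
        for combo in combinations(msgs, r):
            used = [l for m in combo for l in links_of[m]]
            if len(used) == len(set(used)):
                sets.append(combo)
    return sets

def schedule_interval(msgs, links_of, alloc, interval_length):
    """alloc[m]: time allocated to message m in this interval (p_ik).
    Returns a list of (link feasible set, duration y_j) if schedulable, else None."""
    sets = link_feasible_sets(msgs, links_of)
    # minimize the sum of y_j subject to: for each message, the y_j of the sets
    # containing it must sum to its allocation p_ik
    A_eq = np.array([[1.0 if m in s else 0.0 for s in sets] for m in msgs])
    b_eq = np.array([alloc[m] for m in msgs], dtype=float)
    res = linprog(c=np.ones(len(sets)), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success or res.fun > interval_length + 1e-9:
        return None                          # allocations do not fit in the interval
    return [(s, y) for s, y in zip(sets, res.x) if y > 1e-9]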

5.4 Computation of Node Switching Schedules

Solution of the interval scheduling problem for each interval of each maximal set enables derivation of the ω_i's. A typical scheduling command in ω_i contains the time of execution and the sources of the data for the input buffers and output channels. The sources are the input channels and the output buffers. The link feasible sets can be transmitted in any sequence within an interval. Given a sequence, the starting time for the messages in a set is determined by adding the total time for which the sets before it are scheduled to the starting time of the interval. This time is then used to derive the scheduling commands. The maximum number of scheduling commands in an ω_i is the number of packets that can be transmitted in [0, τ_in]. Synchronization of the CPs in the path of a message required by such routing can be achieved by periodic synchronizing messages [Shu90].
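Once each interval has an ordered sequence of link feasible sets, start times follow by accumulation and per-node commands can be emitted for every hop on a message's path. The command format below is a Python illustration only; the paper does not specify the CP command encoding.

from dataclasses import dataclass

@dataclass
class SwitchCommand:
    time: float      # when the CP controller applies this connection
    node: int        # CP that executes the command
    inp: str         # input channel or local output buffer feeding the crossbar
    out: str         # output channel or local input buffer receiving the data

def commands_for_interval(interval_start, scheduled_sets, path_hops):
    """scheduled_sets: ordered (message set, duration) pairs from interval scheduling;
    path_hops[m]: list of (node, inp, out) hops on the path of message m."""
    cmds, t = [], interval_start
    for msg_set, duration in scheduled_sets:
        for m in msg_set:
            for node, inp, out in path_hops[m]:
                cmds.append(SwitchCommand(time=t, node=node, inp=inp, out=out))
        t += duration                        # the next set starts after this one
    return cmds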

6 Performance of Scheduled Routing

In order to compare the performance of SR with WR, we have simulated the latter and computed the former for various tori and GHCs. Our objective is to observe the OI due to WR for various values of load on the network and to check if Ω is obtained for the same value of the input period. If it is, the throughput corresponds to τ_in and the latency to the critical path length obtained after assigning time bounds to all the messages.

Normalized load is defined as the ratio of the minimum possible input arrival period to the input arrival period, i.e., τ_c/τ_in. Normalized throughput is defined as the ratio τ_in/τ_out, where τ_out is the output generation interval. Normalized latency is the ratio λ/Λ, where λ is the latency of an invocation and Λ is the critical path length. If WR results in OI, we indicate it by plotting an up-down spike for the corresponding load value to indicate that the throughput and the latency are not constant for all invocations at that input period. The maximum (minimum) value of the upward (downward) spike corresponds to the maximum (minimum) value of the output generation interval when the simulation is observed over several invocations. The middle value corresponds to the average of the output generation interval values over the invocations considered to obtain the maximum and minimum values. The spikes in the latency plots are obtained similarly. All tasks are assumed to take the same time; otherwise, the throughput is determined by the longest task and the APs processing the smaller tasks are underutilized. Thus, this assumption does not affect the effective throughput. Processing speeds of the APs of the multicomputer have been selected in such a way that τ_m/τ_c = 1 for B = 64 bytes/μsec and 0.5 for B = 128 bytes/μsec. Thus, the first case represents a more communication-intensive application than the second case. Twelve different values of the input period (τ_in) are selected between its minimum value of τ_c and 5τ_c. Very large values of the input period are not interesting because messages from different invocations do not contend with each other.

Fig. 7 shows the throughput and latency using the WR and SR techniques plotted on the same axes against the load for a binary 6-cube with a link bandwidth of 64 bytes/μsec. The spikes on the throughput and latency plots for WR indicate OI for the corresponding input periods. Throughput spikes are sharper than the latency spikes because the latency value is much larger than the input period. It should be noted that when the middle value of the spike exceeds unity, it indicates that the latency increases monotonically with WR. Even for the largest input period, WR shows OI. SR gives a feasible schedule for all the load values for which a maximum utilization of unity or less was obtained by AssignPaths. The remaining load points give a utilization value greater than unity, precluding any feasible schedule computation (refer to Fig. 5). When the bandwidth is increased to 128 bytes/μsec, feasible path assignments and schedules are obtained for all load points. Fig. 8 shows the results for a 4 x 4 x 4 GHC. Being a topology with more links, it reaches the required utilization value for more load points than a binary 6-cube at B = 64 bytes/μsec. Except for loads of 0.5 and 1.0, a feasible schedule is obtained for all. We note the appearance of OI at B = 128 bytes/μsec and its removal by SR. Due to the smaller number of alternative paths in tori, the lowest value of maximum utilization reached is greater than unity for all load points in both the tori at 64 bytes/μsec (refer to Fig. 6). At 128 bytes/μsec, better performance is obtained and is shown in Figs. 9 and 10 for the 8 x 8 and 4 x 4 x 4 tori respectively. At the higher bandwidth, the 8 x 8 torus reaches the required utilization for all load points. Of these, no feasible schedule is obtained for the three marked by arrows due to failure of message-interval allocation. SR removes the OI at load 0.3333. For the 4 x 4 x 4 torus, SR removes all instances of OI as shown in Fig. 10 and enables operation at the highest load while WR does not.

We have used a WR model in which a channel is considered occupied if a message captures it. In a stricter model, each channel would be multiplexed between two virtual channels. As a result, the bandwidth available to a message is halved and the instances of OI are likely to increase.

7 Concluding Remarks

In SR, the a priori knowledge about the TFG is exploited to eliminate the run-time overhead of message routing. Since all the CPs execute their schedules independently, this technique is scalable to larger multicomputers if Ω can be computed. Existence of Ω depends upon the TFG characteristics, the input arrival period, and the effectiveness of task allocation and path assignment. Since allocation determines the set of alternative paths for each message, coupling it with path assignment so as to set up less stringent constraints for SR computation should be explored. Introduction of feedback between the steps in Fig. 3 to make the scheduling problems simpler as well as smaller for large networks deserves investigation. To ensure that all CPs in the path have set up the required connections before message transmission begins, a time interval equal to or greater than twice the maximum difference between two clocks could be allowed to elapse before starting transmission. Formulations of message-interval allocation and interval scheduling should be modified to account for these margins, and the tightness of CP synchronization required should be studied. Finally, the suitability of SR to cases where complete knowledge of the application is not available should also be studied.

SR achieves our objective of providing a constant throughput where WR gives OI in most cases. In addition, it enables operation of the multicomputer at the maximum possible throughput even when WR does not provide a bounded turnaround time. In general, OI may appear depending upon the presence of conditions similar to the ones outlined in Section 3. SR helps remove such undesirable behavior. Firstly, it enables prediction of system performance at compile-time by deciding if the network meets the communication requirements. Secondly, it provides a method of flow-control that guarantees satisfaction of these requirements. It can be implemented using simple hardware at each node and is scalable, deadlock-free, and contention-free.

References

[Agr86] D. P. Agrawal. Advanced Computer Architecture. IEEE Computer Society, 1986.

[AJ88] R. Agrawal and H. V. Jagadish. Partitioning techniques for large-grained parallelism. IEEE Transactions on Computers, 37(12):1627-1634, December 1988.

[BDW86] J. Blazewicz, M. Drabowski, and J. Weglarz. Scheduling multiprocessor tasks to minimize schedule length. IEEE Transactions on Computers, pages 389-393, May 1986.

[Dal87] W. J. Dally. A VLSI Architecture for Concurrent Data Structures. Kluwer Academic Publishers, 1987.

[LM81] E. L. Lawler and C. U. Martel. Scheduling periodically occurring tasks on multiple processors. Information Processing Letters, 12(1), 1981.

[Lo88] V. M. Lo. Heuristic algorithms for task assignment in distributed systems. IEEE Transactions on Computers, pages 1384-1397, November 1988.

[Nga89] John Y. Ngai. A Framework for Adaptive Routing in Multicomputer Networks. PhD thesis, California Institute of Technology, 1989.

[Shu90] Shridhar B. Shukla. On Parallel Processing for Real-time Artificial Vision. PhD thesis, North Carolina State University, Raleigh, NC 27695, 1990.

[WRHR88] C. Weems, E. Riseman, A. Hanson, and A. Rosenfeld. An integrated image understanding benchmark: Recognition of a 2 1/2-D mobile. In Proceedings: Image Understanding Workshop, volume 1, pages 111-126. DARPA, April 1988.

Figure 1: TFG for DARPA Benchmark

Figure 2: Communication processor (crossbar, multiplexer, controller, and input/output buffers for the local AP, connected to adjacent nodes)

Figure 3: Steps in scheduled routing (inputs: TFG and topology; output: flow-control schedule)

/* AssignPaths: to minimize U */
compute assignment initial randomly;
set best and current to initial;
AFLAG = false;                        /* termination flag */
while not (AFLAG)
begin
    do                                /* iterative improvement */
    begin
        IFLAG = false;                /* false when current cannot be improved */
        find L_j / (L_j, D_k) in current with peak U;
        find reroutable messages on L_j / (L_j, D_k);
        for each such message
            select a path with largest peak reduction / peak repositioning;
        if there is a reroute to reduce the peak
            select the reroute with the largest reduction;
        else select a reroute to reposition the peak;
        if there is a reroute that changes peak(current)
        begin
            select it to update current;
            IFLAG = true;
        end
    end
    while (IFLAG);
    if ((peak(current) < peak(best)) or
        (position(current) != position(best)))
    begin
        set best to current;
        /* attempt to escape local minima */
        assign paths randomly to each message;
    end
    else AFLAG = true;
end

Figure 4: Path assignment heuristic

Figure 5: U for GHCs with AssignPaths (peak utilization vs. normalized load for the DVB, LSD-to-MSD vs. final assignment; binary 6-cube and GHC (4,4,4), B = 64 bytes/μsec)

Figure 6: U for tori with AssignPaths (peak utilization vs. normalized load for the DVB, LSD-to-MSD vs. final assignment; B = 64 bytes/μsec)

Figure 7: DVB on binary 6-cube (normalized throughput and latency vs. normalized load for WR and SR at B = 64 and 128 bytes/μsec; U > 1.0 when load > 0.3636 at B = 64 bytes/μsec)

Figure 8: DVB on 4 x 4 x 4 GHC (normalized throughput and latency vs. normalized load for WR and SR at B = 64 and 128 bytes/μsec)

Figure 9: DVB on 8 x 8 torus (normalized throughput and latency vs. normalized load for WR and SR at B = 128 bytes/μsec; message-interval allocation fails at three marked load points)

Figure 10: DVB on 4 x 4 x 4 torus (normalized throughput and latency vs. normalized load for WR and SR at B = 128 bytes/μsec)