Scheduling Pipelined Communication in Distributed Memory Multiprocessors for Real-time Applications
Shridhar B. Shukla
Code EC/Sh, Dept. of Elect. & Computer Engineering
Naval Postgraduate School
Monterey, CA 93943-5000
and
Dharma P. Agrawal
Computer Systems Laboratory
Dept. of Elect. & Computer Engineering
North Carolina State University
Raleigh, NC 27695-7911
Abstract: This paper investigates communication in
distributed memory multiprocessors to support task-
level parallelism for real-time applications. It is shown
that wormhole routing, used in second generation mul-
ticomputers, does not support task-level pipelining be-
cause its oblivious contention resolution leads to out-
put inconsistency in which a constant throughput is
not guaranteed. We propose scheduled routing which
guarantees constant throughputs by integrating task
specifications with flow-control. In this routing tech-
nique, communication processors provide explicit flow-
control by independently executing switching sched-
ules computed at compile-time. It is deadlock-free,
contention-free, does not load the intermediate node
memory, and makes use of the multiple equivalent
paths between non-adjacent nodes. The resource allocation
and scheduling problems resulting from such routing are
formulated and related implementation issues are analyzed.
A comparison with wormhole routing for various generalized
hypercubes and tori shows that scheduled routing is effective
in providing a constant throughput when wormhole routing
does not and enables pipelining at higher input arrival rates.
1 Introduction
Mapping an application on multicomputers involves
partitioning, task allocation, node scheduling, and
message routing. These steps are interdependent. For
example, locations of the sources and destinations of
messages, which determine the possible paths, are
fixed by task allocation.
Similarly, node scheduling affects the instants of message
injection, and thus affects channel contention. In spite of
such interdependence, which determines the final system
performance, researchers have typically investigated these
steps separately. For example, routing algorithms are analyzed
to optimize the network latency and not the system-level
performance [Dal87, Nga89]. The traffic generated by a node
and its frequency of communication with others are assumed
identical for all nodes. While many scheduling and allocation
strategies make use of deterministic models [Lo88], routing
algorithms are studied for statistically generated traffic
patterns. Although such separate considerations may be
justified to make the algorithms amenable to analysis, they
are unjustified in real-time situations where system-level
performance is of primary importance.
Shridhar Shukla is supported in part by the Research
Initiation Program of NPS and Dharma Agrawal is
supported in part by NSF grant No. MIP-8912767.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
When input data arrives periodically and a se-
quence of processing steps must be carried out on
each input set, as in artificial vision, task-level pipelin-
ing has been shown to be effective [Agr86]. It refers
to periodic invocation of a set of tasks. For its suc-
cessful implementation, the mapping procedure must
ensure that the multicomputer supports a processing
rate equal to the input arrival rate. However, worm-
hole routing (WR), used in second generation mul-
ticomputers such as the Intel iPSC/2 and Symult Se-
ries 2010, is oblivious to the application requirements;
and hence, cannot guarantee a constant throughput
for task-level pipelining.
We propose scheduled routing (SR) which integrates task
specifications with flow-control by executing node switching
schedules computed at compile-time and guarantees constant
throughput. The resource allocation and scheduling problems
resulting from such routing are formulated and related
implementation issues are analyzed. A comparison with WR
for various generalized hypercubes (GHCs) [Agr86] and tori
shows that SR is effective in providing a constant throughput
when WR does not and enables pipelining at higher input
arrival rates. The organization of this paper is as follows.
A graphical model of computation for task-level pipelining is
described in Section 2. The effect of WR on throughput and
conditions that lead to output inconsistency are described in
Section 3. Communication requirements for pipelining a set
of tasks and an introduction to SR are presented in Section 4.
Computation of an explicit flow-control schedule to implement
SR is presented in Section 5. Performance results for 64-node
GHCs and tori are presented in Section 6. The paper concludes
with remarks on SR and suggestions for further research.
© 1991 ACM 0-89791-394-9/91/0005/0222 $1.50
2 Task-level Pipelining
A task-flow graph (TFG) is a directed acyclic graph whose
vertices represent tasks and whose directed edges represent
messages between tasks. Each task is a set of operations
executed sequentially. A message consists of data or
information transferred between tasks. A TFG is specified as
a 2-tuple of sets $\{S_T, S_M\}$.
$S_T = \{T_1, T_2, \ldots, T_{N_t}\}$ is the set of $N_t$ tasks.
Each $T_i$ is associated with $C_i$, the number of operations to
be performed to execute $T_i$.
$S_M = \{M_1, M_2, \ldots, M_{N_m}\}$ is the set of $N_m$ messages
between tasks. Message $M_i$ contains $m_i$ bytes and is sent from
task $T_{i_s}$ to task $T_{i_d}$. $T_{i_d}$ cannot be started before
$M_i$ reaches it. $S_M$ defines a precedence relationship over
$S_T$ such that $T_j \prec T_k$ if there is a path from $T_j$ to
$T_k$. This model treats identical messages from a task as
distinct at the application level if they are destined for
different tasks. In task-level pipelining, the TFG is executed
periodically and a task sends messages to other tasks at the
end of, and not during, its execution. TFG execution begins
when tasks with no preceding tasks, referred to as input
tasks, begin upon receipt of an external signal such as input
arrival. It is completed when all tasks that do not precede
any other, called output tasks, complete. The start of a TFG
execution by input arrival is called an invocation. Depending
upon the interval between the start of successive invocations,
the task sizes, and precedence relationships among tasks,
different tasks execute in different invocations at the
same time.
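As an illustration of the TFG model, the following Python sketch stores a small graph as a task list and a message list, identifies its input and output tasks, and tests the precedence relation by searching for a directed path. The four-task graph, the message sizes, and the function names are hypothetical and are not taken from the DVB example.

```python
from collections import defaultdict

def input_and_output_tasks(tasks, messages):
    """tasks: iterable of task names; messages: list of (src, dst, nbytes)."""
    has_pred = {dst for _, dst, _ in messages}   # tasks that receive a message
    has_succ = {src for src, _, _ in messages}   # tasks that send a message
    inputs = [t for t in tasks if t not in has_pred]
    outputs = [t for t in tasks if t not in has_succ]
    return inputs, outputs

def precedes(messages, a, b):
    """True if there is a directed path from task a to task b."""
    succ = defaultdict(list)
    for src, dst, _ in messages:
        succ[src].append(dst)
    stack, seen = [a], set()
    while stack:
        t = stack.pop()
        for nxt in succ[t]:
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Hypothetical diamond-shaped TFG: T1 feeds T2 and T3, which both feed T4.
tasks = ["T1", "T2", "T3", "T4"]
messages = [("T1", "T2", 512), ("T1", "T3", 512),
            ("T2", "T4", 256), ("T3", "T4", 256)]

print(input_and_output_tasks(tasks, messages))  # (['T1'], ['T4'])
print(precedes(messages, "T1", "T4"))           # True
print(precedes(messages, "T2", "T3"))           # False
```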
Assume a link bandwidth $B$ and processing speed $s_i$ for
task $T_i$. Let $\tau_c$ be the processing time for the longest
task and $\tau_m$ be the transmission time for the longest
message. In general, partitioning techniques attempt to
minimize the communication overhead [AJ88]. In large-grain
parallelism represented by TFGs, it is reasonable to assume
that the longest task is longer than the longest message.
Maximum throughput is obtained when the input arrival period
$\tau_{in} = \tau_c$. The minimum latency for an invocation can
be computed in terms of the critical path length. Consider a
sequence
$S_{CP} = \{T_{i_1}, M_{j_1}, T_{i_2}, M_{j_2}, \ldots, T_{i_k}, M_{j_k}, T_{i_{k+1}}, \ldots, T_{i_L}\}$
of tasks and messages such that $T_{i_1}$ is an input task,
$T_{i_L}$ is an output task, and $T_{i_k} \prec T_{i_{k+1}}$,
$1 \le k \le (L-1)$. The sum, $\Lambda$, of the task execution and
message transfer times in $S_{CP}$ is called the critical path
length of the TFG if it is maximum among all such sequences,
and $S_{CP}$ is called the critical path. Note that there may be
several critical paths in a TFG. The minimum time for
completion of an invocation is $\Lambda$. Let $t_c^j(\cdot)$ denote
the time at completion, in the $j$th TFG invocation, of a task
or message transmission that appears in the parentheses.
Similarly, let $t_s^j(\cdot)$ denote the start of such events in
the $j$th invocation. Since all input tasks start executing
simultaneously, the latency for invocation $j$, $\lambda_j$, is
computed as $\max_i(t_c^j(T_i)) - t_s^j(T_k)$ where $T_k$ is any
input task and $T_i$ is any output task. If $\tau_{out}^j$ is the
interval between completions of the $j$th and $(j-1)$th
invocations, then
$$\tau_{out}^j = \tau_{in} + \lambda_j - \lambda_{j-1}$$
If the invocation period is $\tau_{in}$, the TFG is pipelined
successfully if
$$\tau_{out}^j = \tau_{in} \quad \forall j = 1, 2, 3, \ldots \qquad (1)$$
Obviously, $\tau_{in} \ge \tau_c$; otherwise, there will be an
infinite accumulation at the input of the slowest task. Eq. (1)
states that the interval between arrival of two inputs must be
the same as that between the generation of two outputs for any
pair of successive invocations. If this condition is violated,
the TFG pipelining is defined to have output inconsistency (OI).
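The critical path length and the pipelining condition of Eq. (1) can be made concrete with a short Python sketch. It computes the longest task-plus-message chain of a small TFG and then derives the output intervals from a list of per-invocation latencies. The task times, transfer times, latency values, and function names are hypothetical; they are chosen only to show one consistent and one inconsistent case.

```python
from functools import lru_cache

def critical_path_length(exec_time, messages):
    """exec_time: {task: time}; messages: list of (src, dst, transfer_time)."""
    succ = {t: [] for t in exec_time}
    for src, dst, xfer in messages:
        succ[src].append((dst, xfer))

    @lru_cache(maxsize=None)
    def longest_from(task):
        # Longest chain of task and message times starting at `task`.
        tail = max((xfer + longest_from(dst) for dst, xfer in succ[task]),
                   default=0)
        return exec_time[task] + tail

    # With positive times the maximum is always reached from an input task.
    return max(longest_from(t) for t in exec_time)

def output_intervals(tau_in, latencies):
    """Interval between outputs j-1 and j: tau_in + latency[j] - latency[j-1]."""
    return [tau_in + latencies[j] - latencies[j - 1]
            for j in range(1, len(latencies))]

def has_output_inconsistency(tau_in, latencies):
    """Eq. (1) is violated if any output interval differs from tau_in."""
    return any(t != tau_in for t in output_intervals(tau_in, latencies))

exec_time = {"T1": 4, "T2": 3, "T3": 5, "T4": 2}
messages = [("T1", "T2", 1), ("T1", "T3", 2), ("T2", "T4", 1), ("T3", "T4", 1)]

print(critical_path_length(exec_time, messages))      # 14 (T1, T3, T4 chain)
print(output_intervals(10, [14, 17, 14, 17]))         # [13, 7, 13]
print(has_output_inconsistency(10, [14, 17, 14, 17])) # True
print(has_output_inconsistency(10, [14, 14, 14]))     # False
```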
We have considered an application from the domain of
artificial vision since task-level pipelining is effective in
obtaining real-time vision [Agr86]. As a specific example, the
DARPA vision benchmark (DVB) for evaluating parallel
architectures, which performs model-based object recognition
of a hypothetical object, has been used [WRHR88]. If the image
data arrives periodically, a recognition result must be
computed periodically. A TFG for n object-models is shown in
Fig. 1¹ [Shu90].
3 Inconsistency with Wormhole Routing
Wormhole routing is a blocking variant of cut-through
routing and imposes deterministic path selection via its
routing function to guarantee freedom from deadlocks. When
two messages require a link simultaneously, its flow-control
hardware resolves contention using a first-come-first-served
(FCFS) policy [Dal87]. For task-level pipelining, the latency
for each invocation is determined by the delays encountered by
messages in that invocation. Given a task allocation, the
predetermined routing function determines the shared links and
associated messages. Since messages of different invocations
co-exist when task-level pipelining is implemented, this
service policy results in delays that are unequal over
successive invocations, causing OI. It results if the
conditions given below are true with respect to the routing
function, TFG messages, and the invocation period.
1 The number of operations is estimated from the data supplied
with the sequential implementation and the data transferred
is estimated from the size of data structures.
Claim: Let $M_1$ from $T_{1_s}$ to $T_{1_d}$ and $M_2$ from $T_{2_s}$
to $T_{2_d}$ be two messages such that $T_{1_d} \prec T_{2_s}$ and all
four tasks are in the critical path. If the paths assigned to
$M_1$ and $M_2$ have a link in common and $\tau_{in}$ is such that
$t_s^0(M_2) < t_s^0(M_1) + \tau_{in} < t_c^0(M_2)$, then output
inconsistency is present.
Proof: Since $M_2$ of invocation 0 is available for transmission
before $M_1$ of invocation 1, it gains control of the shared
link. Transmission of $M_1$ in invocation 1 is blocked until
$M_2$ relinquishes the shared link. The delay in starting $M_1$
in invocation 1 is
$\delta_1^1 = t_c^0(M_2) - t_s^0(M_1) - \tau_{in}$. (An implicit
assumption here is that $M_2$ continues to use all its links
until it is received at the destination. This is justified
since the transmission time is insensitive to the distance
after path setup and messages in large-grain parallelism
represented by TFGs are sufficiently long for the transmission
delay to dominate the propagation delay.) Thus, if the delay
$\delta_1^1$ in arrival of $M_1$ at $T_{1_d}$ postpones the start of
$T_{1_d}$ in invocation 1, it will also affect the start of
$T_{2_s}$ in invocation 1 since $T_{1_d} \prec T_{2_s}$. Therefore,
$t_s^1(M_2) = \tau_{in} + t_s^0(M_2) + \delta_1^1$. Thus, the output
of invocation 1 takes $\delta_1^1$ longer to appear than the output
of invocation 0. If $\delta_1^1$ is such that
$t_s^1(M_2) > 2\tau_{in} + t_s^0(M_1)$, then message $M_1$ of
invocation 2 is available for transmission before $M_2$ of
invocation 1. Thus, $t_s^2(M_1) < t_s^1(M_2)$ and $M_1$ is not
delayed. Therefore, the start of $T_{1_d}$ is not delayed in
invocation 2. This cycle repeats and, in alternate invocations,
the start of $T_{1_d}$ is delayed. As a result, successive
outputs are generated after unequal intervals. If $\delta_1^1$ is
such that $t_s^1(M_2) < 2\tau_{in} + t_s^0(M_1)$, then $t_s^i(M_1)$
becomes less than $t_s^{i-1}(M_2)$ only after more than one
invocation. Eventually, when it does, $t_s^i(M_1)$ is not delayed
and the cycle repeats. This shows that, depending upon the TFG
and multicomputer parameters, the FCFS flow-control in WR leads
to output inconsistency when used for task-level pipelining. ■
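The argument of the proof can be reproduced with a small simulation. The Python sketch below is a deliberately simplified model and not the simulator used for the results of this paper: it assumes one shared link, a message M1 released every `tau_in` time units, and a message M2 of the same invocation that becomes available a fixed `gap` after M1 completes (standing in for the task chain between the two messages). The link serves requests in FCFS order, with ties given to M2 as an arbitrary choice. All parameter values are hypothetical.

```python
def fcfs_shared_link(tau_in, len1, gap, len2, invocations):
    """Return the completion time of M2 in each invocation on one FCFS link."""
    link_free = 0
    m1_done, m2_done = [], []          # completion times per invocation
    while len(m2_done) < invocations:
        j1, j2 = len(m1_done), len(m2_done)
        rel1 = j1 * tau_in             # M1 of invocation j1 is released periodically
        # M2 of invocation j2 is released only after M1 of invocation j2 is done.
        rel2 = m1_done[j2] + gap if j2 < j1 else None
        if rel2 is not None and rel2 <= rel1:
            start = max(rel2, link_free)
            link_free = start + len2
            m2_done.append(link_free)
        else:
            start = max(rel1, link_free)
            link_free = start + len1
            m1_done.append(link_free)
    return m2_done

# M2 of invocation 0 holds the link over [8, 13]; M1 of invocation 1 arrives at 10.
done = fcfs_shared_link(tau_in=10, len1=2, gap=6, len2=5, invocations=6)
print(done)                                           # [13, 27, 33, 47, 53, 67]
print([b - a for a, b in zip(done, done[1:])])        # [14, 6, 14, 6, 14]
```

With an input period of 10, the outputs appear alternately 14 and 6 time units apart. The average rate matches the input rate, but Eq. (1) is violated in every invocation, which is the alternating pattern described in the proof.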
Although we have considered only a pair of interacting
messages above, when a complex TFG is mapped on a large
network, several messages may interact to cause OI. Even when
path selection is sensitive to the network load and makes use
of the multiple equivalent paths in the network, as in adaptive
cut-through routing [Nga89], OI may result. For example,
consider three messages: $M_1$ from $T_{1_s}$ to $T_{1_d}$, $M_2$
from $T_{2_s}$ ($= T_{1_s}$) to $T_{2_d}$, and $M_3$ from $T_{3_s}$ to
$T_{3_d}$, such that $T_{2_d} \prec T_{3_s}$, and $T_{2_d}$, $T_{3_s}$,
$T_{3_d}$ are in the critical path. Assume that task allocation
places messages $M_1$ and $M_3$ between adjacent nodes and $M_2$
between non-adjacent nodes. If message $M_1$ blocks the first
link in one of the paths of $M_2$ and the adaptive flow-control
commits it to a path that contains a link used by $M_3$, an
argument similar to the one above can be given to show that OI
results.
4 Scheduled Routing
In WR, messages which are less important to the completion
of the current invocation may receive priority over others. SR
prevents this by exercising explicit flow-control which
guarantees constant throughputs. In this technique, path
selection and flow-control are based on TFG communication
requirements derived using Eq. (1). For maximum throughput
($\tau_{in} = \tau_c$), each task is executed once every $\tau_c$
time units and it must receive messages from its preceding
tasks at the same rate. Since each task begins as soon as
messages from all preceding tasks are received, in the 0th
invocation, $M_i$ is available for transmission at
$r_i^0 = t_c^0(T_{i_s})$ and must complete transmission by
$d_i^0 = r_i^0 + \tau_c$. Pipelining of periodically arriving
messages is possible if each message, $M_i$, is assigned a
release-time, $r_i = t_c^0(T_{i_s})$, and a deadline,
$d_i = \tau_c + t_c^0(T_{i_s})$, both taken within the time frame
$[0, \tau_{in}]$, so that a deadline may wrap around and precede
its release-time. $M_i$ must be transmitted in the interval
$[r_i, d_i]$ if $d_i > r_i$, or in $[0, d_i]$ and $[r_i, \tau_{in}]$
when $d_i < r_i$, to guarantee its deadline when the TFG is
pipelined with period $\tau_{in} \ge \tau_c$. By allowing each
message transmission to be as long as the longest task, latency
may increase, but the maximum possible throughput remains the
same. Since all messages are generated with the same frequency,
these time bounds enable consideration of all successively
generated messages between any pair of communicating processors
by observing only a single time frame of $[0, \tau_{in}]$.
Therefore, multiple messages in transit between a pair of
processors need not be taken into account explicitly. These
time-bounds lay the foundation for SR and are used with path
selection and flow-control by computing communication schedules
to be executed independently by communication processors (CPs)
at the multicomputer nodes.
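The mapping of a message's time bounds onto a single time frame can be sketched in Python as follows. The sketch assumes that release-times and deadlines are reduced modulo the invocation period, which is one reading of the wrap-around case described above, and that the deadline is at most one period after the release. The function name and the numeric values are hypothetical.

```python
def transmission_windows(release, deadline, tau_in):
    """Return the window(s) inside [0, tau_in] in which a message may be sent.

    Assumes release <= deadline <= release + tau_in (absolute times).
    """
    r, d = release % tau_in, deadline % tau_in
    if d > r:
        return [(r, d)]                     # single window [r, d]
    # Deadline wraps around the frame: [0, d] and [r, tau_in].
    windows = [(0, d), (r, tau_in)]
    return [(a, b) for a, b in windows if b > a]   # drop empty windows

print(transmission_windows(4, 9, 12))    # [(4, 9)]
print(transmission_windows(10, 15, 12))  # [(0, 3), (10, 12)]
print(transmission_windows(4, 16, 12))   # [(0, 4), (4, 12)]
```

In the last call the deadline is exactly one period after the release, so the message may use the whole frame.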
4.1 Node Switching Schedules
SR transmits a message only when there is a clear path
between its source and destination. Since each message has an
unobstructed path, the run-time flow-control overhead is
eliminated and deadlocks cannot occur. Such transmission is
achieved by compile-time determination of a path for $M_i$ in
terms of its source and destination addresses and its time
bounds. Consider a communication schedule, $\Omega$, that
specifies a node switching schedule, $\omega_i$, for each node
$N_i$ of the multicomputer. $\omega_i$ specifies the connections
between input and output channels to be set up in the CP at
node $N_i$ over $[0, \tau_{in}]$. The node switching schedules are
such that execution of $\omega_i$ at $N_i$ results in a clear path
for each message for the required duration within its timing
constraints.
Fig. 2 shows a functional block diagram of the CP. The
controller sets up the crossbar and multiplexer according to
the switching schedule. If the topology has degree $n$, each CP
has $n + 1$ input and $n + 1$ output connections, $n$ of which
are to and from adjacent nodes. It also connects to the
input/output buffers of the local application processor (AP).
It has separate buffers for each input and output channel,
enabling it to simultaneously transmit and receive messages on
separate channels for its AP. A message coming from a
neighboring CP, however, can be sent on only one outgoing
channel. By simultaneously executing these schedules at all
CPs, clear paths are established for delivering each message
within its timing constraints. Computation and execution of the
$\omega_i$'s assumes the basic time unit to be the time for a
single packet transmission and ignores the propagation delay in
comparison with the transmission delay. Since we are exploiting
task-level parallelism, the message size is assumed to be large
enough for the transmission delay to dominate. The switching
time of each CP is assumed to be negligible. The network links
support bidirectional, half-duplex traffic and the time taken
to reverse the direction of traffic on a link is also assumed
negligible. Every packet of a message travels along the same
path.
SR exploits multiple equivalent paths between non-adjacent
nodes so that none of the network links is overloaded in any
part of $[0, \tau_{in}]$. Path assignment to messages should be
such that the traffic is spread out evenly over the network and
time. When the time available for transmission of a message is
greater than its length, it has a slack which can be used in
computing the $\omega_i$'s. Thus, a message is allocated to
different intervals of $[0, \tau_{in}]$ based upon its links and
timing constraints. We refer to this problem of partitioning a
message among sub-intervals of $[0, \tau_{in}]$ as the
message-interval allocation problem. In addition, simultaneous
availability of all the links for a message must be ensured.
Therefore, all the message allocations to a single interval are
scheduled explicitly for transmission within the interval. We
refer to this problem as the interval scheduling problem. Thus,
the process of computing $\Omega$ must be carried out using the
steps shown in Fig. 3. The possibility of feedback in this
process is discussed in the last section.
5 Flow-control Scheduling
5.1 Path Assignment
An optimal path assignment is one that leads to a feasible
$\Omega$ if one exists. An $\Omega$ is feasible if it does not
require any link of the network to exceed its capacity and
ensures that no link is required by two messages at the same
time. To find such an assignment requires solution of the
message-interval allocation and interval scheduling problems
for each assignment. This is prohibitively expensive for any
reasonable size of the network. It is possible, however, to
eliminate certain path assignments by checking if they satisfy
the necessary conditions for subsequent computation of a
feasible $\Omega$. Thus, the net utilization of each link over
the invocation period and its intervals gives a criterion to
accept or reject a path assignment.
Given message time bounds, two different utilizations may be
defined. The distinct $r_i$ and $d_i$ values of all messages
divide $[0, \tau_{in}]$ into non-overlapping intervals. Let
$t_0 = 0 < t_1 < t_2 < t_3 < \ldots < t_K = \tau_{in}$ be the
distinct endpoints that delimit the $K$ intervals in which
messages are available for transmission. Interval
$[t_{k-1}, t_k]$ is denoted by $A_k$ and a message is active in
$A_k$ if it is available for transmission during
$[t_{k-1}, t_k]$. Let $A = [a_{ik}]$ be an $N_m \times K$ matrix,
called the message activity matrix, such that $a_{ik} = 1$ if
$M_i$ is active in $A_k$ and 0 otherwise. For each message,
$$\frac{m_i}{B} \le \sum_{k=1}^{K} a_{ik}\,(t_k - t_{k-1}) \qquad (2)$$
An equality in (2) indicates that $M_i$ has no slack. A no-slack
message utilizes the links in its path fully in its active
intervals and no other message can use them during these
intervals. On the other hand, if (2) is an inequality, links
used by $M_i$ have room for carrying other messages during its
active intervals. The net utilization over $[0, \tau_{in}]$ of
links used by messages with slack can be determined, but their
utilization profile cannot be determined unless
message-interval allocation is done. If there are $N_l$ links in
the topology, define an $N_m \times N_l$ path assignment matrix
as $B = [b_{ij}]$ such that $b_{ij} = 1$ if message $M_i$ uses link
$L_j$ and $b_{ij} = 0$ otherwise.
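To illustrate how the interval endpoints and the message activity matrix follow from the time bounds, the Python sketch below builds both from per-message transmission windows and computes the slack of a message as its total active time minus its transmission time. The window values, message sizes, bandwidth, and function names are hypothetical.

```python
def activity_matrix(windows, tau_in):
    """windows[i]: list of (start, end) windows of message i inside [0, tau_in].

    Returns the sorted interval endpoints and the 0/1 activity matrix.
    """
    points = sorted({0, tau_in} | {p for w in windows for iv in w for p in iv})
    A = [[int(any(s <= points[k] and points[k + 1] <= e for s, e in w))
          for k in range(len(points) - 1)]
         for w in windows]
    return points, A

def slack(A_row, points, nbytes, bandwidth):
    """Active time of a message minus its transmission time (0 means no slack)."""
    active = sum(a * (points[k + 1] - points[k]) for k, a in enumerate(A_row))
    return active - nbytes / bandwidth

# Two messages in a frame of length 12; the second one wraps around the frame.
windows = [[(4, 9)], [(0, 3), (10, 12)]]
points, A = activity_matrix(windows, 12)

print(points)                          # [0, 3, 4, 9, 10, 12]
print(A)                               # [[0, 0, 1, 0, 0], [1, 0, 0, 0, 1]]
print(slack(A[0], points, 320, 64))    # 0.0 (no-slack message)
print(slack(A[1], points, 128, 64))    # 3.0
```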
Definition 5.1 Link utilization, $U_j$, for link $L_j$ is defined
as the ratio of the sum of transmission times of all messages
carried by $L_j$ and the sum of lengths of all intervals in which
at least one message is active on $L_j$. Thus, if $Q_j$ is the set
of values of index $k$ such that $\exists$ a message $M_i$ with
$a_{ik} b_{ij} \ne 0$,
$$U_j = \frac{\sum_{i=1}^{N_m} b_{ij}\, m_i / B}{\sum_{k \in Q_j} (t_k - t_{k-1})}$$
$U_j \le 1$ implies that the total transmission requirement of
all the messages using link $L_j$ does not exceed the total time
in which it is used. It, however, does not imply absence of
hot-spots characterized by a combination of no-slack messages
using a link in the same interval. A path assignment must also
eliminate such hot-spots in the network. Therefore, spot
utilization may be defined as below.
Definition 5.2 Spot utilization, $U_{jk}$, is defined as the
number of no-slack messages using $L_j$ in the interval
$[t_{k-1}, t_k]$. Thus, $U_{jk} = \sum_i a_{ik} b_{ij}$, summed over
all $M_i$ for which (2) is an equality.
The path assignment must be such that $U_{jk} \le 1$ for all the
spots. To increase the probability of finding a feasible
$\Omega$, path assignment minimizes
$$U = \max(U_j, U_{jk})$$
where the maximum is taken over all possible values of $j$ and
$k$. If $U \le 1$, SR can be attempted; otherwise, the TFG
communication requirements exceed the link capacity.
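The two utilization measures and their maximum can be computed directly from the activity and path assignment matrices. The following Python sketch does so for small hand-made inputs. The matrices, transmission times, and the function name are hypothetical, and the no-slack test uses exact equality, which is adequate only for integer time units.

```python
def peak_utilization(A, B, points, tx_time):
    """A: Nm x K activity matrix; B: Nm x Nl path matrix;
    points: interval endpoints; tx_time[i]: transmission time of message i."""
    n_msgs, n_links, K = len(A), len(B[0]), len(points) - 1
    length = [points[k + 1] - points[k] for k in range(K)]
    active = [sum(A[i][k] * length[k] for k in range(K)) for i in range(n_msgs)]
    no_slack = [tx_time[i] == active[i] for i in range(n_msgs)]
    peak = 0.0
    for j in range(n_links):
        # Intervals in which at least one message is active on link j.
        used = [k for k in range(K)
                if any(A[i][k] and B[i][j] for i in range(n_msgs))]
        if used:
            load = sum(tx_time[i] for i in range(n_msgs) if B[i][j])
            peak = max(peak, load / sum(length[k] for k in used))  # link utilization
        for k in range(K):
            spot = sum(A[i][k] * B[i][j] for i in range(n_msgs) if no_slack[i])
            peak = max(peak, spot)                                 # spot utilization
    return peak

# One no-slack message and one message with slack sharing link 0.
print(peak_utilization([[1, 0], [1, 1]], [[1, 0], [1, 1]], [0, 5, 10], [5, 4]))  # 1
# Two no-slack messages on the same link in the same interval: a hot-spot.
print(peak_utilization([[1], [1]], [[1], [1]], [0, 5], [5, 5]))                  # 2.0
```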
Optimal Path Assignments
Since the number of alternative paths for a message increases
very rapidly with the number of hops, a large number of distinct
path assignments must be inspected to find the one with the
lowest $U$. For example, if there are $x$ messages with at least
two alternative paths each, the total number of distinct path
assignments is at least $2^x$. Therefore, we provide a heuristic
for path assignment that minimizes the peak utilization.
The algorithm AssignPaths of Fig. 4 performs it-
erative improvement to reduce U. An assignment is
improved in the following manner. If U is due to a
hot-spot, say (Lj, Ak), all the messages that are active
in the Ak and use Lj are identified. When U is due
to a link, all the messages that use it over any of the
K intervals are identified. For each multi-hop message
among these, all the alternative paths are considered
and the one that causes the largest drop in U is iden-
tified. Then, the message-path combination (called a
reroute in Fig. 4) that causes the largest drop among
all multi-hop messages is selected and the message is
reassigned to its new path. It may not be possible to
change U by any reroute except to make the same U
value appear on a different link and/or spot. In this
case, the reroute that causes this change is made so
that the algorithm moves to a different point in the
link-interval space. This helps in reducing U when
there are multiple appearances of the same value in
different links and/or spots. The random assignment
of paths at the end of each iteration helps the algo-
rithm slide out of any local minima.
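The flavor of this iterative improvement can be shown with a much reduced Python sketch. It is not the AssignPaths algorithm of Fig. 4: it assumes a coarse peak measure (the largest per-link sum of transmission times divided by the period, ignoring intervals and hot-spots), it applies only the single reroute with the largest drop in each pass, and it omits the sideways moves and the random reassignment. The candidate paths, link names, and transmission times are hypothetical.

```python
def link_peak(assignment, candidates, tx_time, tau_in):
    """Largest per-link load divided by the period (coarse utilization)."""
    load = {}
    for msg, choice in assignment.items():
        for link in candidates[msg][choice]:
            load[link] = load.get(link, 0) + tx_time[msg]
    return max(load.values()) / tau_in

def assign_paths_greedy(candidates, tx_time, tau_in):
    """candidates[msg]: list of alternative paths, each a tuple of link names."""
    assignment = {msg: 0 for msg in candidates}      # start with the first path
    best = link_peak(assignment, candidates, tx_time, tau_in)
    improved = True
    while improved:
        improved = False
        for msg, paths in candidates.items():
            for choice in range(len(paths)):
                trial = {**assignment, msg: choice}
                peak = link_peak(trial, candidates, tx_time, tau_in)
                if peak < best:                      # best reroute seen so far
                    best, best_move, improved = peak, (msg, choice), True
        if improved:
            assignment[best_move[0]] = best_move[1]
    return assignment, best

candidates = {
    "M1": [("a", "b"), ("c", "d")],
    "M2": [("a",)],
    "M3": [("b", "e"), ("c", "e")],
}
tx_time = {"M1": 4, "M2": 5, "M3": 3}

print(assign_paths_greedy(candidates, tx_time, tau_in=10))
# ({'M1': 1, 'M2': 0, 'M3': 0}, 0.5)
```

Rerouting M1 away from link `a` lowers the coarse peak from 0.9 to 0.5, after which no single reroute helps and the sketch stops. A heuristic of this kind can stall in a local minimum, which is the situation the random reassignment step described above is meant to address.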
Performance of AssignPaths for the DARPA Vision Benchmark
(DVB) TFG on 64-node GHCs and tori is shown in Figs. 5 and 6.
A common method of assigning paths to messages between two
nodes, which also yields deadlock-free routing in GHCs, is to
proceed along the path realized by changing the source node
address starting from the lowest dimension or the least
significant digit (LSD) and proceeding towards the highest one
or the most significant digit (MSD) until it becomes the same
as the destination address. Clearly, the utilization of links
is uneven in LSD-to-MSD path selection and balanced in that
given by algorithm AssignPaths. Note that the minimum possible
peak utilization changes as the input period, $\tau_{in}$,
changes because the message time-bounds change. We have,
therefore, plotted the improvement in peak utilization for
various values of the network load, which is defined as
$\tau_c / \tau_{in}$. Thus, when the TFG is pipelined with the
maximum input arrival rate, the network load is maximum. We
observe that the utilization achieved by AssignPaths is at
least as low as that of the LSD-to-MSD routing.
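The LSD-to-MSD path selection described above can be written down in a few lines of Python. The sketch assumes that a node address is a tuple of digits with index 0 as the least significant digit and that, as in a generalized hypercube, two nodes whose addresses differ in exactly one digit are directly connected. The function name and the example addresses are hypothetical.

```python
def lsd_to_msd_path(src, dst):
    """Return the node sequence obtained by correcting digits from LSD to MSD.

    src, dst: tuples of digits, index 0 = least significant digit.
    """
    path, current = [tuple(src)], list(src)
    for dim in range(len(src)):
        if current[dim] != dst[dim]:
            current[dim] = dst[dim]       # one hop corrects one digit
            path.append(tuple(current))
    return path

# In a 4 x 4 x 4 GHC, addresses have three base-4 digits.
print(lsd_to_msd_path((0, 1, 2), (3, 1, 0)))
# [(0, 1, 2), (3, 1, 2), (3, 1, 0)]
```

Because every message between the same pair of nodes follows the same digit order, this rule yields exactly one path per pair and cannot spread traffic over the equivalent alternatives.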
5.2 Message-interval Allocation
Message-interval allocation can be performed on dis-
joint subsets of TFG messages that are identified by
the following definitions.
Definition 5.3 Two messages $M_p$ and $M_q$ are related if and
only if either $\exists$ a link $L_i$ and an interval $A_k$ such
that $b_{pi} = b_{qi} = 1$ and $a_{pk} = a_{qk} = 1$, or $\exists$ a
message $M_r$ which is related to both $M_p$ and $M_q$. ■
Definition 5.4 A set of messages, $S_R$, is maximal if and only
if $\forall M_i, M_j \in S_R$, $M_i$ and $M_j$ are related and
$\nexists\, M_k \notin S_R$ such that $M_k$ and $M_i$ are related. ■
The transitivity and closure of the relation between messages
defined above partitions $S_M$ into disjoint subsets. Let
$S_R^i$ denote the $i$th maximal subset. The maximality of each
subset ensures that $S_R^i \cap S_R^j = \emptyset$ for $i \ne j$.
Such subsets of $S_M$ can be obtained by performing row
operations on an identity matrix and the matrices $A$ and $B$.
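The maximal subsets are the equivalence classes of the relation between messages, so they can also be found as connected components. The Python sketch below uses a union-find structure in place of the row operations mentioned above; this is a substitute technique chosen for brevity, not the method of the paper. The example matrices and the function name are hypothetical.

```python
def maximal_subsets(A, B):
    """A: Nm x K activity matrix; B: Nm x Nl path matrix.

    Returns the message indices grouped into maximal related subsets.
    """
    n = len(A)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for p in range(n):
        for q in range(p + 1, n):
            share_link = any(x and y for x, y in zip(B[p], B[q]))
            share_interval = any(x and y for x, y in zip(A[p], A[q]))
            if share_link and share_interval:   # directly related messages
                parent[find(p)] = find(q)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

A = [[1, 0], [1, 1], [0, 1], [1, 0]]
B = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
print(maximal_subsets(A, B))   # [[0, 1, 2], [3]]
```

Messages 0 and 2 share neither a link nor an interval, yet they fall in the same subset because both are related to message 1.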
Consider a maximal subset $S_R$ with $Q_m = \{n_1, n_2, \ldots\}$
as the indices of messages in it. Also, let
$Q_l = \{l_1, l_2, \ldots\}$ and $Q_k = \{k_1, k_2, \ldots\}$ be the
indices of the links and intervals used by messages in $S_R$.
Let $x_{hj}$ denote the time for which $M_{n_h}$ is transmitted in
$A_{k_j}$, i.e., in $[t_{k_j - 1}, t_{k_j}]$. A feasible allocation
of each message to the intervals in which it is active is
obtained by assigning values to $x_{hj}$ ($\ge 0$) such that
$$\sum_{k_j \in Q_k} a_{n_h k_j}\, x_{hj} = \frac{m_{n_h}}{B} \quad \forall n_h \in Q_m \qquad (3)$$
$$\sum_{n_h \in Q_m} a_{n_h k_j}\, b_{n_h l_i}\, x_{hj} \le t_{k_j} - t_{k_j - 1} \quad \forall l_i \in Q_l,\ k_j \in Q_k \qquad (4)$$
Constraints (3) require that the total allocation for each
message equals its duration at the given bandwidth. Constraints
(4) require that the total allocation of all messages to a link
in each interval does not exceed the interval length. This
allocation problem is analogous to scheduling periodic tasks on
multiple processors [LM81]. The similarity between the two
formulations is that the messages constitute periodic tasks and
links constitute processors. The difference is that a message
may simultaneously require several links, but a task must be
processed by exactly one processor at any time. If the set of
constraints (3) and (4) is feasible, a suitable
message-interval allocation can be obtained. The feasibility of
such systems can be checked by standard integer programming
techniques. The values of $x_{hj}$ given by a feasible solution
are then used to initialize an $N_m \times K$ message-interval
allocation matrix, $P = [p_{ik}]$, where $p_{ik}$ gives the time
for which $M_i$ is transmitted in $A_k$.
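Constraints (3) and (4) are easy to verify for a candidate allocation, even though finding one requires a solver. The Python sketch below only checks a given allocation matrix and does not solve the allocation problem. The matrices, interval endpoints, transmission times, tolerance, and the function name are hypothetical.

```python
def allocation_is_feasible(X, A, B, tx_time, points, tol=1e-9):
    """X[i][k]: time for which message i is transmitted in interval k."""
    n_msgs, K, n_links = len(A), len(points) - 1, len(B[0])
    for i in range(n_msgs):
        # Allocations must be non-negative and confined to active intervals.
        if any(X[i][k] < 0 or (X[i][k] > 0 and not A[i][k]) for k in range(K)):
            return False
        # Constraint (3): total allocation equals the transmission time.
        if abs(sum(X[i]) - tx_time[i]) > tol:
            return False
    for j in range(n_links):
        for k in range(K):
            # Constraint (4): allocations on a link fit in the interval.
            used = sum(X[i][k] for i in range(n_msgs) if B[i][j])
            if used > points[k + 1] - points[k] + tol:
                return False
    return True

points = [0, 5, 10]
A = [[1, 0], [1, 1]]          # message 0 is active only in the first interval
B = [[1, 0], [1, 1]]          # both messages use link 0
tx_time = [5, 4]

print(allocation_is_feasible([[5, 0], [0, 4]], A, B, tx_time, points))  # True
print(allocation_is_feasible([[5, 0], [2, 2]], A, B, tx_time, points))  # False
```

The second allocation fails because link 0 would have to carry 7 time units of traffic in an interval of length 5.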
5.3 Interval Scheduling
The outcome of message-interval allocation is information
about how much of each message is transmitted in an interval.
Explicit flow-control requires all the links assigned to a
message to be available simultaneously so that a clear path
from the source to the destination is established. Therefore,
interval scheduling must be performed for each interval used by
each maximal subset to construct $\Omega$. This problem is one of
preemptive scheduling of tasks on a set of identical processors
where a task may require simultaneous use of more than one
processor. The set of links used in an interval could be
considered as the set of processors and messages with non-zero
allocations could be considered as the tasks. An integer
programming formulation for this scheduling problem has been
described in [BDW86] and is adapted to this application as
described below.
Let $Q_m^k = \{n_1, n_2, \ldots\}$ be the indices of messages in a
maximal subset that have non-zero allocations in $A_k$. Let
$Q_l^k = \{i_1, i_2, \ldots\}$ be the indices of links used by
messages in $Q_m^k$. We first define a link feasible set of
messages.
Definition 5.5 A set of messages with indices
$Q_{lfs} = \{n_1, n_2, \ldots\}$ is link feasible if
$\nexists\, n_i, n_j \in Q_{lfs}$ such that $M_{n_i}$ and $M_{n_j}$ use
the same link. ■
Clearly, messages in a link feasible set can be transmitted
simultaneously. Let $N_{lfs}$ be the number of link feasible sets
possible for $Q_m^k$. Associate a variable $y_j$ with the $j$th
link feasible set representing the time for which all the
messages in it are transmitted simultaneously. Let $Q_i$ be the
set of indices of link feasible sets of messages in $Q_m^k$ which
contain message $M_i$. Consider the integer programming problem
of minimizing
$$\sum_{j=1}^{N_{lfs}} y_j \quad \text{so that} \quad \sum_{j \in Q_i} y_j = p_{ik} \quad \forall i \in Q_m^k$$
If $\sum_{j=1}^{N_{lfs}} y_j \le t_k - t_{k-1}$, then the interval is
schedulable; otherwise, the messages with non-zero allocations
in $A_k$ require more time than the interval length. Once again,
this problem may be solved as a standard integer programming
problem. The number of constraints is proportional to the
number of messages in the set but the number of variables is
$O(N_{lfs})$. Depending upon the particular path assignment for
the messages in $Q_m^k$, the number of variables may be
$O(2^{N_k})$ where $N_k$ is the number of messages in $Q_m^k$.
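The growth in the number of variables is easiest to see by enumerating the link feasible sets of a small example. The Python sketch below lists every subset of messages whose paths are pairwise link-disjoint. It is a brute-force enumeration intended only for illustration, and the message paths, link names, and function name are hypothetical.

```python
from itertools import combinations

def link_feasible_sets(paths):
    """paths: {message: set of links}. Returns all link feasible sets."""
    msgs = sorted(paths)
    feasible = []
    for size in range(1, len(msgs) + 1):
        for combo in combinations(msgs, size):
            # A set is link feasible if no two of its messages share a link.
            if all(paths[a].isdisjoint(paths[b])
                   for a, b in combinations(combo, 2)):
                feasible.append(combo)
    return feasible

paths = {"M1": {"a", "b"}, "M2": {"c"}, "M3": {"b", "d"}}
for s in link_feasible_sets(paths):
    print(s)
# ('M1',)
# ('M2',)
# ('M3',)
# ('M1', 'M2')
# ('M2', 'M3')
```

Each listed set would receive one variable in the formulation above. If all the paths were link-disjoint, every non-empty subset would be feasible, which gives the exponential worst case.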
5.4 Computation of Node Switching
Schedules
Solution of the interval scheduling problem for each interval
of each maximal set enables derivation of the $\omega_i$'s. A
typical scheduling command in $\omega_i$ contains the time of
execution and the sources of the data for the input buffers and
output channels. The sources are the input channels and output
buffers. The link feasible sets could be transmitted in any
sequence in an interval. Given a sequence, the starting time
for messages in a set is determined by adding the total time
for which sets before it are scheduled to the starting time of
the interval. This time is then used to derive the scheduling
commands. The maximum number of scheduling commands in an
$\omega_i$ is the number of packets that can be transmitted in
$[0, \tau_{in}]$. Synchronization of CPs in the path of a message
required by such routing can be achieved by periodic
synchronizing messages [Shu90].
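The derivation of starting times from a chosen sequence of link feasible sets amounts to a running sum, as the Python sketch below shows. It produces only (start time, set) pairs; translating them into per-node crossbar commands is not attempted here. The sequence, durations, and function name are hypothetical.

```python
def set_start_times(interval_start, sequence):
    """sequence: list of (link_feasible_set, duration) in transmission order.

    Returns the (start_time, set) pairs and the time at which the last set ends.
    """
    commands, t = [], interval_start
    for members, duration in sequence:
        if duration > 0:                 # sets with zero time are never sent
            commands.append((t, members))
            t += duration
    return commands, t

# An interval starting at time 20 with three candidate sets, one of them unused.
sequence = [(("M1", "M2"), 3), (("M3",), 0), (("M2", "M3"), 2)]
commands, end = set_start_times(20, sequence)
print(commands)   # [(20, ('M1', 'M2')), (23, ('M2', 'M3'))]
print(end)        # 25
```

The interval is schedulable with this sequence only if the returned end time does not exceed the end of the interval.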
6 Performance of Scheduled Routing
In order to compare the performance of SR with WR, we have
simulated the latter and computed the former for various tori
and GHCs. Our objective is to observe the OI due to WR for
various values of load on the network and check if $\Omega$ is
obtained for the same value of the input period. If it is, the
throughput corresponds to $\tau_{in}$ and the latency to the
critical path length obtained after assigning time bounds to
all the messages.
Normalized load is defined as the ratio of the minimum
possible input arrival period and the input arrival period,
i.e., $\tau_c / \tau_{in}$. Normalized throughput is defined to be
the ratio $\tau_{out} / \tau_{in}$ where $\tau_{out}$ is the output
generation interval. Normalized latency is the ratio
$\lambda / \Lambda$ where $\lambda$ is the latency of an invocation
and $\Lambda$ is the critical path length. If WR results in OI,
we indicate it by plotting an up-down spike for the
corresponding load value to indicate that the throughput and
the latency are not constant for all invocations at that input
period. The maximum (minimum) value of the upward (downward)
spike corresponds to the maximum (minimum) value of the output
generation interval when the simulation is observed over
several invocations. The middle value corresponds to the
average of the output generation interval values over the
invocations considered to obtain the maximum and minimum
values. The spikes in the latency plots are obtained similarly.
All tasks are assumed to take the same time; otherwise, the
throughput is determined by the longest task and the APs
processing the smaller tasks are underutilized. Thus, this
assumption does not affect the effective throughput. Processing
speeds of APs of the multicomputer have been selected in such a
way that $\tau_m / \tau_c = 1$ for $B = 64$ bytes/μsec and 0.5 for
$B = 128$ bytes/μsec. Thus, the first case represents a more
communication intensive application than the second case.
Twelve different values of the input period ($\tau_{in}$) are
selected between its minimum value of $\tau_c$ and $5\tau_c$. Very
large values of the input period are not interesting because
messages from different invocations do not contend with each
other.
Fig. 7 shows throughput and latency using the WR and SR
techniques, plotted on the same axes against the load for a
binary 6-cube with a link bandwidth of 64 bytes/μsec. The
spikes on the throughput and latency plots for WR indicate OI
for the corresponding input periods. Throughput spikes are
sharper than the latency spikes because the latency value is
much larger than the input period. It should be noted that
when the middle value of the spike exceeds unity, it indicates
that the latency under WR increases monotonically. Even for
the largest input period, WR shows OI. SR gives a feasible
schedule for all the load values for which a maximum
utilization of unity or less was obtained by AssignPaths. The
remaining load points give a utilization value greater than
unity, precluding any feasible schedule computation (refer to
Fig. 5).
When the bandwidth is increased to 128 bytes/μsec, feasible
path assignments and schedules are obtained for all load
points. Fig. 8 shows the results for a 4 x 4 x 4 GHC. Being a
topology with more links, it reaches the required utilization
value for more load points than a binary 6-cube at
B = 64 bytes/μsec. A feasible schedule is obtained for all
loads except 0.5 and 1.0. We note the appearance of OI at
B = 128 bytes/μsec and its removal by SR. Due to the smaller
number of alternative paths in tori, the lowest value of
maximum utilization reached is greater than unity for all load
points in both the tori at 64 bytes/μsec (refer to Fig. 6). At
128 bytes/μsec, better performance is obtained, as shown in
Figs. 9 and 10 for the 8 x 8 and 4 x 4 x 4 tori respectively.
At the higher bandwidth, the 8 x 8 torus reaches the required
utilization for all load points. Of these, no feasible
schedule is obtained for the three marked by arrows due to
failure of message-interval allocation. SR removes the OI at
load 0.3333. For the 4 x 4 x 4 torus, SR removes all instances
of OI as shown in Fig. 10 and enables operation at the highest
load while WR does not.
We have used a WR model in which a channel is
considered occupied if a message captures it. In a
stricter model, each channel will be multiplexed between
two virtual channels. As a result, the bandwidth
available to a message is halved and the instances of
OI are likely to increase.
7 Concluding Remarks
In SR, the a priori knowledge about the TFG is exploited to
eliminate the run-time overhead of message routing. Since all
the CP’s execute their schedules independently, this technique
is scalable to larger multicomputers if Ω can be computed. The
existence of Ω depends upon the TFG characteristics, the input
arrival period, and the effectiveness of task allocation and
path assignment. Since allocation determines the set of
alternative paths for each message, coupling it with path
assignment so as to set up less stringent constraints for SR
computation should be explored. Introduction of feedback
between the steps in Fig. 3 to make the scheduling problems
simpler as well as smaller for large networks deserves
investigation. To ensure that all CP’s in the path have set up
the required connections before message transmission begins, a
time interval equal to or greater than twice the maximum
difference between two clocks could be allowed to elapse
before starting transmission. Formulations of message-interval
allocation and interval scheduling should be modified to
account for these margins, and the tightness of CP
synchronization required should be studied. Finally, the
suitability of SR to cases where complete knowledge of the
application is not available should also be studied.
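The synchronization margin suggested above can be illustrated with a short sketch. It assumes that each CP's clock offset from a common reference is known; the offsets, the function names, and the (start, end) representation of a scheduled interval are hypothetical and not taken from the paper. The guard time is twice the largest difference between any two clocks, and a scheduled interval is trimmed by that amount before transmission starts.

```python
def guard_time(clock_offsets):
    """Twice the largest pairwise difference between CP clock offsets."""
    max_difference = max(clock_offsets) - min(clock_offsets)
    return 2 * max_difference


def usable_interval(start, end, guard):
    """Delay the start of a scheduled interval by the guard time.

    Returns the trimmed (start, end) pair, or None when the guard time
    consumes the whole interval.
    """
    begin = start + guard
    return (begin, end) if begin < end else None


# Hypothetical clock offsets (microseconds) of three CP's on one path.
offsets = [0.0, 0.25, -0.5]
guard = guard_time(offsets)
print(guard)
print(usable_interval(10.0, 20.0, guard))
print(usable_interval(10.0, 11.0, guard))
```

An interval that becomes `None` is one that the interval-scheduling formulation would have to lengthen or move once the margins are accounted for.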
SR achieves our objective of providing a constant throughput
where WR gives OI in most cases. In addition, it enables
operation of the multicomputer at the maximum possible
throughput even when WR does not provide a bounded turnaround
time. In general, OI may appear depending upon the presence of
conditions similar to the ones outlined in Section 3. SR helps
remove such undesirable performance. First, it enables
prediction of system performance at compile-time by deciding
whether the network meets the communication requirements.
Second, it provides a method for flow-control that guarantees
satisfaction of these requirements. It can be implemented
using simple hardware at each node and is scalable,
deadlock-free, and contention-free.
Figure 1: TFG for DARPA Benchmark (legible legend values: a = 192; b, d, f = 1536; h = 768; i = 384)
Figure 2: Communication processor architecture (labels: from adjacent nodes; n by n switch; input buffers for AP; output buffers for AP; to adjacent nodes)
Figure 3: Steps in scheduled routing (input: TFG, topology; output: flow-control schedule)
/* AssignPaths: to minimize U */
compute assignment initial randomly;
set best & current to initial;
AFLAG = false;  /* termination flag */
while not (AFLAG)
begin
    do  /* iterative improvement */
    begin
        IFLAG = false;  /* false when current */
                        /* cannot be improved */
        find L_i/(L_i, A_j) in current with peak U;
        find reroutable messages on L_i/(L_i, A_j);
        for each such message
            select a path with largest peak
            reduction/peak repositioning;
        if ∃ a reroute to reduce peak
            select the reroute with largest reduction;
        else select a reroute to reposition peak;
        if ∃ a reroute that changes peak(current)
            select it to update current;
            IFLAG = true;
    end
    while (IFLAG);
    if ((peak(current) < peak(best)) or
        (position(current) ≠ position(best)))
    begin
        set best to current;
        /* attempt to escape local minima */
        assign paths randomly to each message;
    end
    else AFLAG = true;
end
Figure 4: Path assignment heuristic
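A simplified, runnable sketch of the heuristic in Figure 4 is given below. It is not the authors' implementation. It assumes that each message has a list of alternative paths (tuples of link names) and a demand equal to the fraction of the input period for which it occupies a link, so that the utilization of a link is the sum of the demands routed over it. Only peak-reducing reroutes are considered; the peak-repositioning moves and the (link, interval) pairs of Figure 4 are omitted. All names and data are hypothetical.

```python
import random


def link_utilization(assignment, paths, demand):
    """Sum, per link, the demand of every message routed over it."""
    load = {}
    for msg, choice in assignment.items():
        for link in paths[msg][choice]:
            load[link] = load.get(link, 0.0) + demand[msg]
    return load


def peak(assignment, paths, demand):
    """Peak utilization U over all links."""
    return max(link_utilization(assignment, paths, demand).values())


def improve(assignment, paths, demand):
    """Reroute messages on the peak link for as long as the peak drops."""
    current = dict(assignment)
    while True:
        load = link_utilization(current, paths, demand)
        peak_link = max(load, key=load.get)
        best_move, best_peak = None, load[peak_link]
        for msg, choice in current.items():
            if peak_link not in paths[msg][choice]:
                continue  # only messages crossing the peak link qualify
            for alt in range(len(paths[msg])):
                if alt == choice:
                    continue
                trial = dict(current)
                trial[msg] = alt
                trial_peak = peak(trial, paths, demand)
                if trial_peak < best_peak:
                    best_move, best_peak = (msg, alt), trial_peak
        if best_move is None:
            return current  # no reroute reduces the peak
        current[best_move[0]] = best_move[1]


def assign_paths(paths, demand, seed=0):
    """Random restarts around improve(); stop when a restart does not help."""
    rng = random.Random(seed)

    def random_assignment():
        return {msg: rng.randrange(len(alts)) for msg, alts in paths.items()}

    best = improve(random_assignment(), paths, demand)
    while True:
        current = improve(random_assignment(), paths, demand)
        if peak(current, paths, demand) < peak(best, paths, demand):
            best = current
        else:
            return best


# Hypothetical instance: three messages, two of which have an alternative path.
paths = {
    "m1": [("A", "B"), ("C", "D")],
    "m2": [("A", "B"), ("E", "F")],
    "m3": [("A", "B")],
}
demand = {"m1": 0.5, "m2": 0.5, "m3": 0.5}
result = assign_paths(paths, demand)
print(result, peak(result, paths, demand))
```

In this small instance every starting assignment can be improved until only m3 remains on the shared path, so the peak utilization drops to 0.5. A peak above unity after the search would correspond to the load points for which no feasible schedule can be computed.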
Figure 5: U for GHC’s with AssignPaths (panels: binary 6-cube, B = 64 bytes/μsec; GHC (4,4,4), B = 64 bytes/μsec; curves: DVB, LSD to MSD and DVB, final; horizontal axis: normalized load)
Figure 6: U for tori with AssignPaths (legible panel: torus (4,4,4), B = 64 bytes/μsec; curves: DVB, LSD to MSD and DVB, final; horizontal axis: normalized load)
Figure 7: DVB on binary 6-cube (panels: B = 64 bytes/μsec, with the note U > 1.0 when load > 0.3636; B = 128 bytes/μsec; curves: throughput and latency for wh and sch; horizontal axis: normalized load)
Figure 8: DVB on 4 x 4 x 4 GHC (panels: GHC (4,4,4), B = 64 bytes/μsec; GHC (4,4,4), B = 128 bytes/μsec; curves: throughput and latency for wh and sch; horizontal axis: normalized load)
Figure 9: DVB on 8 x 8 torus (annotation: message-interval allocation fails; curves: throughput and latency for wh and sch; horizontal axis: normalized load)
Figure 10: DVB on 4 x 4 x 4 torus (curves: throughput and latency for wh and sch; horizontal axis: normalized load)