Cost-sensitive analysis of communication protocols


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
LABORATORY FOR COMPUTER SCIENCE

MIT/LCS/TM-453

AD-A237 356

COST-SENSITIVE ANALYSIS OF COMMUNICATION PROTOCOLS

Baruch Awerbuch    Alan Baratz    David Peleg

June 1991

Approved for public release; distribution unlimited.

545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139


REPORT DOCUMENTATION PAGE

1a. Report security classification: Unclassified
3. Distribution/availability of report: Approved for public release; distribution is unlimited.
4. Performing organization report number: MIT/LCS/TM-453
5. Monitoring organization report number: N00014-89-J-1988
6a. Name of performing organization: MIT Laboratory for Computer Science
6c. Address: 545 Technology Square, Cambridge, MA 02139
7a. Name of monitoring organization: Office of Naval Research/Dept. of Navy
7b. Address: Information Systems Program, Arlington, VA 22217
11. Title: Cost-Sensitive Analysis of Communication Protocols
12. Personal author(s): Baruch Awerbuch, Alan Baratz, David Peleg
14. Date of report: June 1991
19. Abstract: This paper introduces the notion of cost-sensitive communication complexity and exemplifies it on the following basic communication problems: computing a global function, network synchronization, clock synchronization, controlling protocols' worst-case execution, connected components, spanning tree, etc., constructing a minimum spanning tree, constructing a shortest path tree.
21. Abstract security classification: Unclassified

Cost-Sensitive Analysis of Communication Protocols

Baruch Awerbuch*    Alan Baratz†    David Peleg‡

June 13, 1991

Abstract

This paper introduces the notion of cost-sensitive communication complexity and exemplifies it on the following basic communication problems: computing a global function, network synchronization, clock synchronization, controlling protocols' worst-case execution, connected components, spanning tree, etc., constructing a minimum spanning tree, constructing a shortest path tree.


*Department of Mathematics and Laboratory for Computer Science, M.I.T., Cambridge, MA 02139. ARPA: [email protected]. Supported by Air Force Contract AFOSR-86-0078, ARO contract DAAL03-86-K-0171, NSF contract CCR-8611442, DARPA contract N00014-89-J-1988, and a special grant from IBM. Part of the work was done while visiting IBM T.J. Watson Research Center.

†IBM T.J. Watson Research Center, Yorktown Heights, NY 10598.

‡Department of Applied Mathematics and Computer Science, The Weizmann Institute, Rehovot 76100, Israel. BITNET: peleg@wisdom. Supported in part by an Allon Fellowship, by a Bantrell Fellowship and by a Haas Career Development Award. Part of the work was done while visiting MIT and IBM T.J. Watson Research Center.

1 Introduction

1.1 Motivation

Traffic load is one of the major factors affecting the behavior of a communication network.

This fact is well recognized, and is the reason why most models for communication networks

and most algorithms for routing, traffic analysis etc. model the network using a weight

function on the edges, capturing this factor. In this model, the weight of an edge reflects

the estimated delay for a message transmitted on this edge, and thus also the cost for

using this edge. The significance of the load factor has also motivated the intense study of

efficient methods for performing basic network tasks such as computing shortest paths and

constructing minimum weight spanning trees (with length / weight defined with respect to

the weight function).

However, in most of the previous work on distributed algorithms for these and other

tasks, the design and analysis of the algorithms themselves completely disregards this weight

function. That is, transmission over all the edges is assumed to be equally costly and

completed within the same time bound. Such assumptions are made even when the task

performed by the algorithm is directly related to the edge costs, and the algorithm has to be

executed over the same network, and thus suffer the same delays. This seems to contradict

the very purpose towards which the tasks are performed. It is sometimes argued that it is not

crucial to take the weights into account when considering such "network service" algorithms,

since these algorithms occupy only a thin slice of the network's bandwidth. Nonetheless, it

is clear that an algorithm that can do well in that respect is preferable to one that ignores

the issue.

This paper proposes an approach enabling us to take traffic loads into account in the

design of distributed algorithms. This issue is addressed by introducing cost-sensitive com-

plexity measures for analysis of distributed protocols. We consider weighted analogs for both

communication and time complexity. We then examine a host of basic network problems,

such as connectivity, computing global functions, network synchronization, controlling the

worst-case execution of protocols, and constructing minimum spanning trees and shortest

path trees. For each of these problems we seek to establish some lower bounds and propose

some efficient algorithms with respect to the new complexity measures.

We feel that the approach proposed in this paper may serve as a basis for a more accurate

account of the behavior of distributed algorithms in communication networks.

1.2 The model

We consider the standard model of (static) asynchronous communication networks. We consider a communication graph G = (V, E, w), where a weight w(e) is associated with each (undirected) edge of the network. We denote n = |V|, m = |E|. We also denote by W the maximal weight w(e) of a network edge, W = max_{(u,v)∈E} w(u, v). We make the assumption that W = poly(n), and thus log W = O(log n). For any subgraph G' = (V', E', w) of G, let w(G') denote the total weight of G', i.e., w(G') = Σ_{e∈E'} w(e). Let dist(u, v, G') be the weighted distance from u to v in G', i.e., the minimum of w(p) over all paths p from u to v in G', and let Path(u, v, G') denote some arbitrary path achieving this minimum. Let Diam(G') denote the diameter of G', i.e., max_{u,v∈V'} dist(u, v, G'). Given a tree T and two vertices x, y in it, denote by Path(x, y, T) the path from x to y in T.

Next let us define some basic graph notation. For a vertex v ∈ V, let

Rad(v, G) = max_{w∈V} dist(v, w, G).

Given a set of vertices S ⊆ V, let G(S) denote the subgraph induced by S in G. A cluster is a subset of vertices S ⊆ V such that G(S) is connected. The radius of a cluster S is denoted Rad(S) = min_{v∈S} Rad(v, G(S)). A cover is a collection of clusters S = {S_1, ..., S_m} such that ∪_i S_i = V. Given a collection of clusters S, let Rad(S) = max_i Rad(S_i). For every vertex v ∈ V, let deg_S(v) denote the degree of v in the hypergraph (V, S), i.e., the number of occurrences of v in clusters S_i ∈ S. The maximum degree of a cover S is defined as Δ(S) = max_{v∈V} deg_S(v).

Given two covers S = {S_1, ..., S_m} and T = {T_1, ..., T_k}, we say that T subsumes S if for every S_i ∈ S there exists a T_j ∈ T such that S_i ⊆ T_j.
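The cluster and cover definitions above can be made concrete in a few lines. The following sketch (illustrative code, not from the paper; the helper names are ours) computes the radius Rad(S) of a cluster and the maximum degree Δ(S) of a cover on a toy weighted graph:

```python
import heapq

def dijkstra(adj, src):
    """Weighted distances from src; adj maps v -> {u: w(v, u)}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float("inf")):
            continue
        for u, w in adj[v].items():
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(pq, (d + w, u))
    return dist

def induced(adj, S):
    """Adjacency of the subgraph G(S) induced by the vertex set S."""
    return {v: {u: w for u, w in adj[v].items() if u in S} for v in S}

def rad(S, adj):
    """Rad(S): min over centers v in S of the max weighted distance in G(S)."""
    g = induced(adj, S)
    return min(max(dijkstra(g, v).values()) for v in S)

def max_degree(cover):
    """Delta(S): max number of clusters any single vertex belongs to."""
    deg = {}
    for cluster in cover:
        for v in cluster:
            deg[v] = deg.get(v, 0) + 1
    return max(deg.values())

adj = {1: {2: 1, 3: 4}, 2: {1: 1, 3: 2}, 3: {1: 4, 2: 2}}
cover = [{1, 2}, {2, 3}, {1, 2, 3}]
print(rad({1, 2, 3}, adj))   # 2: vertex 2 is a center (dist 1 to v1, dist 2 to v3)
print(max_degree(cover))     # 3: vertex 2 appears in all three clusters
```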

In the sequel we make use of the following theorem of [AP91].

Theorem 1.1 [AP91] Given a graph G = (V, E), |V| = n, an initial cover S and an integer k ≥ 1, it is possible to construct a cover T that satisfies the following properties:

(1) T subsumes S,

(2) Rad(T) ≤ (2k − 1) · Rad(S), and

(3) Δ(T) = O(k · |S|^{1/k}).


1.3 The complexity measures

This paper introduces weighted complexity measures analogous to the traditional time and

communication measures. We define the cost of transmitting a message over an edge e

as w(e). The communication complexity of a protocol π, denoted c_π, is the sum of all

transmission costs of all messages sent during the execution of π. The time complexity of the

protocol π, denoted t_π, is the maximal physical time it takes π to complete its execution,

assuming that the delay on an edge e varies between 0 and w(e). The classical complexity

measures correspond to the case where w(e) = 1 for all e ∈ E.

Traditionally, communication protocols are evaluated in terms of E, V, D, which denote, respectively, the number of edges, the number of vertices, and the unweighted (hop-based) diameter of the network. It turns out that it is convenient to evaluate the weighted complexity of protocols using the "weighted analogs" of E, V, D, denoted below simply by E, V and D, which are defined as follows:

E = w(G) = Σ_{e∈E} w(e)

V = w(T), where T is an MST of G

D = Diam(G)

The analogy between these parameters and their unweighted counterparts is manifested

in the fact that E equals the total cost of transmitting a single message over all the edges of

the network, V is the minimal cost of reaching (or, disseminating a message to) all vertices,

and D is the maximal cost of transmitting a message between a pair of network nodes.

In the sequel we express the complexity of our algorithms in terms of E, V and D. This

gives results that are conveniently similar in appearance to the results of the unweighted

case, as follows from the statement of results in the following subsection.
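As an illustration of the three parameters, the following sketch (our own code, not part of the paper) computes E, V and D for a small weighted graph, using Prim's algorithm for the MST weight and Dijkstra's algorithm for the diameter:

```python
import heapq

def dijkstra(adj, src):
    """Weighted shortest-path distances from src."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist[v]:
            continue
        for u, w in adj[v].items():
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(pq, (d + w, u))
    return dist

def total_weight(adj):
    """E = w(G): each undirected edge counted once."""
    return sum(w for v in adj for w in adj[v].values()) // 2

def mst_weight(adj):
    """V = weight of a minimum spanning tree (Prim's algorithm)."""
    start = next(iter(adj))
    seen = {start}
    pq = [(w, u) for u, w in adj[start].items()]
    heapq.heapify(pq)
    total = 0
    while pq and len(seen) < len(adj):
        w, v = heapq.heappop(pq)
        if v in seen:
            continue
        seen.add(v)
        total += w
        for u, wu in adj[v].items():
            if u not in seen:
                heapq.heappush(pq, (wu, u))
    return total

def diameter(adj):
    """D = max over all pairs of the weighted distance."""
    return max(max(dijkstra(adj, v).values()) for v in adj)

adj = {  # a 4-cycle of unit edges plus one heavy chord
    'a': {'b': 1, 'd': 1, 'c': 10},
    'b': {'a': 1, 'c': 1},
    'c': {'b': 1, 'd': 1, 'a': 10},
    'd': {'a': 1, 'c': 1},
}
print(total_weight(adj))  # E = 14
print(mst_weight(adj))    # V = 3 (the three unit edges of a spanning path)
print(diameter(adj))      # D = 2 (the heavy chord is never on a shortest path)
```

Note how D ≤ V ≤ E here, matching the interpretation above: reaching all vertices can only cost more than reaching the farthest one, and using every edge costs more still.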

1.4 Problems and results

1.4.1 Global function computation

The problem: We are concerned here with computing global functions in a network. We

assume that the structure of the network is known to all the vertices (including the edge

weights). The only unknowns are the values of the n arguments of the function, which are

initially stored at different vertices of the network, one at each vertex. The outputs must

be produced at all the vertices.

We restrict ourselves to the family of functions called symmetric compact in [GS86].


Global function computation    Communication    Time
Upper bound                    O(V)             O(D)
Lower bound                    Ω(V)             Ω(D)

Figure 1: Lower and upper bounds for global function computation.

The functions f_n : X^n → X in this family are symmetric (i.e., any two arguments can be switched) and compact, in the sense that the contribution of any subset of arguments can be represented in "compact form" by a string of size log₂ |X|. The latter condition is formalized by assuming that there exists a function g : X² → X such that for any k < n,

f_n(x_1, x_2, ..., x_n) = g(f_k(x_1, x_2, ..., x_k), f_{n−k}(x_{k+1}, x_{k+2}, ..., x_n)).

Computing such functions is quite a basic task in the area of network protocols. Many

functions belong to this family, e.g. maximum, sum, basic boolean functions (XOR, AND,

OR). Many other tasks, e.g. broadcasting a message from a given node to the rest of the

network, termination detection, global synchronization, etc., can be represented as computing

a symmetric compact function. A similar class of functions is considered in [ALSY88].
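For instance, with f_n = max the combining function g is again max, so partial results over disjoint argument sets can be merged in any order; this is exactly what a convergecast over a spanning tree does. A small illustrative sketch (our own code, with hypothetical helper names):

```python
# Evaluating a symmetric compact function piecewise: each subtree's arguments
# are collapsed into a single "compact" partial result, and partial results
# are merged with the combining function g on the way up to the root.

def eval_on_tree(children, values, v, g, f1):
    """Combine the inputs of the subtree rooted at v bottom-up."""
    acc = f1(values[v])                    # f_1 applied to v's own input
    for c in children.get(v, []):
        acc = g(acc, eval_on_tree(children, values, c, g, f1))
    return acc

# A spanning tree rooted at vertex 0; one input argument per vertex.
children = {0: [1, 2], 1: [3, 4]}
values = {0: 7, 1: 3, 2: 9, 3: 5, 4: 1}

print(eval_on_tree(children, values, 0, max, lambda x: x))                 # 9
print(eval_on_tree(children, values, 0, lambda a, b: a + b, lambda x: x))  # 25
```

Since g is associative-by-symmetry on the partial results, the grouping imposed by the tree does not affect the answer, which is why any spanning tree works and only its weight and depth matter.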

The results: We show that the computation of global functions requires Θ(V) messages

and Θ(D) time.

The upper bound is derived as follows. Define a spanning tree as a shallow-light tree (SLT)

if its diameter is O(D) and its weight is O(V). We then show that SLT trees are effectively

constructible, which implies that computing the value of our global function can be performed

(optimally) with O(V) messages and O(D) time.

We are also concerned with efficient distributed constructions of SLT trees, or, in short,

SLT algorithms. We present a specific SLT algorithm that requires O(V · n²) communication

and O(D · n²) time.

1.4.2 Clock Synchronization

Problem: The purpose of clock synchronization is to generate at each node a sequence

of pulses, such that pulse p at a node is generated after (in the "causal" sense [Lam78]) all

neighbors generate pulse p − 1.

As argued by Even and Rajsbaum [ER90], the relevant complexity measure here is the "pulse delay", which is the maximal time delay between two successive pulses at a node.


Let us denote d = max_{(u,v)∈E} dist(u, v, G), i.e., d is the largest distance between neighbors in the network. Clearly d ≤ W, and the problem is interesting when d ≪ W. A lower bound of Ω(d) and an upper bound of O(W) are derived in [ER90]. (It is worth pointing out that the main emphasis of [ER90] is on a somewhat different "directed" version of this problem.)

Results: In this paper, we show that one can achieve a pulse delay of O(d · log² n), i.e., leave a gap of log² n between the lower and upper bounds. This result relies heavily on a number of existing techniques, like the "Network Partition" of [AP91] and the "Synchronizer γ" of

[Awe85a].

1.4.3 Network Synchronization

The problem: Asynchronous algorithms are in many cases substantially inferior in terms

of their complexity to corresponding synchronous algorithms, and their design and analysis

are more complicated. This motivates the development of a general simulation technique,

known as the synchronizer, that allows users to write their algorithms as if they are run

in a synchronous network. Implicitly, such techniques were proposed already in [Jaf80] and

[Cal82]. The first explicit statement of the problem was given in [Awe85a], and better

constructions for various cases were given in [PU89, AP90a]. Our goal is to extend the

concept of the synchronizer to the weighted case and provide an appropriate construction.

On a conceptual level, the synchronizer (as well as the controller, described in following sections) is a protocol transformer, transforming a protocol π into a protocol φ that is equivalent to π in some sense but enjoys some additional desirable properties. Recall that c_π and t_π denote the communication and time complexity of the protocol π, and similarly for c_φ and t_φ. Our purpose is to guarantee that the transformation keeps c_φ and t_φ small compared to c_π and t_π.

The synchronizer can be viewed as a way to remove variations from link delays in an asynchronous network. In the "unweighted" case, this means that we want to "force" all link delays to be exactly 1. In the "weighted" case, the most natural and most useful generalization of this concept is to force the delay on each link e to be exactly w(e). In a sense, the synchronizer enables us to simulate a "weighted" synchronous network G(V, E, w), with each link e having a delay of exactly w(e), by a "weighted" asynchronous network G(V, E, w). Such simulations may be useful for various applications for which the absence of variations in edge delays significantly simplifies the task at hand, e.g., shortest paths [Awe89], constructing routing tables [ABLP89], and others. However, in addition to simplifying protocol design and analysis, synchronizers actually lead to complexity improvements for concrete algorithms.


For example, the algorithm SPT_synch derived via a synchronizer (presented in Subsection

9.1) is the best known shortest path algorithm for certain values of V, D, E.

We define the amortized costs of a synchronizer ν (i.e., the overhead per pulse) in communication and time as follows:

C(ν) = (c_φ − c_π) / t_π,    T(ν) = t_φ / t_π,

where t_π, measured in pulses, is the running time of the synchronous protocol π.

At first sight, the clock synchronization problem from Subsection 1.4.2 seems to resem-

ble the problem of simulating an "unweighted" synchronous network G'(V, E) (with all link

delays being exactly 1) by a "weighted" asynchronous network G(V, E, w). The main differ-

ence is in the fact that the only goal of the network synchronizer is to simulate a particular

protocol, whereas the purpose of the clock synchronizer is to generate pulses. In general, it

would be ineffective to use clock synchronizers for network synchronization, and vice versa.

Even though the methods that we use to handle both problems have certain techniques in

common, the differences are quite substantial.

The results: We construct a synchronizer γ_w, which is an analog of synchronizer γ of

[Awe85a], such that for any fixed parameter k,

C(γ_w) = O(k · n · log n)

T(γ_w) = O(log_k n · log n)

1.4.4 Controllers

Problem: The controller [AAPS87] is a protocol transformer transforming a protocol π into a protocol φ that is equivalent to π in terms of its input-output relation on a static network, but is more "robust" than π in the sense that it has "reasonable" complexity even if it operates on "wrong" data.

Results: In the unweighted case, [AAPS87] presents a controller guaranteeing t_φ = c_φ = O(c_π · log² c_π). We show that the same bounds hold for the weighted case as well.

1.4.5 Connected components, spanning tree

The problems: The problems considered here are finding connected components and constructing a (not necessarily minimum) spanning tree [Seg83, AGPV89]. These problems are


Connectivity

Algorithm      Communication       Time
DFS            O(E)                O(E)
CON_flood      O(E)                O(D)
CON_hybrid     O(min{E, n·V})      O(min{E, n·V})
Lower bound    Ω(min{E, n·V})      Ω(D)

Figure 2: Our Connectivity algorithms.

equivalent to each other.

The results: We show that performing any of the above tasks requires Θ(min{E, n·V}) communication by providing matching upper and lower bounds. To be more precise, we prove that

1. For every distributed connectivity algorithm A and for any n there exists a family of n-vertex graphs G on which A requires communication complexity Ω(n·V) and a family of n-vertex graphs G on which A requires communication complexity Ω(E).

2. There is a distributed connectivity algorithm with communication complexity O(min{E, n·V}) on any graph G.

1.4.6 Constructing minimum spanning trees

Problem: The minimum spanning tree (MST) of the graph G is a tree of minimum weight

spanning G.

Results: We develop a number of MST algorithms, based on modifications of the algo-

rithms of [GHS83, Awe87].

1. An algorithm with communication complexity O(min{E + V · log n, n · V}).

2. An algorithm with communication complexity O(E · log n · log V) and time complexity O(D · n · log n · log V).


Minimum Spanning Trees (MST)

Algorithm      Communication              Time
MST_GHS        O(E + V·log n)             O(E + V·log n)
MST_centr      O(n·V)                     O(n·Diam(MST))
MST_fast       O(E·log n·log V)           O(D·n·log n·log V)
MST_hybrid     O(min{E, n·V·log n})       O(min{E, n·D·log n})
Lower bound    Ω(min{E, n·V})             Ω(D)

Figure 3: Our MST algorithms.

Shortest Path Trees (SPT)

Algorithm      Communication                      Time
SPT_centr      O(w(SPT)·n) = O(n²·V)              O(D·n)
SPT_recur      O(E^{1+ε})                         O(D^{1+ε})
SPT_synch      O(E + D·kn·log n)                  O(D·log_k n·log n)
SPT_hybrid     O(min{E + D·kn·log n, E^{1+ε}})    O(D^{1+ε})
Lower bound    Ω(min{E, n·V})                     Ω(D)

Figure 4: Our SPT algorithms.

1.4.7 Constructing shortest path trees

Problem: The shortest path tree (SPT) of the graph G with respect to a source vertex s ∈ V is a tree defined by the collection of shortest paths from s to all other vertices in G.

Results: We develop a number of SPT algorithms:

1. An algorithm with communication complexity O(E^{1+ε}) and time complexity O(D^{1+ε}). This is analogous to the result of [Awe89], which achieves the same result for the unweighted case.

2. An algorithm with communication complexity O(E + D · kn · log n) and time complexity O(D · log_k n · log n).

1.5 Structure of this paper

The paper proceeds as follows. Section 2 gives tight upper and lower bounds on the computation of global functions. Section 3 contains clock synchronization algorithms. In Section 4, we give upper and lower bounds for network synchronizers. Section 5 deals with controller algorithms. Section 6 discusses basic algorithmic techniques for network problems such as broadcast, depth first search and construction of minimum spanning trees and shortest path trees. Section 7 discusses the problems of broadcast and constructing connected components and spanning trees. Finally, Sections 8 and 9 describe efficient algorithms for constructing minimum spanning trees and shortest path trees, respectively.

2 Optimal computation of global functions

2.1 The lower bound

Theorem 2.1 The computation of global symmetric compact functions requires Ω(V) communication and Ω(D) time.

Proof: Suppose that the value of the function has been computed at the vertex v. Since the value of a global function depends on the value of all of its arguments, there must be some information flow from each of the vertices to v. Thus the subgraph G'(V, E'), defined by the set of edges E' traversed by messages of the protocol, must contain a path from v to any other vertex in V, i.e., it must contain some spanning tree of G.

Observe that the distance dist(u, v, G') from v to any other vertex u ∈ V is a lower bound on the time complexity of the protocol. Picking a pair of vertices u, v realizing D (i.e., maximizing the distance dist(u, v, G)) and noting that dist(u, v, G') ≥ dist(u, v, G) = D, we get that D is a lower bound on the time complexity of the protocol.

Furthermore, the total weight of the edges of G', w(G'), is a lower bound on the communication complexity of the computation. Now, since G' contains a spanning tree of G, its total weight satisfies w(G') ≥ V. Thus V lower bounds the communication complexity of the protocol. ∎

2.2 The upper bound

It is easy to see that given a spanning tree T for the network, a global function can be computed with communication complexity w(T) and time complexity Diam(T). Clearly, any shortest path tree T_S has small depth, namely Diam(T_S) = O(D), but its weight may be as big as w(T_S) = Ω(n · V) [BKJ83]. Analogously, any minimum spanning tree T_M has small weight, namely w(T_M) = V; but its depth may be as high as Diam(T_M) = Ω(n · D)


[BKJ83]. Thus in [BKJ83, Jaf85] it is advocated to approach such problems by attempting to construct a tree approximating both a shortest-path tree and a minimum-weight spanning tree.

Recall that a spanning tree is a shallow-light tree (SLT) if its diameter is O(D) and its weight is O(V). Such trees minimize (simultaneously) both weight and depth; the existence of such a tree would imply that in any graph, one can compute global functions with communication complexity O(V) and O(D) time. However, it is not clear that such trees exist. In the next subsection we establish

Theorem 2.2 Every graph has a shallow-light spanning tree.

Corollary 2.3 The computation of global symmetric compact functions can be performed with communication complexity O(V) and O(D) time.

The shallow-light tree algorithm

We next provide an algorithm (hereafter referred to as the SLT algorithm) for constructing an SLT for an arbitrary graph, thus proving Theorem 2.2.

1. Construct an MST T_M and an SPT T_S for G, rooted at an arbitrary vertex v_0.

2. Traverse T_M in a depth-first search (DFS) fashion, starting from v_0. We think of the DFS as carried out by a "token", representing the algorithm's center of activity. Observe that in the tour taken by this token through the tree according to the specifications of the DFS, each tree edge is traversed exactly twice. Define the "mileage" of the DFS token at a given time to be the number of steps (forward and backward tree edge traversals) up to this time. Denote by v(i) (0 ≤ i ≤ 2(n − 1)) the location of the DFS token at the time its mileage is exactly i. For example, v(0) = v(2n − 2) = v_0, where v_0 is the source of the DFS.

3. Construct the "line" version L of T_M, which is a (weighted) path graph containing the vertices 0, 1, ..., 2n − 2. A vertex i on the path corresponds to v(i). We assign each edge (i, i + 1) on the line L the weight of the corresponding edge (v(i), v(i + 1)) in the graph G. Observe that neighboring vertices on the line L correspond to neighbors in G. Observe also that the total weight of the line is at most twice the total weight of the MST T_M, i.e., w(L) ≤ 2V.

4. Fix a parameter q > 0. Construct "break points" B_i on L by scanning it from left to right according to the following rules.

Construct a minimum spanning tree T_M for G.

Construct a shortest path tree T_S for G, rooted at v_0.

Construct the line L based on T_M as described above.

Assign each edge (i, i + 1) of L the weight of the corresponding edge (v(i), v(i + 1)) in G.

E' ← T_M; X ← 0; Y ← 0

repeat

    repeat Y ← Y + 1

    until dist(X, Y, L) > q · dist(v(X), v(Y), T_S) or Y = 2n − 2

    E' ← E' ∪ Path(v(X), v(Y), T_S); X ← Y

until Y = 2n − 2

Construct a shortest path tree T in G' = (V, E')

output T

Figure 5: The SLT algorithm

(a) Break-point B_1 is vertex 0 on the line L.

(b) Break-point B_{i+1} is the first point to the right of B_i such that

dist(B_i, B_{i+1}, L) > q · dist(v(B_i), v(B_{i+1}), T_S),

meaning that the distance from B_i to B_{i+1} in T_M exceeds that in T_S by a factor of at least q.

5. Create a subgraph G' of G by taking T_M and adding Path(v(B_i), v(B_{i+1}), T_S) for all break-points B_i, i ≥ 1.

6. Construct a shortest path tree T rooted at v_0 in the resulting graph G'.

7. Output T.

The algorithm is presented formally in Figure 5. In the algorithm, T denotes the set of

edges selected to the shallow-light tree and X, Y are pointers to nodes on the line L. An

example for a possible execution of the algorithm is given in Figure 6.
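The steps above can also be sketched in code. The following is an illustrative single-process rendition (not the distributed protocol; the helper names and the naive LCA walk are our own): build T_M and T_S, walk the Euler tour of T_M, place a breakpoint whenever the tour distance since the last breakpoint exceeds q times the T_S distance, add the corresponding T_S paths, and return an SPT of the augmented subgraph G'.

```python
import heapq

def dijkstra(adj, src):
    """Weighted shortest paths; returns (distance, parent) maps."""
    dist, parent = {src: 0}, {src: None}
    pq = [(0, src)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist[v]:
            continue
        for u, w in adj[v].items():
            if d + w < dist.get(u, float("inf")):
                dist[u], parent[u] = d + w, v
                heapq.heappush(pq, (d + w, u))
    return dist, parent

def mst_edges(adj):
    """Edges of a minimum spanning tree (Prim's algorithm)."""
    start = next(iter(adj))
    seen, edges = {start}, []
    pq = [(w, start, u) for u, w in adj[start].items()]
    heapq.heapify(pq)
    while pq:
        w, p, v = heapq.heappop(pq)
        if v in seen:
            continue
        seen.add(v)
        edges.append((p, v, w))
        for u, wu in adj[v].items():
            if u not in seen:
                heapq.heappush(pq, (wu, v, u))
    return edges

def tree_dist(parent, depth, u, v):
    """Weighted distance between u and v in a rooted tree (naive LCA walk)."""
    anc, x = set(), u
    while x is not None:
        anc.add(x)
        x = parent[x]
    x = v
    while x not in anc:
        x = parent[x]
    return depth[u] + depth[v] - 2 * depth[x]

def slt(adj, root, q=2.0):
    sdist, sparent = dijkstra(adj, root)            # T_S
    tadj = {v: {} for v in adj}                     # T_M as adjacency
    for p, v, w in mst_edges(adj):
        tadj[p][v] = w
        tadj[v][p] = w
    tour = [root]                                   # Euler tour v(0..2n-2)
    def dfs(v, par):
        for u in tadj[v]:
            if u != par:
                tour.append(u)
                dfs(u, v)
                tour.append(v)
    dfs(root, None)
    sub = {v: dict(tadj[v]) for v in adj}           # G' starts as T_M
    def add_spt_path(u, v):                         # add the T_S path u..v
        anc, x = set(), u
        while x is not None:
            anc.add(x)
            x = sparent[x]
        hops, x = [], v
        while x not in anc:
            hops.append(x)
            x = sparent[x]
        lca, x = x, u
        while x != lca:
            hops.append(x)
            x = sparent[x]
        for y in hops:
            p = sparent[y]
            sub[y][p] = sub[p][y] = adj[y][p]
    last, line = 0, 0.0                             # breakpoint scan of L
    for i in range(1, len(tour)):
        line += tadj[tour[i - 1]][tour[i]]
        if line > q * tree_dist(sparent, sdist, tour[last], tour[i]):
            add_spt_path(tour[last], tour[i])
            last, line = i, 0.0
    return dijkstra(sub, root)                      # final SPT of G'

# Example: an 8-cycle of unit edges plus one heavy chord.
g = {i: {} for i in range(8)}
for i in range(8):
    g[i][(i + 1) % 8] = g[(i + 1) % 8][i] = 1
g[0][4] = g[4][0] = 5
tdist, tparent = slt(g, 0, q=2.0)
print(max(tdist.values()))  # depth of the SLT: 4 = Diam(G), well within (q+1)D
print(sum(g[v][p] for v, p in tparent.items() if p is not None))  # weight 7 <= (1+2/q)V
```

On this example the scan adds the single missing ring edge back to the MST, and the resulting tree simultaneously meets the depth bound of Lemma 2.5 and the weight bound of Lemma 2.4.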

2.3 Analysis

Lemma 2.4 The tree T constructed by the algorithm satisfies w(T) ≤ (1 + 2/q) · V.


Figure 6: An example run of the SLT algorithm

Proof: The tree T is a subgraph of G', created by adding the paths Path(v(B_i), v(B_{i+1}), T_S), for i ≥ 1, to T_M. Therefore

w(T) ≤ w(G') = w(T_M) + Σ_{i≥1} w(Path(v(B_i), v(B_{i+1}), T_S)).

But by the choice of the breakpoints B_i,

w(Path(v(B_i), v(B_{i+1}), T_S)) = dist(v(B_i), v(B_{i+1}), T_S) < (1/q) · dist(B_i, B_{i+1}, L),

hence

Σ_{i≥1} w(Path(v(B_i), v(B_{i+1}), T_S)) < (1/q) · Σ_{i≥1} dist(B_i, B_{i+1}, L) ≤ (1/q) · w(L) ≤ (2/q) · V,

and thus w(T) ≤ (1 + 2/q) · V. ∎

Lemma 2.5 The tree T constructed by the algorithm satisfies Diam(T) ≤ (q + 1) · D.

Proof: Consider an arbitrary vertex x ∈ V. We need to show that its depth in T is at most (q + 1)D. For that it suffices to bound dist(v_0, x, G'). Let j denote the point corresponding to x on the line L. Suppose that B_ℓ ≤ j < B_{ℓ+1}, i.e., j occurs on the line L between B_ℓ and B_{ℓ+1} for some ℓ. Since Path(v(B_i), v(B_{i+1}), T_S) is included in G' for every 1 ≤ i ≤ ℓ − 1, it follows that the entire path Path(v(B_1), v(B_ℓ), T_S) connecting the nodes corresponding to B_1 and B_ℓ in T_S is included in G', and hence

dist(v_0, v(B_ℓ), G') ≤ D.

If j = B_ℓ then we are done. Otherwise, by the choice of B_{ℓ+1},

dist(B_ℓ, j, L) ≤ q · dist(v(B_ℓ), x, T_S).

Put together, we get that dist(v_0, x, G') ≤ (q + 1)D. ∎

Corollary 2.6 The tree T constructed by the algorithm is an SLT.


2.4 Distributed construction of shallow-light trees

Theorem 2.7 There is a distributed algorithm for constructing an SLT requiring O(V · n²) communication and O(D · n²) time.

Proof: By Subsection 6.3, the MST T_M can be constructed using Algorithm MST_centr with O(n · V) communication and O(n² · D) time.

Now, the rest involves stretching the MST into a line. By Fact 6.3, the total weight of the MST is at most n − 1 times the diameter, or, V ≤ (n − 1)D. Thus the time and communication of the main body of the SLT algorithm are both O(n² · D).

Finally, we need to compute one more SPT in order to get the final tree T out of the subgraph G'. This is done using Algorithm SPT_centr of Subsection 6.4, and costs us additional O(D · n) time and O(n² · V) communication.

Overall, the algorithm invests O(D · n²) time and O(V · n²) communication. ∎

3 Clock synchronization

In this section we describe three methods of clock synchronization, called synchronizers α*, β* and γ*. These are modifications of synchronizers α, β and γ of [Awe85a].

3.1 Clock synchronizer α*

As pointed out in [ER90], the most natural approach to clock synchronization is to use the following synchronization mechanism, called synchronizer α*.

Pulse generation: Whenever a node generates pulse p, it sends messages to all neighbors, and when it receives messages of pulse p from all neighbors, it generates pulse p + 1.

This method clearly requires time proportional to the highest edge weight, namely O(W). Our goal is to approach the lower bound, which is Ω(d) (recall that d is the largest distance between neighbors).
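Under synchronizer α*, if every message on edge (u, v) takes exactly w(u, v) time, the pulse times obey the recurrence t_{p+1}(v) = max_{u ∈ N(v)} (t_p(u) + w(u, v)). A toy simulation (our own sketch, not from the paper) shows how a single heavy edge throttles every pulse:

```python
# Pulse times under synchronizer alpha*, assuming each message on edge (u, v)
# takes exactly w(u, v) time. Pulse p+1 fires at v once pulse-p messages from
# all of v's neighbors have arrived.

def alpha_star_times(adj, pulses):
    t = {v: 0.0 for v in adj}      # every node fires pulse 0 at time 0
    hist = [dict(t)]
    for _ in range(pulses):
        # each comprehension reads the previous pulse's times
        t = {v: max(t[u] + w for u, w in adj[v].items()) for v in adj}
        hist.append(dict(t))
    return hist

# A path a-b of weight 1 and a heavy edge a-c of weight 9 (so W = 9).
adj = {'a': {'b': 1, 'c': 9}, 'b': {'a': 1}, 'c': {'a': 9}}
hist = alpha_star_times(adj, 3)
print(hist[3]['c'])  # 27.0: three pulses, each costing the full W = 9
print(hist[3]['b'])  # 19.0: even the light node is throttled by the heavy edge
```

Every node's pulse rate degrades to roughly one pulse per W time units, which is exactly the O(W) behavior the text attributes to α*.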

The naive way to improve the delay is to construct a shortest path Path(u, v) for every (u, v) ∈ E and to communicate with each neighbor over such a path. The problem with this method is that a particular edge may belong to many paths (up to E of them), and thus the resulting congestion will slow down the communication time by the corresponding factor (up to E).


3.2 Clock synchronizer β*

In order to minimize congestion, we may try the following method, called synchronizer β*.

Preprocessing: We construct a spanning tree T of the network, and select a "leader" to

be the root of this tree.

Pulse generation: Information about the completion of the current pulse is gathered in

the tree by means of a communication pattern referred to as convergecast, which is started

at the leaves of the tree and terminates at the root. Namely, whenever a node learns that it

is done with this pulse and all its descendants in the tree are done with it as well, it reports

this fact to its parent. Thus within finite time after the execution of the pulse, the leader

eventually learns that all the nodes in the network are done. At that time it broadcasts a

message along the tree, notifying all the nodes that they may generate a new pulse.

The time complexity of Synchronizer β* is Ω(D), because the entire convergecast and

broadcast process is performed along a spanning tree, whose depth is at least the diameter

of the network.
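The per-pulse cost of β* can be sketched as follows (illustrative code, not from the paper): one convergecast up the tree followed by one broadcast down it, i.e., roughly twice the weighted depth of the tree.

```python
# Per-pulse delay of synchronizer beta* on a given rooted spanning tree:
# reports converge from the leaves to the root, then the release message is
# broadcast back down, so one pulse costs about twice the weighted depth.

def depth(tree, root, w):
    """Max weighted root-to-leaf distance; tree maps v -> list of children."""
    kids = tree.get(root, [])
    return max((w[(root, c)] + depth(tree, c, w) for c in kids), default=0)

def beta_star_pulse_delay(tree, root, w):
    d = depth(tree, root, w)
    return 2 * d        # convergecast up + broadcast down

tree = {0: [1, 2], 1: [3]}
w = {(0, 1): 4, (0, 2): 1, (1, 3): 2}
print(beta_star_pulse_delay(tree, 0, w))  # 12: the tree has weighted depth 6
```

Since any spanning tree's depth is at least proportional to the network diameter, this round trip is what produces the Ω(D) pulse delay noted above.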

3.3 Clock synchronizer γ*

Our final synchronizer, called synchronizer γ*, combines synchronizer γ of [Awe85a] with

the network partitions of [AP91, AP90b].

Definition 3.1 Given an n-vertex weighted graph G(V,E,w), a tree edge-cover for G is a

collection M of trees, such that

1. every edge of G is shared by at most O(log n) trees of M,

2. the depth of each tree in M is at most O(d · log n), and

3. for each edge, there exists at least one tree containing both endpoints.

Lemma 3.2 For every n-vertex weighted graph G(V, E, w), it is possible to construct a tree edge-cover.

Proof: The desired collection of trees can be constructed as follows. Apply Thm. 1.1 to the graph G, with the initial cover taken to be S = {Path(u, v, G) | (u, v) ∈ E}, and the parameter k = log n. The tree edge-cover is now constructed by selecting a shortest path spanning tree for each of the clusters of T. The desired properties are guaranteed by the theorem (noting, in particular, that Rad(S) ≤ d and |S| ≤ n², and therefore the output cover T satisfies Rad(T) = O(d · log n) and Δ(T) = O(log n)). ∎

Preprocessing: Construct a tree edge-cover for G. Inside each tree, a leader is chosen to coordinate the operations of the tree. We call two trees neighboring if they share a node.

Pulse generation: The process is performed in two phases. In the first phase, Synchronizer β is applied separately in each tree. Whenever the leader of a tree learns that its tree is done, it reports this fact to all the nodes in the tree, which relay it to the leaders of all the neighboring trees. Now, the nodes of the tree enter the second phase, in which they wait until all the neighboring trees are known to be done and then generate the next pulse (as if Synchronizer α is applied among trees). More details will be given in the full paper.

Complexity: The "congestion" caused by the fact that messages of different trees cross the same edge adds at most an O(log n) multiplicative factor to the time overhead. Since the height of each tree is O(d log n), it follows that the time to simulate one pulse is O(d · log² n).

4 Network synchronizers

4.1 Construction outline

The synchronizers discussed in this section operate by generating sequences of "clock-pulses" at each vertex of the network, satisfying the following property: pulse p is generated at a vertex only after it receives all the messages of the synchronous algorithm that arrive at that vertex prior to pulse p. This property ensures that the network behaves as a synchronous one from the point of view of the particular synchronous algorithm.

The problem arising with synchronizer design is that a vertex cannot know which messages were sent to it by its neighbors, and there are no bounds on edge delays. Thus, the above property cannot be achieved simply by waiting "enough time" before generating the next pulse, as may be possible in a network with bounded delays. However, it may be achieved if additional messages are sent for the purpose of synchronization.

In the unweighted synchronizers of [Awe85a], incoming links are "cleaned" of transient messages in between any two consecutive pulses, similar to the clock synchronizers in Section 3. In our (weighted) case, this would be very inefficient, since cleaning the links requires time proportional to the maximal link weight W > 1, which would therefore dictate the multiplicative overhead of the synchronization. The idea for overcoming this problem is that links of high weight should be cleaned less frequently, thus enabling us to amortize the cost of cleaning them over longer time intervals.

Our synchronizer, denoted γ_w, is also a modification of synchronizer γ of [Awe85a].

Synchronizer γ is a combination of the two simple synchronizers α and β, which are, in fact, generalizations of the techniques of [Gal82]. Synchronizer α is efficient in terms of time but wasteful in communication, while synchronizer β is efficient in communication but wasteful in time. However, we manage to combine these synchronizers in such a way that the resulting synchronizer is efficient both in time and communication. Before describing these synchronizers, we introduce the concept of safety for the weighted case.

Definition 4.1 A message sent from a vertex v to one of its neighbors u over the edge e = (v, u) at pulse q is said to affect a later pulse p if q + w(e) ≤ p. A vertex v is said to be safe with respect to pulse p if each affecting message of the synchronous algorithm sent by v at earlier pulses has already arrived at its destination.

Each vertex eventually becomes safe w.r.t. a pulse p some time after sending all of its messages from earlier pulses. If we require that an acknowledgment is sent back whenever a message of the algorithm is received from a neighbor, then each vertex may detect that it is safe w.r.t. pulse p whenever all its affecting messages have been acknowledged. Observe that the acknowledgments do not increase the asymptotic communication complexity, and each vertex learns that it is safe w.r.t. pulse p within constant time after it executed its pulse p - 1.
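The acknowledgment rule can be sketched as per-vertex bookkeeping; the class and method names below are illustrative, not from the paper:

```python
# Sketch of ack-based safety detection at a single vertex: the vertex is
# safe w.r.t. pulse p once every message it sent that affects pulse p
# (i.e., sent at pulse q over an edge of weight w with q + w <= p) has
# been acknowledged.

class VertexSafety:
    def __init__(self):
        # arrival pulse (q + w) -> number of still-unacknowledged messages
        self.pending = {}

    def send(self, pulse_q, weight):
        p = pulse_q + weight             # first pulse this message can affect
        self.pending[p] = self.pending.get(p, 0) + 1

    def ack(self, pulse_q, weight):
        p = pulse_q + weight
        self.pending[p] -= 1
        if self.pending[p] == 0:
            del self.pending[p]

    def safe(self, pulse_p):
        # Safe w.r.t. p iff no unacked message can still arrive by pulse p.
        return all(arrival > pulse_p for arrival in self.pending)

v = VertexSafety()
v.send(0, 2)                 # sent at pulse 0 on an edge of weight 2
assert not v.safe(2)         # could still be in flight at pulse 2
v.ack(0, 2)
assert v.safe(2)
```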

A new pulse p may be generated at a vertex whenever it is guaranteed that no affecting

message sent at the previous pulses of the synchronous algorithm may arrive at that vertex

in the future. Certainly, this is the case whenever all the neighbors of that vertex are known

to be safe w.r.t. pulse p. It remains to find a way to deliver this information to each vertex

with small communication and time costs.

We need to state a number of definitions first.

Definition 4.2 Given a synchronous protocol π running on a synchronous weighted network G(V, E, w), we say that π is in synch with G if π transmits a message on edge e only at times that are divisible by w(e).

Definition 4.3 A weighted network G(V, E, w) is said to be normalized if all weights w(e) are powers of 2.


Informally, our solution proceeds according to the following plan.

1. Design a synchronizer for normalized networks and protocols that are in synch with

the networks on which they are run.

2. Show that one can transform an arbitrary synchronous protocol π and synchronous

network G, so that the above assumptions are satisfied, without significantly increasing

the complexities.

These two steps are described in the following two subsections.

4.2 Synchronizer γ_w

We assume now that the weights of all network edges are powers of 2, and messages are sent on an edge of weight 2^i only at times divisible by 2^i.

Let δ = log W. We define a collection of sub-networks {G_i(V, E_i) | 0 ≤ i ≤ δ}, by defining E_i to be the set of edges whose weights are divisible by 2^i. (Note that an edge e with weight w(e) = 2^j occurs in all graphs G_i for i ≤ j.)
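The sub-network edge sets translate directly into code; the edge names and dictionary representation below are illustrative:

```python
# Sketch: build the edge sets E_i of the sub-networks G_i(V, E_i),
# where E_i holds the edges whose (power-of-two) weights are divisible
# by 2**i.

def sub_networks(weighted_edges, W):
    """weighted_edges: dict mapping (u, v) -> weight (a power of 2).
    W: maximal edge weight.  Returns a list E[0..log W] of edge sets."""
    delta = W.bit_length() - 1           # delta = log2(W)
    E = [set() for _ in range(delta + 1)]
    for edge, w in weighted_edges.items():
        for i in range(delta + 1):
            if w % (2 ** i) == 0:        # w divisible by 2^i
                E[i].add(edge)
    return E

edges = {('a', 'b'): 1, ('b', 'c'): 4, ('a', 'c'): 2}
E = sub_networks(edges, 4)
assert E[0] == {('a', 'b'), ('b', 'c'), ('a', 'c')}   # every edge is in G_0
assert E[2] == {('b', 'c')}                           # only the weight-4 edge
```

Note that an edge of weight 2^j lands in exactly the sets E_0, ..., E_j, matching the remark above.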

The idea is that pulses divisible by 2^i are handled by a so-called synchronizer γ_i, which is exactly synchronizer γ of [Awe85a], applied to the graph G_i. The synchronizer γ_i treats pulse p · 2^i as "super-pulse" p. It guarantees that super-pulse p is executed only after all messages sent along edges in E_i at super-pulse (p - 1) have arrived.

A vertex has to satisfy all δ synchronizers in order to proceed with a pulse. More specifically, consider a pulse p = 2^j · (2r + 1), i.e., such that 2^j is the maximal power of 2 dividing p. Then pulse p is postponed until super-pulse (2r + 1) · 2^(j-i) of synchronizer γ_i is executed. For example, pulse 24 = 3 · 2³ is completed only after the synchronizers γ_0, γ_1, γ_2, and γ_3 are done carrying their pulses 24, 12, 6 and 3, respectively.
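This bookkeeping is easy to express directly; the function name below is our own:

```python
# Sketch: for a pulse p, list the super-pulses that each synchronizer
# gamma_i must complete before p may be executed.  If p = 2**j * (2r+1),
# then gamma_i (for i <= j) handles super-pulse p // 2**i.

def super_pulses(p):
    j = (p & -p).bit_length() - 1        # j = max power of 2 dividing p
    return [(i, p // (2 ** i)) for i in range(j + 1)]

# Pulse 24 = 3 * 2**3 waits for gamma_0..gamma_3 at pulses 24, 12, 6, 3,
# as in the example above.
assert super_pulses(24) == [(0, 24), (1, 12), (2, 6), (3, 3)]
```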

Lemma 4.4 Synchronizer γ_w is correct.

Proof Sketch: We need to show that under synchronizer γ_w, a vertex v generates pulse p only after receiving all messages it would receive by pulse p were the protocol executed on a synchronous network. This follows from the fact that the set of messages it should get by pulse p, i.e., the set of messages affecting this pulse, includes messages sent on edges belonging to G_i at pulse p - 2^i, and the arrival of these messages is guaranteed by synchronizer γ_i. ∎


4.3 Designing the protocol transformation

In order to justify the assumptions of the previous subsection we need to prove the following

claim.

Lemma 4.5 Given a synchronous protocol π running on a synchronous weighted network G(V, E, w), there exist a synchronous protocol π' and a synchronous network G'(V, E, w') with the following properties:

1. G' is normalized.

2. The protocol π' is in synch with G'.

3. The output of π' on G' is identical to the output of π on G.

4. The time and communication complexities of a run of π' on G' are at most twice the complexities of the corresponding run of π on G.

The lemma is proved throughout the rest of this subsection. We proceed as follows.

Consider a message M transmitted by π on the edge e = (u, v) with weight w(e) = w. For

this message we define the following quantities.

S_M: the time by which M is sent by u.

R_M: the time by which M is received by v.

P_M: the processing time of M at v, which is the first time (at v) by which the contents of that message might actually be used, i.e., by which the vertex program at v behaves differently depending on the contents of M or the fact that M has not been sent.

Clearly, S_M + w = R_M ≤ P_M. However, without loss of generality, we can assume P_M = R_M for the protocol π.

Step 1: Transform π into a synchronous protocol π' slowed down by a factor of 4, i.e., where events that happen at time t in π now happen at time t' = 4t in π'. That is, for any message M, the sending, receiving, and processing times S'_M, R'_M, P'_M in π' are shifted as follows.

S'_M = 4 S_M

R'_M = S'_M + w = 4 S_M + w

P'_M = 4 R_M = 4 (S_M + w)


Observe, however, that edge delays are not stretched by a factor of 4, and hence the arrival time R'_M of M in π' clearly precedes its processing time P'_M. This is not a problem, since the message can be kept in an edge buffer and be effectively ignored until time t' + 4w. This implies that the artificial increase in the delay of edge e from w to any value below 4w is not going to affect the protocol. This explains the next step.

Definition 4.6 Let power(w) denote the smallest power of 2 that is larger than or equal to w, i.e., power(w) = 2^⌈log w⌉.

Observe that w ≤ power(w) < 2w.

Step 2: Instead of running π' on G, run π' on the network G̃(V, E, w̃), where w̃(e) = power(w(e)). The sending, receiving, and processing times S̃_M, R̃_M, P̃_M of π' on G̃ are

S̃_M = S'_M

R̃_M = S̃_M + power(w)

P̃_M = P'_M

Next, it is necessary to guarantee that messages are sent on e only at times divisible by

power(w).

Definition 4.7 Define next_w(t) as the first time at or after t that is divisible by power(w).

Observe that t ≤ next_w(t) < t + power(w).
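Definitions 4.6 and 4.7 translate directly into code; this is our own sketch, written for nonnegative integer times and weights:

```python
# Direct translations of Definitions 4.6 and 4.7 (illustrative sketch).

def power(w):
    """Smallest power of 2 that is >= w, i.e. 2**ceil(log2 w)."""
    p = 1
    while p < w:
        p *= 2
    return p

def next_w(t, w):
    """First time at or after t that is divisible by power(w)."""
    p = power(w)
    return ((t + p - 1) // p) * p       # round t up to a multiple of p

assert power(3) == 4 and power(4) == 4   # w <= power(w) < 2w
assert next_w(5, 3) == 8                 # next multiple of 4 at or after 5
assert next_w(8, 3) == 8
```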

Step 3: Modify π' to obtain a new protocol π'', running on G̃, where for the above message M, the sending, receiving, and processing times S''_M, R''_M, P''_M in π'' are shifted as

S''_M = next_w(S̃_M)

R''_M = S''_M + power(w)

P''_M = P̃_M

The main difference between π' and π'' is that the transmission of M in π'' may be delayed by at most w. This does not cause a problem, because the message M is being ignored until time P''_M = 4 S_M + 4w at the receiving end; thus P''_M ≥ R''_M = S''_M + power(w) still holds.

4.4 Complexity

Lemma 4.8 The synchronizer γ_w described above has the following complexities:

C_p(γ_w) = O(k · n · log W) = O(k · n · log n)

T_p(γ_w) = O(log_k n · log W) = O(log_k n · log n)


Proof: Synchronizer γ_i is invoked on the graph G_i once every 2^i time units. This costs us O(2^i · n · k) in communication and O(2^i · log_k n) time. This waste is amortized over 2^i time units, and then summed over all 0 ≤ i ≤ log W graphs G_i. ∎

5 Controllers

Controllers are applicable in situations where one suspects that errors in the input data, or processor faults, may cause a given protocol π to diverge away from its specification. The danger is that while the protocol may have firm bounds c_π and t_π on its communication and time complexity under normal circumstances (i.e., for correct executions), it may waste valuable resources in an uncontrolled fashion once it diverges at some processors in the network. The task of a controller is to transform the protocol π into a "controlled" protocol ρ whose semantics is identical to that of the original π under correct input but whose complexities are reasonably bounded even on incorrect input.

We consider here the same model as in [AAPS87]. The protocol starts at a certain vertex, called the initiator, and vertices enter the protocol as a result of receiving a message of the protocol. This model is referred to as "diffusing computation" by Dijkstra and Scholten [DS80]. It is worth mentioning that it is easy to extend the case of a single initiator to the general case of multiple initiators.

Suppose that at the time a vertex receives the first message of the protocol, it marks

the edge over which the message has been received. It is easy to see that the collection of

marked edges forms, at all times, a dynamically growing tree rooted at the initiator vertex.

This tree is called the execution tree of the protocol.

Our purpose is to control the growth of this tree, namely, to guarantee that not too many messages are sent during the protocol, without affecting the "correct" executions of the protocol.

Towards this goal, the MAIN CONTROLLER of [AAPS87] views every message sent by the protocol as consuming one unit of some abstract "resource". The protocol must authorize every single consumption of the resource. That is, a vertex that wants to consume a resource unit (i.e., send a message) must first send a special "request" up the execution tree, and then wait to get a special "grant" message, before the resource may actually be consumed.

The most naive (but inefficient) way of controlling the amount of resource units consumed

in a growing tree is to request the root to authorize the consumption of each resource unit.

That is, for each resource unit consumed, a request has to go up on the execution tree


towards the root. Once this request reaches the root, the root increases its "permit counter"

by 1. In case this counter is still less than or equal to some "threshold", it sends back a

permit, which authorizes the consumption of the appropriate resource. Otherwise, if the counter has exceeded the "threshold", execution is suspended and the execution tree stops growing.

In our case, we need to set this threshold to the value of c_π, which is the complexity of π in a correct execution. Thus the naive controller above will not interfere with correct

executions; its only effect is to stop executions that are obviously incorrect.
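The root's side of the naive controller can be sketched as a simple counter; the class and method names are illustrative, not from [AAPS87]:

```python
# Sketch of the naive controller at the root: every resource request
# travels to the root, which grants permits until the permit counter
# exceeds the threshold c_pi (the cost of a correct execution), at
# which point the execution is suspended as obviously incorrect.

class NaiveController:
    def __init__(self, threshold):
        self.threshold = threshold   # c_pi: cost bound for correct runs
        self.counter = 0             # permit counter kept at the root
        self.suspended = False

    def request(self):
        """One resource unit requested; returns True iff a permit is granted."""
        if self.suspended:
            return False
        self.counter += 1
        if self.counter > self.threshold:
            self.suspended = True    # stop the execution tree from growing
            return False
        return True

ctl = NaiveController(threshold=3)
grants = [ctl.request() for _ in range(5)]
assert grants == [True, True, True, False, False]
assert ctl.suspended
```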

The more efficient algorithm of [AAPS87] is similar to the naive controller above in that

the resource requests are propagated up to the root and the permits are distributed from

the root. The algorithm of [AAPS87] takes advantage of the fact that a single message

can represent a number of resource requests or a number of permits. The main idea is to

aggregate a large number of requests/permits together, represent them by a single message,

and allow this message to travel up the tree for a distance proportional to this number.

Permits are kept not only at the root, but also at intermediate vertices. The root keeps an approximate permit counter, so that the actual number of resource units consumed is at most twice the value of this counter. For this reason, even though the root threshold remains the same as in the naive controller above, the guaranteed upper bound on the number of protocol messages sent will be twice as large as in the case of the naive controller, namely 2c_π. Again, the algorithm does not interfere with correct executions of the protocol.

It is shown in [AAPS87] that the communication overhead of the authorization mechanism (namely, permit/grant messages) results in at most log² c_π control messages traversing a given edge of the execution tree. It follows that the total overhead of control messages, as well as the total complexity of the resulting protocol ρ, is c_ρ = O(c_π log² c_π).

When running this protocol on a "weighted network", we consider a transmission of a message on an edge e as a request to consume w(e) units of the resource. Essentially, this is equivalent to running the same algorithm as in [AAPS87] on the "unweighted" version G̃ = (V, Ẽ, w̃) of the network, where an edge e is substituted by a path containing w(e) edges and w(e) - 1 "dummy vertices", and weight w̃(e') = 1 for all edges e'. Since the results of [AAPS87] do not depend on the type of resource being consumed, we deduce:

Corollary 5.1 The CONTROLLER protocol given in [AAPS87] transforms an arbitrary protocol π into an equivalent "controlled" protocol ρ whose (weighted) communication and time complexities are c_ρ = O(c_π log² c_π) and t_ρ = O(c_π log² c_π), respectively.
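The edge-subdivision step used in this reduction can be sketched directly; the dummy-vertex naming scheme below is our own:

```python
# Sketch: turn the weighted network into its "unweighted" version by
# replacing each edge e of weight w(e) with a path of w(e) unit-weight
# edges passing through w(e) - 1 dummy vertices.

def subdivide(weighted_edges):
    """weighted_edges: dict (u, v) -> integer weight.
    Returns the list of unit-weight edges of the subdivided graph."""
    unit_edges = []
    for (u, v), w in weighted_edges.items():
        # u, dummy_1, ..., dummy_{w-1}, v
        nodes = [u] + [f"dummy_{u}_{v}_{k}" for k in range(1, w)] + [v]
        unit_edges += list(zip(nodes, nodes[1:]))
    return unit_edges

edges = subdivide({("a", "b"): 3})
assert len(edges) == 3                        # w(e) unit edges
assert edges[0][0] == "a" and edges[-1][1] == "b"
```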


6 Basic algorithmic techniques

In this section we briefly describe several standard network algorithms and state their complexities in the weighted setting. These algorithms will be used as basic components in the

more involved algorithms to be presented later.

6.1 The flooding algorithm CON_flood

The goal of the flooding algorithm CON_flood (cf. [Seg83]) is to broadcast a message throughout the network. This is done as follows. Each vertex that receives the message for the first time forwards it further to all its neighbors. Future arrivals of the message are ignored.

Fact 6.1 Algorithm CON_flood has communication complexity O(E) and time complexity O(D).

6.2 The depth-first search algorithm DFS

The goal of the DFS algorithm DFS (cf. [Eve79, Awe85b]) is to traverse the network in depth-first order. In order to make use of this algorithm as a component in the algorithms described later, it is necessary to modify it as follows. At any time, the algorithm maintains estimates of the total cost of all the edges traversed so far. Such estimates are kept both at the center of the activity and at the root, and are called the root estimate EST_R and the center estimate EST_C, respectively. The estimates are updated as follows.

1. Each time an edge is traversed, its weight is added to the center estimate EST_C.

2. The root estimate EST_R is updated only whenever the center of activity is about to traverse an edge that will cause the center estimate EST_C to double compared to the current value of EST_R. The update is done via a message from the center of activity to the root, which sets EST_R to be the new value of EST_C.

Thus, the center estimate is the total sum of edge weights for all edges traversed so far, while the root estimate is a lower bound on the total edge weight for all the edges traversed so far, plus the next edge to be traversed. Moreover, this lower bound is within a factor of two of the real value.

Observe that going to the root whenever the estimate is doubled can at most double the communication complexity, as this complexity can be viewed as the sum of a geometric progression. This establishes the following fact.


Fact 6.2 Algorithm DFS has communication complexity O(E) and time complexity O(E).
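The doubling rule for the root estimate can be illustrated with a small sequential stand-in; the traversal order and function name are our own:

```python
# Sketch of the doubling rule for the DFS root estimate: EST_R is
# refreshed only when the center estimate EST_C is about to double
# relative to the root's view, so root updates form a geometric
# progression rather than one message per edge.

def traverse(weights):
    """weights: edge weights in traversal order.  Returns the final
    center estimate and the number of root-update messages sent."""
    est_c, est_r, updates = 0, 0, 0
    for w in weights:
        if est_c + w >= 2 * est_r:       # about to double vs. root's view
            est_r = est_c + w            # message to the root refreshes EST_R
            updates += 1
        est_c += w
    return est_c, updates

est_c, updates = traverse([1] * 1024)
assert est_c == 1024
assert updates <= 12      # O(log total weight) root updates, not 1024
```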

6.3 The full-information minimum spanning tree algorithm MST_centr

The full-information MST algorithm MST_centr is similar in structure to Prim's MST algorithm (cf. [Eve79]). This algorithm proceeds in stages, each selecting the minimum-weight edge connecting a vertex in the tree with a vertex outside the tree, and adding this edge to the tree. The algorithm terminates when the tree spans all the vertices. The resulting tree is an MST of the network. It remains to show how to implement each stage.

Throughout the execution, the invariant maintained by the algorithm is that each vertex in the tree knows the structure of the whole tree. To preserve this invariant, whenever a new vertex is added to the tree, its name is broadcast on the tree, so it reaches all other vertices. This requires only one message on each tree edge for each new vertex added.

As in algorithm DFS, we maintain at the root an estimate on the total weight of all tree edges, called the root estimate. Observe that in this algorithm, the root estimate is precise, as the root knows the structure of the whole tree.

In order to analyze the algorithm we need the following fact, which gives an upper bound on the diameter of an MST.

Fact 6.3 For any minimum spanning tree T of G, Diam(T) ≤ V ≤ (n - 1)D.

Proof: The first inequality is trivial. For the second, consider an edge e = (v1, v2) ∈ T. By definition of the MST, this edge induces a partition of V into two subsets of vertices, V = V1 ∪ V2, such that v1 ∈ V1, v2 ∈ V2 and e is the minimum weight edge connecting a vertex in V1 with a vertex in V2. It follows that e is a shortest path from v1 to v2, since any other path from v1 to v2 contains at least one edge with one endpoint in V1 and the other endpoint in V2. Thus w(e) ≤ Diam(G) = D, and hence

V = w(T) = Σ_{e∈T} w(e) ≤ (n - 1)D.  ∎

Each phase of the algorithm MST_centr constructing a tree T requires O(V) communication and O(Diam(T)) time. There are exactly n - 1 phases. Consequently we have

Corollary 6.4 Algorithm MST_centr has communication complexity O(n · V) and time complexity O(n · Diam(T)).


6.4 The full-information shortest path tree algorithm SPT_centr

The full-information SPT algorithm SPT_centr is in fact a distributed implementation of Dijkstra's algorithm (cf. [Eve79, Gal82]), computing a shortest path tree rooted at a source s. This algorithm is very similar to the MST algorithm MST_centr described above. The algorithm proceeds in phases, each adding one more vertex to the tree. The tree vertices know the structure of the whole tree.

The vertices outside the tree are labeled. The label of a vertex x is the minimum, over all

its neighbors y in the tree, of dist(s, y) + w(y, x). In each phase, the vertex with minimum

label is added to the tree. Once this vertex is chosen, its name is broadcast over the whole

tree.
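The labeling rule can be illustrated with a sequential stand-in for the distributed phases; the representation below is our own:

```python
# Sketch of SPT_centr's phases (sequential stand-in): among vertices
# outside the tree, pick the one whose label
#     min over tree neighbors y of  dist(s, y) + w(y, x)
# is smallest, and add it to the tree.

def spt(n, w, s):
    """n: vertices are 0..n-1; w: dict (u, v) -> weight (undirected).
    Returns the distance map from source s."""
    adj = {}
    for (u, v), wt in w.items():
        adj.setdefault(u, {})[v] = wt
        adj.setdefault(v, {})[u] = wt
    dist = {s: 0}
    while len(dist) < n:
        best = None
        for x in range(n):
            if x in dist:
                continue
            # label of x: computed from its neighbors already in the tree
            labels = [dist[y] + wt
                      for y, wt in adj.get(x, {}).items() if y in dist]
            if labels and (best is None or min(labels) < best[0]):
                best = (min(labels), x)
        if best is None:
            break                        # remaining vertices are unreachable
        dist[best[1]] = best[0]
    return dist

d = spt(4, {(0, 1): 1, (1, 2): 1, (0, 2): 3, (2, 3): 1}, 0)
assert d == {0: 0, 1: 1, 2: 2, 3: 3}
```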

Again, we need a basic fact giving an upper bound on the total weight of a shortest path

tree in order to analyze the algorithm.

Fact 6.5 For any vertex s and any shortest path tree T for s in G, w(T) ≤ (n - 1)V.

Proof: Consider an edge e = (u, v) ∈ T. Let T' be an MST of G, and let P denote Path(u, v, T'), the path connecting u with v in T'. By the definition of an SPT,

w(e) ≤ w(P) ≤ w(T') = V.

Thus,

w(T) = Σ_{e∈T} w(e) ≤ (n - 1)V.  ∎

Each phase of the algorithm SPT_centr constructing a tree T requires O(w(T)) communication and O(D) time. There are exactly n - 1 phases. Consequently we have

Corollary 6.6 Algorithm SPT_centr has time complexity O(n · D) and communication complexity O(n² · V).

7 Connected components and spanning tree construction

In this section, we prove matching upper and lower bounds on the communication complexity

of performing the tasks of finding connected components and constructing a spanning tree.


Figure 7: The graph G_9 (the path 1, 2, ..., 9 with the bypassing edges E_b).

7.1 Lower bounds

Let us first point out that an Ω(E) lower bound on communication is given in [AGPV89] for the case where all edge weights are unity. In the rest of this subsection, we prove an Ω(n · V) lower bound on the communication complexity.

Consider the family of graphs G_n = (V, E, w) defined as follows. V = {1, ..., n}. The set of edges is composed of two subsets, E = E_p ∪ E_b, where the first subset creates a path, E_p = {(i, i + 1) | 1 ≤ i ≤ n - 1}, and the second subset consists of bypassing edges, E_b = {(i, n + 1 - i) | 1 ≤ i ≤ n/2}. The weights are defined as

w(e) = X for e ∈ E_p, and w(e) = X^4 for e ∈ E_b,

where X is some large value, say X > n. Figure 7 depicts the graph G_n for n = 9.

Note that the MST for G_n is the subgraph (V, E_p) based on the path alone, so V = nX.
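The construction of this family can be sketched directly from the definitions above (the function name and representation are our own):

```python
# Sketch: construct the lower-bound family G_n -- a path with heavy
# bypassing edges, following the definitions above.

def build_G(n, X):
    Ep = [(i, i + 1) for i in range(1, n)]               # path edges, weight X
    Eb = [(i, n + 1 - i) for i in range(1, n // 2 + 1)]  # bypassing edges, X**4
    w = {e: X for e in Ep}
    w.update({e: X ** 4 for e in Eb})
    return Ep, Eb, w

Ep, Eb, w = build_G(9, X=10)
assert len(Ep) == 8 and (3, 7) in Eb
assert w[(1, 9)] == 10 ** 4    # a single bypass hop is prohibitively costly
```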

We make some assumptions similar to those of [AGPV89] regarding the model. In particular, we assume that the only operation one can do with ID's is comparisons; this can be extended also to general operations in case the ID's are allowed to be sufficiently large.

In addition, let us formulate the following concepts. Note that every vertex in our graphs has two or three neighbors. Let us assume that each vertex maintains its own id in register R_0 and the id's of its neighbors in input registers R_1, R_2 and R_3 (if needed). The only thing that can be done with these id's is comparing them to each other and to other id's. The vertex does not distinguish between the registers. In particular, it cannot tell which register contains the id of its neighbor along the bypassing edge. However, in order to enable us to speak about this particular register, let us refer to it also as the bypassing register, R_B.

Messages may include vertex id's, as well as information about the relationships between id's, e.g., "the contents of register R_1 at vertex id φ is the id φ'." A vertex i may know any other vertex j only by its id, φ(j). When we say "vertex i obtains the contents φ' of register R_k of vertex j" we mean that at some stage of the run, i learns the fact that register R_k at the vertex whose id is φ(j) contains the id φ'. This can happen either by i being itself j or by its getting a message with that statement. Similarly, when saying "vertex i obtains the id of vertex j" we mean that at some stage of the run, the id φ(j) becomes available to i, i.e., either i is itself j or it gets a message containing φ(j) (as the id of a vertex, i.e., as the contents of register R_0 of some vertex).

Let A be a deterministic algorithm that succeeds in computing a spanning tree on every input graph and whose communication complexity is f(n) = o(n^4). In particular, this means that there exists a constant n_0 such that for every n > n_0, the algorithm A completes the construction of the tree on G_n with communication cost less than n^4. Clearly, then, the algorithm does not send any messages over any bypassing edge in these graphs, since using such an edge immediately incurs a cost of n^4. Henceforth we restrict attention to graphs G_n for n > n_0.

Our proof is based on the following lemma.

Lemma 7.1 In the execution of A on G_n, for every 1 ≤ i ≤ n/2 there is some vertex j such that one of the following two events must happen.

1. j receives both the id of i, φ(i), and the id stored in (n + 1 - i)'s bypassing register R_B.

2. j receives both the id of n + 1 - i, φ(n + 1 - i), and the id stored in i's bypassing register R_B.

Proof: Suppose that there is some 1 ≤ i ≤ n/2 for whom the lemma does not hold. Consider the graph G'_n obtained from G_n by adding two vertices v, w and replacing the edge (i, n + 1 - i) with the two edges (i, v) and (n + 1 - i, w), both with weight X^4. Figure 8 depicts the graph G'_n for n = 9 and i = 3.

Select the id assignment φ(k) = 2k for k ∈ V, and the id's φ(v) = 2(n + 1 - i) - 1 and φ(w) = 2i - 1.

Consider the run of A on the graph G'_n. We claim that the executions of A on G_n and G'_n are similar. This is proved inductively on the length of the runs, noting that as long as no messages are sent over the edges connecting v and w to the rest of the graph, any comparison made by any vertex has the same result, except a comparison of the id of i to the contents of the bypassing register R_B of n + 1 - i, or a comparison of the id of n + 1 - i to the contents of the bypassing register R_B of i. Since no vertex holds both these values,

Figure 8: The graph G'_9 (with the new vertices 3' and 7').

the runs will remain similar. This implies that in G'_n, the vertices v and w will not join the spanning tree, contradicting the assumption that A operates correctly on every graph. ∎

Lemma 7.2 Algorithm A requires Ω(nV) messages.

Proof: Consider the execution of A on G_n and pick some 1 ≤ i ≤ n/2. It follows from the previous lemma that messages containing the names i or n + 1 - i were passed over at least n + 1 - 2i edges during the run. Thus the message complexity of A is at least

X · Σ_{i=1}^{n/2} (n + 1 - 2i) ≥ n²X/4 = Ω(nV).  ∎

7.2 An upper bound

Claim 7.3 Algorithm CON_hybrid (presented below) requires O(min{E, n · V}) messages.

We now describe Algorithm CON_hybrid, whose communication complexity is the minimum between those of the algorithms DFS and MST_centr presented above. In effect, algorithm CON_hybrid runs algorithms DFS and MST_centr in parallel. The idea is that the root vertex "controls" both algorithms and suspends the more expensive one. Towards this goal, the root maintains the variables W_a and W_b, which are the root estimates of algorithms DFS and MST_centr, respectively. In addition, the root maintains the variable Permit, which takes values {DFS, MST_centr} and is updated whenever W_a or W_b are updated at the root. As a rule,

Permit = DFS if W_a ≤ W_b, and Permit = MST_centr otherwise.


When Permit = DFS, algorithm DFS is running and algorithm MST_centr is suspended, and vice versa.

Observe that it is easy to suspend either of the two algorithms, as algorithm DFS (respectively, MST_centr) needs to be suspended only when W_a (resp., W_b) is increased. At that moment, the center of activity of algorithm DFS (resp., MST_centr) is located at the root, so the algorithm can be suspended by requesting that the center of activity stays at the root.

Since at any time the root estimates are within a factor of two of the actual communication costs of both algorithms, and since only the algorithm with the smaller estimate is enabled at any given time, the total complexity of algorithm CON_hybrid cannot exceed the complexity of the cheaper of the two algorithms by more than a factor of four.
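The root's interleaving rule can be sketched as follows; the event-driven representation is our own, not from the paper:

```python
# Sketch of the root's rule in CON_hybrid: at each step, run whichever
# algorithm currently has the smaller root estimate, suspending the
# other ('a' stands for DFS, 'b' for MST_centr).

def schedule(steps):
    """steps: list of ('a'|'b', cost) increments to the root estimates
    W_a and W_b, applied only when that algorithm is the enabled one.
    Returns the trace of Permit values and the final estimates."""
    W = {'a': 0, 'b': 0}
    trace = []
    for alg, cost in steps:
        permit = 'DFS' if W['a'] <= W['b'] else 'MST_centr'
        trace.append(permit)
        enabled = 'a' if permit == 'DFS' else 'b'
        if alg == enabled:               # only the enabled algorithm pays cost
            W[alg] += cost
    return trace, W

trace, W = schedule([('a', 5), ('a', 5), ('b', 3), ('b', 3)])
assert trace[0] == 'DFS'            # initially W_a <= W_b, so DFS runs
assert trace[-1] == 'MST_centr'     # once DFS gets expensive, it is suspended
```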

8 Fast minimum spanning tree algorithms

This section proceeds as follows. In order to explain the more complex algorithms MST_fast and MST_hybrid, we start with a description of the algorithm of [GHS83], referred to as "algorithm MST_ghs", and show that it has communication complexity O(E + V · log n) and time complexity O(E + V · log n). Then, we develop our new algorithms:

* An algorithm MST_hybrid with communication complexity O(min{E + V log n, n · V}).

* An algorithm MST_fast with communication complexity O(E log V log n) and time complexity O(Diam(MST) log V log n) = O(nD log V log n).

8.1 Algorithm MST_ghs

The algorithm consists of two stages, namely, a wake-up stage followed by a work stage. The wake-up stage of [GHS83] is based on flooding a "wake-up" message through the network.

The work stage can be thought of as a modification of the following simple algorithm. The algorithm consists of log n phases. At each phase, we have a number of "fragments", i.e., subtrees of the MST, that try to merge, in parallel, into larger fragments. Towards this goal, each fragment performs the following steps:

1. The name of the fragment is broadcast from the root to all the vertices.

2. Each vertex scans, serially and in increasing order of weights, all edges that have not been scanned so far, until an edge outgoing from the fragment is found. The name of that edge and its weight are reported to the parent.


3. Each vertex collects reports from its children containing the name of the edge chosen by the subtree rooted at the child. Among those, the minimum weight edge is selected and propagated to the parent. The vertex also marks its edge to the child in whose subtree the selected edge is located.

4. When the root selects its outgoing edge, the path from the selected edge to the root has been marked. Now, the root of the tree is moved along this path, and the fragment is hooked onto another fragment along the outgoing edge.

It is easy to implement the above algorithm if the phases of different fragments are synchronized. However, synchronization requires scanning all the edges. The algorithm of [GHS83] manages to accomplish its task without synchronization, thus saving communication. For more details, see [GHS83].

Lemma 8.1 Algorithm MST_ghs requires O(E + V · log n) communication and O(E + V · log n) time.

Proof: The wake-up stage obviously takes O(E) communication and O(D) time. Throughout the algorithm, each non-tree edge is scanned at most twice, and each tree edge is scanned log n times. It follows that the communication complexity is O(E + V · log n). The time complexity is naturally bounded by the communication complexity. ∎

Unfortunately, this algorithm does not exhibit any parallelism; its time complexity could be almost as high as its communication complexity. This is due to the fact that the edges scanned in a given phase may be much heavier than the weight of the MST itself. Thus, edge scanning contributes a term of E to the time complexity. The communication over the MST contributes an additional O(Diam(MST) · log n) = O(V · log n) time.

8.2 Algorithm MSThybrid

The "hybrid" MST algorithm, called MSThybiid, is obtained according to the following plan:

1. Modify the wake-up phase of algorithm MSTghs so that instead of flooding, wake-up is

performed via the DFS algorithm of the previous section.

2. Combine the resulting algorithm with algorithm MSTcntr, as in Section 7.2.

The idea of the first step is that the protocol becomes "controlled", i.e., the root is aware

of the communication wasted so far. This makes it easy to perform the next step, where we

achieve the minimum between the two algorithms.
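The budget-based combination can be sketched generically (a sketch under the assumption that each protocol can be run under a communication budget and reports failure when the budget is exhausted; `run_hybrid` and the callable interface are our invention, not the paper's):

```python
def run_hybrid(algorithms, start_budget=1):
    """Alternate the given protocols under doubling communication budgets.
    Each entry of `algorithms` is a callable alg(budget) that returns a
    result, or None if the budget was exhausted before completion."""
    budget = start_budget
    while True:
        for alg in algorithms:
            result = alg(budget)     # spends at most `budget` messages
            if result is not None:
                return result
        budget *= 2                  # neither finished: double and retry
```

Because the budgets form a geometric series, the total communication spent is within a constant factor of the cheaper protocol's cost, which is how the minimum of the two bounds is achieved.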


Corollary 8.2 Algorithm MSThybrid has communication complexity O(min{E + V · log n, nV}).

8.3 Algorithm MSTfast

The algorithm MSTfast is a modification of algorithm MSTghs. The idea is to reduce the time

it takes to scan very heavy edges that obviously do not belong to the MST. Also, we wish

to avoid the time-consuming process of scanning the edges serially. Towards this goal, we

modify the process of selecting an outgoing edge of a fragment as follows.

In order to avoid the scanning of heavy edges, the root makes a "guess" for the weight

of the outgoing edge. Initially, this guess is 1. If the guess is too low, then the process of

searching for an outgoing edge fails. In this case, the root doubles its guess and repeats the

search. This continues until the search succeeds.

In order to achieve concurrency in the edge scanning process, the vertices scan all the

edges that are below the value guessed by the root in parallel. This guarantees that the time

of the search is upper-bounded by O(Diam(MST)).
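The guessing rule can be sketched as follows (our illustration; the function name and argument are assumptions, and in the real protocol the scan below the guess is performed in parallel across the fragment):

```python
def find_outgoing(outgoing_weights):
    """Return (minimum outgoing edge weight, final guess).  Only edges of
    weight at most the current guess are ever scanned."""
    guess = 1                        # initial guess for the outgoing weight
    while True:
        candidates = [w for w in outgoing_weights if w <= guess]
        if candidates:               # search succeeded at this guess
            return min(candidates), guess
        guess *= 2                   # guess too low: double and repeat
```

Since the guess doubles each round, the number of failed searches is logarithmic in the weight of the minimum outgoing edge, and the guess overshoots that weight by at most a factor of two.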

Corollary 8.3 Algorithm MSTfast requires O(E · log n · log V) communication and
O(Diam(MST) · log V · log n) = O(nD · log V · log n) time.

9 Fast shortest path tree algorithms

9.1 Algorithm SPTsynch

Algorithm SPTsynch is the fastest SPT algorithm we know of. It is obtained by combining

the synchronizer of Section 4 with the synchronous SPT algorithm.

Observe that the synchronous SPT algorithm runs in O(D) time and has O(E) com-

munication complexity (again, making the assumption that a message sent on edge e re-

quires precisely w(e) time). The synchronizer adds O(kn · log n) communication overhead
and O(log_k n · log n) time units for each of the D time units of the original synchronous pro-

tocol. We thus have

Corollary 9.1 The algorithm SPTsynch requires O(E + D · kn · log n) communication and
O(D · log_k n · log n) time.


9.2 Algorithm SPTrecur

Observe that a "weighted" network G = (V, E, w) can be reduced to a BFS problem on an
"unweighted" network Ḡ = (V̄, Ē, w̄), where each edge e is substituted by a path containing
w(e) edges and w(e) − 1 "dummy vertices", and w̄(ē) = 1 for all edges ē. Thus, we can
construct an SPT of the original graph by running the BFS algorithm as in [Awe89] on the
"unweighted" version Ḡ = (V̄, Ē, w̄) of the network.
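The subdivision itself is straightforward; a minimal sketch (assuming integer weights, with our own naming) is:

```python
def subdivide(n, edges):
    """Replace each edge (u, v, w) on vertices 0..n-1 by a path of w unit
    edges through w - 1 fresh 'dummy' vertices.  Returns the adjacency
    list of the resulting unit-weight ('unweighted') graph."""
    adj = {v: [] for v in range(n)}
    nxt = n                          # next free id for a dummy vertex
    for u, v, w in edges:
        prev = u
        for _ in range(w - 1):       # insert the w - 1 dummy vertices
            adj[nxt] = []
            adj[prev].append(nxt)
            adj[nxt].append(prev)
            prev, nxt = nxt, nxt + 1
        adj[prev].append(v)          # final unit edge reaches v
        adj[v].append(prev)
    return adj
```

A BFS from the source in the subdivided graph reaches every original vertex at depth equal to its weighted distance, so the BFS tree restricted to the original vertices is an SPT.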

The BFS algorithm of [Awe89] is based on a very simple BFS algorithm, referred to in

the sequel as the DIJKSTRA Algorithm, due to its resemblance to Dijkstra's shortest path

algorithm (cf. [Gal82]) and the Dijkstra-Scholten distributed termination detection procedure
[DS80]. This algorithm is similar to Algorithm SPTcntr of Section 6, with the only difference

being that the names of vertices that join the tree do not have to be broadcast along the

tree.

The algorithm maintains a tree rooted at the source vertex. Initially, the tree is empty.

Upon termination of the algorithm, the tree is the desired BFS tree. Throughout the algo-

rithm, the tree can only grow, and at any time it is a subtree of the final BFS tree. The

algorithm opeit,cs in successive iterations, each 'cding one more BFS layer to the tree. At

the beginning of a given iteration 1, the tree contains all vertices in layers less than or equal

to I - 1. Upon termination of iteration 1, the tree is extended to layer I as well.
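Ignoring the synchronization machinery, the layer-by-layer growth can be sketched as follows (a centralized rendering with our own names; the distributed DIJKSTRA-style algorithm synchronizes each iteration over the current tree):

```python
def bfs_by_layers(adj, source):
    """Grow the BFS tree one layer per iteration; returns the parent map,
    which encodes the final BFS tree rooted at `source`."""
    parent = {source: None}
    frontier = [source]              # layer l - 1 at the start of iteration l
    while frontier:
        next_layer = []
        for u in frontier:           # one exploration message per edge
            for v in adj[u]:
                if v not in parent:  # v joins the tree at layer l
                    parent[v] = u
                    next_layer.append(v)
        frontier = next_layer        # synchronize, then proceed to layer l + 1
    return parent
```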

The complexities of this algorithm are O(d · n + E) messages and O(d · D) time, where d
is the number of layers being processed. Indeed, there are d iterations and in each of them
synchronization is performed over the BFS tree, which requires O(n) messages and O(D)

time. In addition, one exploration message is sent over each edge once in each direction. Ob-

viously, the performance of the algorithm degrades as the number d of layers to be processed

increases.

The idea behind [Awe89] is to reduce the problem where a large number of layers needs

to be processed to a problem with a small number of layers. If the network has diameter D,

we can conceptually "slice" the network into D/d "strips" of length d, and process those strips

sequentially (see Figure 9).

In [Awe89], an efficient reduction strategy is proposed. It is proved in [Awe89] that, in
the unweighted case, O(D^ε) time units are spent on each BFS layer, and O(E^ε) messages
traverse each network edge. In our (weighted) case, we have D layers and E edges in the
"unweighted" version of our network. Thus, we have

Corollary 9.2 The weighted communication and time complexities of the resulting protocol
SPTrecur are O(E^{1+ε}) and O(D^{1+ε}), respectively.


Strip Sources

Figure 9: Strip Method.

9.3 The hybrid SPT algorithm SPThybrid

As before, it is possible to combine the two SPT algorithms SPTsynch and SPTrecur and obtain
an algorithm SPThybrid that is as efficient in communication as either one of the two. This is
done in a manner similar to the design of the hybrid MST algorithm MSThybrid in Section 8.2.

Corollary 9.3 The weighted communication and time complexities of the resulting protocol

SPThybrid are given by O(min{E^{1+ε}, E + n · D}).

Acknowledgments

We warmly thank Oded Goldreich for his illuminating comments on a previous draft.


References

[AAPS87] Yehuda Afek, Baruch Awerbuch, Serge A. Plotkin, and Michael Saks. Local

management of a global resource in a communication network. In Proc. 28th

IEEE Symp. on Foundations of Computer Science, October 1987.

[ABLP89] Baruch Awerbuch, Amotz Bar-Noy, Nati Linial, and David Peleg. Compact

distributed data structures for adaptive network routing. In Proc. 21st ACM Symp. on Theory of Computing, pages 230-240, May 1989.

[AGPV89] Baruch Awerbuch, Oded Goldreich, David Peleg, and Ronen Vainish. A tradeoff

between information and communication in broadcast protocols. J. of the ACM,

1989.

[ALSY88] Y. Afek, G.M. Landau, B. Schieber, and M. Yung. The power of multimedia:

combining point-to-point and multiaccess networks. In Proc. of the 7th ACM Symp. on Principles of Distributed Computing, pages 90-104, Toronto, Canada,

August 1988.

[AP90a] Baruch Awerbuch and David Peleg. Network synchronization with polylogarithmic overhead. In Proc. 31st IEEE Symp. on Foundations of Computer Science, pages 514-522, 1990.

[AP90b] Baruch Awerbuch and David Peleg. Sparse partitions. In Proc. 31st IEEE Symp.

on Foundations of Computer Science, pages 503-513, 1990.

[AP91] Baruch Awerbuch and David Peleg. Routing with polynomial communication -

space trade-off. SIAM J. on Discr. Math., 1991. To appear. Also in Technical

Memo TM-411, MIT, Sept. 1989.

[Awe85a] Baruch Awerbuch. Complexity of network synchronization. J. of the ACM,

32(4):804-823, October 1985.

[Awe85b] Baruch Awerbuch. A new distributed depth-first-search algorithm. Info. Process.

Letters, 20:147-150, April 1985.

[Awe87] Baruch Awerbuch. Optimal distributed algorithms for minimum weight spanning

tree, counting, leader election and related problems. In Proceedings of the 19th

Annual ACM Symposium on Theory of Computing, pages 230-240, May 1987.


[Awe89] Baruch Awerbuch. Distributed shortest paths algorithms. In Proc. 21st ACM

Symp. on Theory of Computing, pages 230-240, May 1989.

[BKJ83] K. Bharath-Kumar and J. M. Jaffe. Routing to multiple-destinations in computer

networks. IEEE Trans. on Commun., pages 343-351, March 1983.

[DS80] Edsger W. Dijkstra and C. S. Scholten. Termination detection for diffusing com-

putations. Info. Process. Letters, 11(1):1-4, August 1980.

[ER90] Shimon Even and Sergio Rajsbaum. The use of a synchronizer yields maximum

computation rate in distributed networks. In Proc. 22nd ACM Symp. on Theory

of Computing. ACM SIGACT, ACM, May 1990.

[Eve79] Shimon Even. Graph Algorithms. Computer Science Press, 1979.

[Gal82] Robert G. Gallager. Distributed minimum hop algorithms. Technical Report

LIDS-P-1175, MIT, Lab. for Information and Decision Systems, January 1982.

[GHS83] Robert G. Gallager, Pierre A. Humblet, and P. M. Spira. A distributed algorithm

for minimum-weight spanning trees. ACM Trans. on Programming Lang. and

Syst., 5(1):66-77, January 1983.

[GS86] O. Goldreich and L. Shrira. The effects of link failures on computations in asyn-

chronous rings. In Proc. 5th ACM Symp. on Principles of Distributed Computing,

pages 174-186. ACM, August 1986.

[Jaf80] Jeffrey Jaffe. Using signalling messages instead of clocks. Unpublished
manuscript, 1980.

[Jaf85] Jeffrey M. Jaffe. Distributed multi-destination routing: the constraints of local

information. SIAM Journal on Computing, 14:875-888, 1985.

[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system.

Comm. of the ACM, 21(7):558-565, July 1978.

[PU89] David Peleg and Jeffrey D. Ullman. An optimal synchronizer for the hypercube.

SIAM J. on Comput., 18(2):740-747, 1989.

[Seg83] Adrian Segall. Distributed network protocols. IEEE Trans. on Info. Theory,

IT-29(1):23-35, January 1983. Some details in technical report of same name,

MIT Lab. for Info. and Decision Syst., LIDS-P-1015; Technion Dept. EE, Publ.

414, July 1981.
