Limits on the Power of Parallel Random Access Machines with Weak Forms of Write Conflict Resolution



Faith E. Fich¹, Russell Impagliazzo², Bruce Kapron³, Valerie King³, and Mirosław Kutyłowski⁴

¹ University of Toronto
² University of Toronto; present address: University of California, San Diego
³ University of Toronto; present address: University of Victoria
⁴ Heinz Nixdorf Institut, Universität Paderborn; present address: Uniwersytet Wrocławski

1 Introduction

A parallel random access machine (PRAM) that allows concurrent writes must specify how to resolve write conflicts when they occur. One method is to let an adversary determine what value appears [1]. In other words, an algorithm must compute the correct answer no matter what values appear when write conflicts occur. Hagerup and Radzik [12] call this model the ROBUST PRAM. Such a model is very weak, since the way an adversary resolves write conflicts may depend on the entire algorithm and the values of all the inputs.

We define a fixed adversary PRAM to be a concurrent-read concurrent-write PRAM in which the value that appears in a given cell after a given time step can be expressed as a function of only the value already in the cell, the processors attempting to write to the cell, and the values they are attempting to write. The method for determining the outcome of write conflicts may differ from memory cell to memory cell and from time step to time step. However, the choice of methods cannot depend on what algorithm will be run. Essentially, the write conflict resolution mechanism is algorithm oblivious. By definition, any fixed adversary PRAM is at least as powerful as the ROBUST PRAM.

Many different write conflict resolution schemes for the PRAM have been studied. In the PRIORITY PRAM [5, 8], the processor of lowest index that attempts to write into a given cell at a given time step succeeds. When two or more processors simultaneously attempt to write into the same cell in the COLLISION PRAM [5], a special collision symbol appears.
In the TOLERANT PRAM [9], the value of the cell remains unchanged in case of a write conflict. Another example is the MAXIMUM PRAM [4], in which the largest value written to a memory cell at a given time step is the value that appears there. These are all examples of fixed adversary PRAMs.

In the ARBITRARY PRAM [5, 7], an adversary is allowed to determine which one of the values written to a given cell will appear. Unlike the ROBUST PRAM, the adversary can neither leave the cell contents unchanged nor write some unrelated value when a write conflict occurs. The ARBITRARY PRAM is at least as powerful as the ROBUST PRAM. However, it is not an example of a fixed adversary PRAM.
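Each of these fixed-adversary resolution schemes is a function of the old cell value and the multiset of attempted writes. The following sketch is our own illustration (not from the paper): the `(processor, value)` encoding and the string `"collision"` standing in for the collision symbol are our choices, and the parity rule used by the F₂-adversary of Section 2 is included for comparison.

```python
# Each fixed-adversary resolution rule maps (old cell value, attempted
# writes) to the new cell value.  writes is a list of (processor, value).

def priority(old, writes):
    # lowest-indexed writer succeeds
    return min(writes)[1] if writes else old

def collision(old, writes):
    # two or more simultaneous writers produce a special symbol
    if len(writes) >= 2:
        return "collision"
    return writes[0][1] if writes else old

def tolerant(old, writes):
    # the cell is left unchanged on any conflict
    return writes[0][1] if len(writes) == 1 else old

def maximum(old, writes):
    # the largest attempted value appears
    return max(v for _, v in writes) if writes else old

def f2_adversary(old, writes):
    # the F_2-adversary rule of Section 2: a 1-bit cell flips exactly
    # when an odd number of processors attempt to write the negated value
    flips = sum(1 for _, v in writes if v != old)
    return old ^ (flips & 1)
```

Note that all five rules agree when at most one processor writes, which is why they all simulate the exclusive-write case without change.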

The COMMON PRAM [5, 13] is a fixed adversary PRAM that can only run a restricted class of algorithms. In this model, many processors may simultaneously attempt to write to the same cell, provided they all attempt to write the same value. This value appears as the result of the write. The relationship between the computational powers of the COMMON and ROBUST PRAMs is not well understood. Specifically, it is unknown whether there are problems that can be solved substantially faster by the ROBUST PRAM than by the COMMON PRAM or vice versa.

Except for the ROBUST PRAM, every previously studied concurrent-read concurrent-write (CRCW) PRAM can easily compute the OR of n Boolean variables in constant time using n processors, whereas Ω(log n) steps are necessary for the concurrent-read exclusive-write (CREW) PRAM, even with an unbounded number of processors [2, 3]. It is unknown whether the ROBUST PRAM can compute OR any faster than the CREW PRAM. More generally, it is not known whether this very weak form of concurrent write is useful for deterministically computing any function over a complete domain.

We address the question of the relative power of the ROBUST PRAM and the CREW PRAM. In Section 2, we prove a lower bound of Ω(log(d)/log(kq)) on the time required by the ROBUST PRAM to compute Boolean functions, in terms of q^k, the number of different values each memory cell of the PRAM can contain, and the degree d of the function when expressed as a polynomial over the finite field of q elements. For almost all Boolean functions, including OR, this implies that the ROBUST PRAM with 1-bit memory cells requires Ω(log n) steps. This is significant, since every function can be computed by the CREW PRAM with 1-bit memory cells in O(log n) steps [2]. We also derive a lower bound of Ω(min{√(log d), log(d)/log log_q(p)}) steps for computing any Boolean function of degree d over F_q by the ROBUST PRAM with p processors, even if memory cells can contain arbitrarily large values.
In particular, the ROBUST PRAM with 2^(2^(O(√(log n)))) processors requires Ω(√(log n)) steps to compute OR. These lower bounds are obtained using carefully chosen fixed adversary PRAMs.

In Section 3, we show the limitations of these techniques by describing how any fixed adversary PRAM with p processors and O(log(p/n))-bit memory cells can compute OR in O(log n / log log(p/n)) steps. In particular, with 2^(2^(Θ(√(log n)))) processors, O(√(log n)) steps suffice. With 2^n − 1 processors, only 2 steps and one n-bit memory cell are needed. This result gives rise to a simulation of the p-processor PRIORITY PRAM by any fixed adversary PRAM using only a constant factor more time and a factor of p more memory cells, but with an exponential increase in the number of processors.

Adding randomization to the ROBUST PRAM increases its computational power. Borodin, Hopcroft, Paterson, Ruzzo, and Tompa [1] showed that the randomized ROBUST PRAM can compute OR in O(log log n) time, with any constant probability of error. Hagerup and Radzik [12] also showed that O(log log n) time is sufficient to compute OR, using only n processors and with error probability 2^(−(log n)^O(1)). In Section 4, we present a new randomized ROBUST PRAM algorithm that uses only O(log* n) time and n/(log* n) processors and has error probability 2^(−O(2^((log* n)/4))).

In contrast, even with an unlimited number of processors, the randomized CREW PRAM requires Ω(log n) steps [3]. Together with our lower bounds, these results prove a separation between the deterministic and randomized ROBUST PRAM. For the CREW PRAM, no such separation exists, since randomization provides no more than a factor of 8 in speedup over the deterministic model [3].

Throughout this paper, we use P_1, …, P_p to denote the p processors of a PRAM and M_1, …, M_m to denote its m memory cells. If the input consists of n variables, x_1, …, x_n, we assume that they are initially located in memory cells M_1, …, M_n, respectively. The other memory cells are initialized to 0. Each step of a PRAM computation is assumed to consist of three phases. In the first phase, each processor can read from a cell of shared memory; in the second phase, processors are allowed to do an arbitrary amount of local computation; and in the third phase, each processor can attempt to write to a cell of shared memory. Formal definitions of the PRAM model can be found in [2, 3].

2 Lower Bounds

Given any field F and any function f : D → R, where D ⊆ F^n and R ⊆ F, the polynomial g : F^n → F represents f over F if g(x) = f(x) for all x ∈ D. In particular, every Boolean function can be uniquely expressed as a multilinear polynomial (i.e. as a sum of monomials) over any field [14]. The degree of f over F is defined to be the minimum degree of any polynomial that represents f over F. The degree of f over F is undefined if there is no polynomial that represents f over F.

In [3], it was shown that the time complexity of computing any Boolean function of degree d over the field of real numbers is Θ(log d). Here, using a similar approach, we derive lower bounds for the time to compute a Boolean function on a particular fixed adversary PRAM (and hence the ROBUST PRAM) in terms of its degree deg_q over the finite field F_q of q elements.

Theorem 1.
On the ROBUST PRAM with memory cells that can hold at most q^k different values, Ω(log(d)/log(kq)) steps are required to compute any Boolean function of degree d over F_q.

To prove this result, it suffices to prove the lower bound on any fixed adversary PRAM, each of whose memory cells holds a k-tuple of elements in F_q. The key is to choose a fixed adversary PRAM with nice properties.

When memory cells can contain only single bits (i.e. when q = 2 and k = 1), we use the F_2-adversary PRAM. In this model, the value of a shared memory cell changes exactly when an odd number of processors simultaneously attempt to write the negated value. In other words, if a memory cell contains the value 0 (respectively, 1) before a given step, then it will contain the value 1 (respectively, 0) after the step if an odd number of processors attempt to write the value 1 (respectively, 0) there during the step, and it will remain 0 (respectively, 1) otherwise.

For larger values of q, the write conflict resolution rule can be generalized. Specifically, in the F_q-adversary PRAM, if a memory cell M contains the value v ∈ F_q

before a given step, then, after that step, M will contain the value

    v + Σ_{u ∈ F_q} (u − v) · (the number of processors that attempt to write the value u to M),

where all operations are performed in the field F_q. Note that, if no processors write to M, then the sum is empty and M retains its old value v. If all the processors that write to M attempt to write v, M also retains the value v. However, if exactly one processor writes to M, then M will contain the value that processor wrote.

The F_q-adversary PRAM works component-wise for k > 1, treating each component of each memory cell separately. Short tuples can be padded with leading zeros when necessary, so the actual number of components k does not need to be specified. However, when k is bounded, it is possible to derive lower bounds for the time to compute Boolean functions on the F_q-adversary PRAM.

Lemma 2. On the F_q-adversary PRAM with memory cells that hold k-tuples of elements in F_q, Ω(log(d)/log(kq)) steps are required to compute any Boolean function of degree d over F_q.

Proof. Consider any F_q-adversary PRAM algorithm for memory cells that hold k-tuples of elements in F_q. Without loss of generality, we may assume that a processor's state is merely (an encoding of) the sequence of values it has read at each step (so a processor never forgets information). Given an input x = (x_1, …, x_n) ∈ {0,1}^n, let

    S_{P,t,w}(x) = 1 if processor P is in state w immediately after step t, and 0 otherwise.

Then S_{P,t,w} is the characteristic function describing those inputs x for which processor P is in state w at time t. Note that, for different states w and w′, the set of inputs for which processor P is in state w at time t is disjoint from the set of inputs for which processor P is in state w′ at time t.
Thus, for any set of states W, Σ_{w ∈ W} S_{P,t,w} is the characteristic function of the set of inputs for which processor P is in a state of W at time t.

Let C_{M,t,i}(x) denote the contents of the ith component of memory cell M immediately after step t on input x, and let

    B_{M,t,i,u}(x) = 1 if C_{M,t,i}(x) = u, and 0 otherwise

be the characteristic function of the set of inputs for which the ith component of memory cell M has value u at time t. Define

    s(t) = max_{P,w} deg_q(S_{P,t,w}),
    c(t) = max_{M,i} deg_q(C_{M,t,i}), and
    b(t) = max_{M,i,u} deg_q(B_{M,t,i,u})

to be the maximum degrees of these functions over F_q.

Since each processor is in its initial state at time 0 for all inputs, the associated characteristic functions are all constant, so s(0) = 0. The initial contents of the least significant components of the memory cells M_1, …, M_n are described by the linear functions x_1, …, x_n, respectively. The other components of memory cells M_1, …, M_n and all k components of any other memory cells are initially 0, and their contents are described by the constant 0 function. Furthermore, for i = 1, …, k, B_{M,0,i,1}(x) = C_{M,0,i}(x), B_{M,0,i,0}(x) = 1 − C_{M,0,i}(x), and B_{M,0,i,u}(x) = 0 if u ∈ F_q − {0,1}. Therefore c(0) = b(0) = 1.

Processor P is in state ⟨v(1), …, v(t−1), v⟩ immediately after step t if and only if it is in state ⟨v(1), …, v(t−1)⟩ immediately after step t − 1 and the memory cell M that it reads during step t contains value v = (v_k, …, v_1). Hence

    S_{P,t,⟨v(1),…,v(t−1),v⟩}(x) = S_{P,t−1,⟨v(1),…,v(t−1)⟩}(x) · Π_{i=1}^{k} B_{M,t−1,i,v_i}(x)

and s(t) ≤ s(t−1) + k·b(t−1) for t > 0.

Now consider the writing phase of step t. Let W(P,M,t,i,u) denote the set of states in which processor P writes the value u to the ith component of memory cell M at step t. Then Σ_{w ∈ W(P,M,t,i,u)} S_{P,t,w}(x) has value 1 if processor P attempts to write the value u to the ith component of memory cell M at step t on input x, and has value 0 otherwise. It follows from the definition of the write conflict resolution rule that the ith component of the contents of memory cell M at the end of step t can be expressed as

    C_{M,t,i}(x) = C_{M,t−1,i}(x) + Σ_{u ∈ F_q} (u − C_{M,t−1,i}(x)) · Σ_P ( Σ_{w ∈ W(P,M,t,i,u)} S_{P,t,w}(x) ),

when viewed as a polynomial over F_q.
Thus c(t) ≤ c(t−1) + s(t) for t > 0.

Finally, over F_q,

    B_{M,t,i,u}(x) = 1 − (C_{M,t,i}(x) − u)^(q−1),

which is 1 if C_{M,t,i}(x) = u and 0 otherwise, so b(t) ≤ (q−1)·c(t).

It is easily shown by induction that

    s(t) ≤ (k(q−1)/Δ) · [ ((k(q−1)+2+Δ)/2)^t + ((k(q−1)+2−Δ)/2)^t ]  and
    c(t) ≤ ((k(q−1)+Δ)/(2Δ)) · ((k(q−1)+2+Δ)/2)^t − ((k(q−1)−Δ)/(2Δ)) · ((k(q−1)+2−Δ)/2)^t,

where Δ = √(k(q−1)·[k(q−1)+4]).

If an algorithm computes a Boolean function f, then the contents of the least significant component of the output cell at the end of the computation is described by the function f. Hence the number of steps t taken by the algorithm satisfies c(t) ≥ deg_q(f), which implies that

    t ≥ log_{(k(q−1)+2+Δ)/2} ( 2·deg_q(f) / (k(q−1)+Δ) ) ∈ Ω(log(deg_q(f)) / log(kq)).  □
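The three recurrences above can be iterated numerically as a sanity check (our own check, not part of the paper), replacing each inequality by equality:

```python
# Numeric check of the degree recurrences from the proof of Lemma 2,
# iterated with equality:
#   s(t) = s(t-1) + k*b(t-1),  c(t) = c(t-1) + s(t),  b(t) = (q-1)*c(t),
# starting from s(0) = 0 and c(0) = b(0) = 1.

def degree_growth(q, k, steps):
    """Return [c(0), c(1), ..., c(steps)], the memory-degree bounds."""
    s, c, b = 0, 1, 1
    history = [c]
    for _ in range(steps):
        s = s + k * b      # degree gained by reading one more cell
        c = c + s          # degree gained by one write phase
        b = (q - 1) * c    # characteristic function via (q-1)st power
        history.append(c)
    return history
```

For q = 2 and k = 1 this yields 1, 2, 5, 13, 34, …, i.e. every second Fibonacci number, which is the source of the Fibonacci-based bound φ(d) discussed below.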

In the special case when the memory cells can contain only single bits, it follows from the bound derived in the proof of Lemma 2 that at least φ(deg_2(f)) steps are required to compute the Boolean function f. Here φ(d) = min{ j | F_{2j+1} ≥ d }, where F_0, F_1, F_2, … is the Fibonacci sequence defined by F_0 = 1, F_1 = 1, and F_j = F_{j−1} + F_{j−2} for j ≥ 2. In particular, since the OR of n Boolean variables has degree n over F_2, the F_2-adversary PRAM and the ROBUST PRAM both require φ(n) steps to compute this function with single-bit memory cells. This lower bound exactly matches the upper bound for computing OR on the CREW PRAM with single-bit memory cells [2].

The threshold-k function of n Boolean variables and the exactly-k-out-of-n function both have degree at least n/2 over F_2. In fact, over F_2, only a very tiny fraction (at most 1/2^(2^(n−1))) of all polynomials in n variables have degree less than n/2. Hence, most Boolean functions of n arguments require at least φ(n) − 1 steps to be computed by the ROBUST PRAM with single-bit memory cells.

Any nonconstant Boolean function f satisfies |f⁻¹(0)|, |f⁻¹(1)| ≥ 2^(n − deg_2(f)) [6]. Thus, if 0 < |f⁻¹(0)| ≤ 2^(n−d) or 0 < |f⁻¹(1)| ≤ 2^(n−d), then deg_2(f) ≥ d, so the ROBUST PRAM with 1-bit memory cells requires at least φ(d) steps to compute f.

The PARITY function of n variables has degree 1 over F_2. However, over F_3, it has degree n. It follows from the bound derived in the proof of Lemma 2 that a ROBUST PRAM whose memory cells can hold at most 3 different values requires at least 0.45·log_2 n steps to compute PARITY.

For the ROBUST PRAM, the actual values written to and read from shared memory are not important. They can be renamed essentially arbitrarily without affecting the number of steps performed. Specifically, let A be any ROBUST PRAM algorithm and let h : IN → IN be any bijection of the natural numbers that maps each possible input or output value to itself.
Then construct a new algorithm h(A) in which every processor applies the function h to each value it attempts to write into shared memory and applies the function h⁻¹ to each value it reads from shared memory. Because the input and output values are not affected by the renaming, the algorithms A and h(A) compute the same function. Furthermore, since processors are allowed to perform an arbitrary amount of local computation at each step, h(A) uses no more steps than A.

Lemma 3. Consider any ROBUST PRAM algorithm A computing a Boolean function using p processors. There is a renaming function h such that, during the first t steps of h(A) on the F_q-adversary PRAM, all except the ⌊2^t · log_q(2pq)⌋ least significant components of every memory cell are 0 (when integers are represented in q-ary notation) and, after step t, every processor is in at most (2pq)^(2^t − 1) different states.

Proof. The proof is by induction on t. Initially, the only values in shared memory are 0 and 1 and every processor is in its one initial state. Since 1 ≤ ⌊2^0 · log_q(2pq)⌋ and 1 = (2pq)^(2^0 − 1), the claim is true for t = 0 with h as the identity function.

Let t ≥ 0 and assume the claim is true for t with renaming function h. Consider step t + 1 of h(A) on the F_q-adversary PRAM. At the beginning of this step, each processor is in at most (2pq)^(2^t − 1) states and, in each state, it can read one of at most q^e different values, where e = ⌊2^t · log_q(2pq)⌋. Therefore, at the end of step t + 1, every processor is in at most (2pq)^(2^t − 1) · q^e ≤ (2pq)^(2^t − 1) · (2pq)^(2^t) = (2pq)^(2^(t+1) − 1) states.

Let V denote the set of those values greater than or equal to q^e that processors attempt to write to shared memory during this step. Since each processor can write at most (2pq)^(2^t − 1) · q^e different values during time step t + 1, it follows that |V| ≤ p · q^e · (2pq)^(2^t − 1). Let h′ be any bijection that maps the same element as h does to each of 0, …, q^e − 1 and maps each element of V into the range {q^e, …, q^e + |V| − 1}.

Note that every execution of h′(A) on the F_q-adversary PRAM is identical to an execution of h(A) on the F_q-adversary PRAM for the first t steps. Furthermore, each value a processor attempts to write during step t + 1 of h′(A) on the F_q-adversary PRAM is less than q^e + |V|, so all except its ⌈log_q(q^e + |V|)⌉ ≤ e + 1 + ⌊log_q(p·(2pq)^(2^t − 1))⌋ ≤ ⌊2^(t+1) · log_q(2pq)⌋ least significant q-ary digits are 0. It follows from the definition of the F_q-adversary PRAM that all except the ⌊2^(t+1) · log_q(2pq)⌋ least significant components of every memory cell are 0 at the end of step t + 1. Thus the claim is true for t + 1 with renaming function h′.  □

Theorem 4. With p processors (and memory cells that can contain arbitrarily large values), the ROBUST PRAM requires Ω(min{√(log d), log(d)/log log_q(p)}) steps to compute any Boolean function of degree d over F_q.

Proof. Let f be a Boolean function of degree d over F_q that can be computed by a ROBUST PRAM algorithm A in T steps. By Lemma 3, there is a renaming function h such that, when h(A) is run on the F_q-adversary PRAM, all except the ⌊2^T · log_q(2pq)⌋ least significant components of every memory cell are 0. Since h(A) computes f, it follows from Lemma 2 that T ∈ Ω(log(d) / log(2^T · log_q(2pq))) or, equivalently, T² + T·log log_q(2pq) ∈ Ω(log d). Hence T ∈ Ω(min{√(log d), log(d)/log log_q(p)}).  □

In particular, since deg_2(OR) = n, the ROBUST PRAM with 2^(2^(O(√(log n)))) processors requires Ω(√(log n)) steps to compute the OR of n bits.
Moreover, to compute OR in constant time, the ROBUST PRAM requires at least 2^(n^Ω(1)) processors and shared memory cells with wordsize at least n^Ω(1).

3 Constant Time Upper Bounds for Fixed Adversary PRAMs

It is not known whether the ROBUST PRAM can compute the OR of n bits faster than the CREW PRAM, even if it is allowed an unlimited number of processors and an unlimited number of memory cells of unlimited wordsize. (The wordsize of a memory cell is the number of bits needed to represent the values it can contain.) In this section, we show that any fixed adversary PRAM can even simulate the PRIORITY PRAM with only a constant factor increase in time, given sufficient resources. In particular, this indicates that new proof techniques will be needed to prove that the ROBUST PRAM is no more powerful than the CREW PRAM. The key to the simulation is an algorithm for computing OR.

Theorem 5. The OR of n bits can be computed in 2 steps on any fixed adversary PRAM with 2^n − 1 processors and one memory cell of wordsize n.
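The proof below hinges on a purely combinatorial splitting step: starting from a collection C of processor subsets that the adversary's value function maps to the same value, some processors are forced to always write (the set A) and others are forbidden to write (the set N) until exactly one subset of C survives. A minimal sketch of that step (our own code; the function name `split` and the list representation are our choices):

```python
# Given a collection C of subsets of processors, repeatedly pick a
# processor P occurring in some but not all remaining subsets and keep
# only the minority side, so |C| at least halves each iteration.
# Returns (A, N, L): L is the unique survivor, which contains all of A
# and none of N.

def split(C, processors):
    C = [frozenset(s) for s in C]
    A, N = set(), set()
    while len(C) > 1:
        # a processor in some but not all remaining subsets always exists
        P = next(p for p in processors
                 if any(p in s for s in C) and not all(p in s for s in C))
        with_P = [s for s in C if P in s]
        if 2 * len(with_P) > len(C):   # majority contain P: forbid P,
            N.add(P)                   # keep the subsets avoiding P
            C = [s for s in C if P not in s]
        else:                          # minority contain P: force P,
            A.add(P)                   # keep the subsets containing P
            C = with_P
    return A, N, set(C[0])
```

The invariant matching the proof: among the original subsets, the survivor L is the only one that contains all of A and avoids all of N.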

Proof. Let p = 2^n − 1. For any subset W ⊆ {1, 2, …, p}, let v(W) denote the value that appears in the memory cell after step 1 when each processor whose number is in W attempts to write its number to the memory cell and the remaining processors do not attempt to write. The memory cell is assumed to be initialized to 0.

If no processors attempt to write, the contents of the memory cell remain unchanged, and if exactly one processor attempts to write, the memory cell will contain the number of that processor. Thus the range of the function v has size at least p + 1.

We will partition the processors into three sets, A, N, and D, with |D| = n, and show that there exists a subset S ⊆ D such that v(A ∪ S) ≠ v(A ∪ S′) for all subsets S′ ⊆ D that differ from S. Using these sets, the following algorithm computes the OR of n Boolean variables. In the read phase of the first step, each processor in D is assigned a different bit of input to read. Processors in A always attempt to write their numbers, processors in N never write, and each processor in D writes its number or not depending on the value of the input bit it read: a processor in S attempts to write its number if it reads the value 0, and a processor in D − S attempts to write its number if it reads the value 1.

If the input bits are all 0, then the processors in A ∪ S are exactly those that attempt to write and, as a result, the memory cell will contain the value v(A ∪ S). Otherwise, the set of processors attempting to write is A ∪ S′ for some subset S′ ⊆ D that differs from S.
In this case, the memory cell will contain a value other than v(A ∪ S). Therefore, in the next step, any predetermined processor can complete the computation by reading the contents of the memory cell and writing the value 0 if it saw the value v(A ∪ S) and the value 1 if it saw anything else.

Now we show how to construct sets A, N, and S satisfying the desired properties. Since the function v has a range of size at least p + 1, there is at least one value r in the range that is the image of at most 2^p/(p + 1) different subsets in the domain. Let C initially be the collection of subsets that map to r (i.e. C = v⁻¹(r)).

While there are at least two subsets remaining in C, choose a processor P that occurs in some but not all of the subsets in C. If the majority of subsets in C contain P, add P to N and remove every subset that contains P from C. Otherwise, add P to A and remove every subset that does not contain P from C.

At each iteration, C is reduced in size by at least a factor of 2. After no more than log(2^p/(p + 1)) = p − log(p + 1) = p − n iterations, only one subset remains, and |A ∪ N| ≤ p − n. The subsets remaining in C at any point are those that map to r, contain A, and are disjoint from N.

Let L be the last subset of processors in C and let S = L − A. If S contains at most n processors, let D be any set of n processors that contains S and is disjoint from A ∪ N. Otherwise, move processors from S to A until S has size n, and then let D = S. Since A ∪ S = L ∈ C, we have v(A ∪ S) = r.

Now suppose S′ ⊆ D and v(A ∪ S′) = r. Then A ∪ S′ was originally in C. But A ∪ S′ contains no processors in N and contains every processor in A; therefore it was never removed from C. Since L is the only subset in C at the end of the construction, A ∪ S′ = L and, hence, S′ = S. Therefore v(A ∪ S′) ≠ r for all S′ ⊆ D such that S′ ≠ S.  □

Corollary 6. The OR of n bits can be computed in O(log n / log log(p/n)) steps on

any fixed adversary PRAM with p processors and wordsize log(p/n) + log log(p/n).

Proof. Divide the bits into groups of size s = log(p/n) + log log(p/n) and assign 2^s − 1 processors to each group. Since there are at most n/s groups and (2^s − 1)·n/s < n·(p/n)·log(p/n)/(log(p/n) + log log(p/n)) < p, there is a sufficient number of processors available. By Theorem 5, the OR of each group can be computed in O(1) steps, leaving a problem of size n/s. If this process is repeated t = log(n)/log log(p/n) times, a problem of size n/s^t = 1 remains, whose value is the solution to the original problem.  □

In particular, the OR of n bits can be computed in O(√(log n)) steps using 2^(2^(Θ(√(log n)))) processors.

Theorem 7. The PRIORITY PRAM with p processors and m memory cells can be simulated for t steps by any fixed adversary PRAM with 2^p − 1 processors and m(p + 1) memory cells in O(t) steps.

Proof. Processor P_i is simulated by a team of 2^(i−1) processors, one of which is designated P′_i. Each memory cell M_j is simulated using one memory cell M′_j and p auxiliary memory cells M′_{j,1}, …, M′_{j,p} that are initialized to 0.

When processor P_i reads M_j, the ith team of processors all read M′_j and they all perform the same local computations. When processor P_i attempts to write to M_j, processor P′_i writes the value 1 in location M′_{j,i}. Then the ith team computes the OR of the values in locations M′_{j,1}, …, M′_{j,i−1} (using the algorithm of Theorem 5) and writes the answer in M′_{j,i}. Processor P′_i reads M′_{j,i} and, if it contains the value 0, writes the value P_i attempted to write to location M′_j. Otherwise, P′_i writes 0 to M′_{j,i}.  □

4 Randomized Algorithms for Computing OR on the ROBUST PRAM

In this section, we describe a randomized algorithm for computing OR in O(log* n) time on the ROBUST PRAM with single-bit shared memory cells. The algorithm has one-sided error: when the value 1 is output, it is always correct; when 0 is output, it errs with probability o(1).
Virtually the same algorithm may be used to simulate a step of the ARBITRARY PRAM with the same error and time bounds, provided memory cells can contain at least p different values.

Hagerup and Radzik [12] describe a randomized algorithm to simulate a step of the ARBITRARY PRAM in time O(log log n) with error at most 1/n, using O(n log n / log log n) processors. Our algorithm uses a routine REDUCE which is similar to theirs, but replaces their randomized "scattering" technique with a simple deterministic method. To achieve the better time bounds, our algorithm involves the repeated application of REDUCE, which compounds the error.

The idea of our algorithm is to sample varying amounts of the input in parallel, in such a way that, when the input contains a 1, we find a sample in which a small number of input bits are 1.

Let X = {x_1, x_2, …, x_n} be the set of input bits. Using at most n(1 + log_2 n)^4 processors and a constant number of parallel steps, REDUCE(X) reduces the input set to a set Y containing at most 4(1 + log_2 n)^4 bits, so that if OR(X) = 0 then OR(Y) = 0, and if OR(X) = 1 then OR(Y) = 0 with probability less than 1/n.

REDUCE(X)

Perform the following ⌈log_2 n⌉ times, independently and in parallel:

1. For each of the n input bits x_i, associate a group of (⌊log_2 n⌋ + 1)^3 processors {P_{i,j,k,l} | 0 ≤ j, k, l ≤ ⌊log_2 n⌋}, all of which read x_i.

2. Consider a matrix A with n rows and ⌊log_2 n⌋ + 1 columns. For each input bit x_i, one processor in its group, say P_{i,j,1,1}, writes the value 1 to A(i, j) with probability 1/2^j, provided x_i = 1; otherwise, it writes the value 0. Note that if x_i = 0, then the entire ith row of A is 0.

3. For each column j, initialize a (⌊log_2 n⌋ + 1) × 2 × (⌊log_2 n⌋ + 1) × 2 array B_j to 0. This can be accomplished by having processors P_{1,j,k,l}, P_{2,j,k,l}, P_{3,j,k,l}, and P_{4,j,k,l} write the value 0 to B_j(k, 0, l, 0), B_j(k, 0, l, 1), B_j(k, 1, l, 0), and B_j(k, 1, l, 1), respectively.

4. Each processor P_{i,j,k,l} reads A(i, j) and, if A(i, j) = 1, writes the value 1 to B_j(k, i_k, l, i_l), where i_{⌊log_2 n⌋} ⋯ i_1 i_0 is the binary representation of i.

The output array Y, with 4(⌊log_2 n⌋ + 1)^3 · ⌈log_2 n⌉ elements, is created by concatenating the entries of all the B arrays created during all of the parallel executions.

If OR(X) = 0, then the value 1 is never written and the output Y is entirely 0. If OR(X) = 1, the output Y should contain at least one 1. We show that this happens with probability at least 1 − 1/n.

Lemma 8. If OR(X) = 1, then Y fails to contain a 1 with probability at most 1/n.

Proof. Assume OR(X) = 1. Let r be the number of ones contained in X. Consider any one of the parallel executions and let z be the number of ones in the ⌊log_2 r⌋th column of A. The probability that z = 0 is (1 − 1/2^(⌊log_2 r⌋))^r ≤ 1/e. The expected value of z is r·(1/2^(⌊log_2 r⌋)) < 2.
Chernoff bounds show that the probability of z being at least three times its expected value is less than 1/e^4. Hence Pr[1 ≤ z ≤ 5] > 1 − 1/e − 1/e^4 > 1/2.

Suppose that 1 ≤ z ≤ 5. In this case, some cell of B_{⌊log_2 r⌋} will have the value 1 written to it by exactly one processor. To see this, consider the indices i such that A(i, ⌊log_2 r⌋) = 1. Then there exist two bit positions which together distinguish one of these indices from all the others.

The output Y fails to contain a 1 only when all the ⌈log_2 n⌉ parallel executions fail to produce z within the required range. If OR(X) = 1, the probability of these ⌈log_2 n⌉ independent events all occurring is less than 1/n.  □

Independently, Hagerup [10] developed a similar technique, graduated conditional scattering, that can be used to reduce the problem of computing the OR of n bits to that of computing the OR of O((log n)²) bits with n processors and error probability O(1/log n).

After a sufficient number of applications of REDUCE, the original problem is reduced to computing the OR of a very small number of bits, which may be done

deterministically. The resulting algorithm is always correct when it outputs 1, although it may err when it outputs 0.

Lemma 9. The OR of n bits can be computed by a ROBUST PRAM with O(n(log n)^4) processors and 1-bit shared memory cells in O(log* n) steps, using a randomized algorithm that errs with probability O(1/2^(4 log* n)).

Proof. Let h(n) denote the minimum integer h such that log^(h) n ≤ 2^(log* n). Then h(n) ≤ log* n. Iterate REDUCE h(n) times. The size of the remaining problem is in Θ((log^(h(n)) n)^4) ⊆ O(2^(4 log* n)). Solve this problem using a deterministic CREW PRAM algorithm in O(log* n) steps [2]. The failure probability of this algorithm is at most 1/n + Σ_{h=1}^{h(n)−1} 1/(log^(h) n)^4 = O(1/(log^(h(n)−1) n)^4) ⊆ O(1/2^(4 log* n)).  □

Using the standard technique of partitioning a problem into successively larger subproblems, the number of processors can be reduced to n and the error probability can be improved.

Lemma 10. The OR of n bits can be computed by a ROBUST PRAM with n processors and 1-bit shared memory cells in O(log* n) steps, using a randomized algorithm that errs with probability O(1/2^(2^((log* n)/4))).

Proof. First partition the bits into n/2^(log* n) sets of size 2^(log* n) and use a deterministic CREW PRAM algorithm to compute the OR of each set in O(log* n) steps. This leaves a subproblem of size n/2^(log* n).

Let z_0 = 2^(log* n) and, for i ≥ 0, let z_{i+1} = 2^(z_i^(1/4) − 3). In the (i+1)st iteration, divide the remaining n/z_i bits into n/(z_i·s_i) subproblems, each of size s_i = 2^(z_i^(1/4) − 1), and allocate s_i(1 + log_2 s_i)^4 = s_i·z_i processors to each. Apply REDUCE to each subproblem in parallel. This reduces each subproblem from size s_i to size 4(1 + log_2 s_i)^4 = 4z_i. Therefore, at the end of the (i+1)st iteration, 4n/s_i = n/z_{i+1} bits remain. After t = min{ i | z_i ≥ n } iterations, the algorithm terminates.

Provided n is sufficiently large, z_{i+2} ≥ 2^(z_i) for all i ≥ 0. Therefore t ∈ O(log* n). Each iteration takes constant time.
Hence, the time complexity of this algorithm is in O(log* n).

The probability of error is the probability that some 1 in the input fails to appear in any of the reductions. From Lemma 8, it follows that the total probability of error is at most Σ_{i=1}^{t} 1/s_{i−1} ∈ O(1/2^(z_0^(1/4))) = O(1/2^(2^((log* n)/4))).  □

There is another way to obtain an algorithm for computing OR that runs in O(log* n) time and uses n processors: repeatedly apply graduated conditional scattering until the problem size is 2^(Θ(log* n)) and then solve the remaining problem deterministically, as in the proof of Lemma 9 [11]. The resulting algorithm has error probability 2^(−O(log* n)).

By first partitioning the bits into groups of size Θ(log* n) and computing the OR of each group sequentially, the number of processors can be reduced from n to O(n/log* n) without increasing the time by more than a constant factor. The two resulting algorithms are efficient in the sense that their time-processor products are Θ(n), the sequential complexity of computing OR.
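The core of one parallel execution of REDUCE can be simulated sequentially. The sketch below is our own simplification (not the paper's code): it treats the ROBUST adversary pessimistically by trusting only cells of the B arrays written by exactly one processor, which is exactly the event analyzed in the proof of Lemma 8.

```python
import random
from collections import Counter

def reduce_once(X):
    """One parallel execution of REDUCE, simulated sequentially.

    Returns True if some B-array cell is written by exactly one
    processor, i.e. is guaranteed to hold a 1 no matter how the ROBUST
    adversary resolves conflicts; cells with two or more writers may
    hold anything and are not trusted.
    """
    n = len(X)
    L = n.bit_length()                       # about log2(n) + 1 columns
    writes = Counter()
    for i, x in enumerate(X):
        if x != 1:
            continue                         # zero inputs never write
        for j in range(L):
            if random.random() < 2.0 ** (-j):   # column j kept w.p. 2^-j
                for k in range(L):
                    for l in range(L):
                        # B_j cell indexed by two bit positions of i
                        writes[(j, k, (i >> k) & 1, l, (i >> l) & 1)] += 1
    return any(c == 1 for c in writes.values())
```

If OR(X) = 0, nothing is ever written and the result is False; if X contains exactly one 1, column 0 always samples it and the result is True deterministically.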

Theorem 11. The OR of n bits can be computed by a ROBUST PRAM with O(n/log* n) processors and 1-bit shared memory cells in O(log* n) steps, using a randomized algorithm that errs with probability 2^(−O(2^((log* n)/4))).

Acknowledgements

We would like to thank Torben Hagerup and Charlie Rackoff for helpful discussions. This work was supported by the Natural Sciences and Engineering Research Council of Canada, the Information Technology Research Centre of Ontario, the Defense Advanced Research Projects Agency of the United States, NEC, and the Deutsche Forschungsgemeinschaft.

References

1. A. Borodin, J. Hopcroft, M. Paterson, L. Ruzzo, and M. Tompa, "Observations Concerning Synchronous Parallel Models of Computation", manuscript, 1980.
2. S. Cook, C. Dwork, and R. Reischuk, "Upper and Lower Time Bounds for Parallel Random Access Machines without Simultaneous Writes", SIAM J. Comput., volume 15, 1986, pages 87-97.
3. M. Dietzfelbinger, M. Kutyłowski, and R. Reischuk, "Exact Time Bounds for Computing Boolean Functions without Simultaneous Writes", Proc. Second Annual ACM Symposium on Parallel Algorithms and Architectures, 1990, pages 125-135.
4. D. Eppstein and Z. Galil, "Parallel Algorithmic Techniques for Combinatorial Computing", Ann. Rev. Comput. Sci., volume 3, 1988, pages 233-283.
5. F. Fich, P. Ragde, and A. Wigderson, "Relations Between Concurrent-Write Models of Parallel Computation", SIAM J. Comput., volume 17, 1988, pages 606-627.
6. P. Fischer and H. U. Simon, "On Learning Ring-Sum-Expansions", Proc. Third Workshop on Computational Learning Theory, 1990; to appear in SIAM J. Comput.
7. S. Fortune and J. Wyllie, "Parallelism in Random Access Machines", Proc. 10th Annual Symposium on Theory of Computing, 1978, pages 114-118.
8. L. Goldschlager, "A Unified Approach to Models of Synchronous Parallel Machines", JACM, volume 29, 1982, pages 1073-1086.
9. V. Grolmusz and P. Ragde, "Incomparability in Parallel Computation", Discrete Applied Mathematics, volume 29, no.
1, 1990, pages 63-78.
10. T. Hagerup, "Fast and Optimal Simulations between CRCW PRAMs", Proc. 9th Annual Symposium on Theoretical Aspects of Computer Science, 1992, pages 45-48.
11. T. Hagerup, personal communication.
12. T. Hagerup and T. Radzik, "Every Robust CRCW PRAM can Efficiently Simulate a PRIORITY PRAM", Proc. Second Annual ACM Symposium on Parallel Algorithms and Architectures, 1990, pages 117-124.
13. L. Kučera, "Parallel Computation and Conflicts in Memory Access", Information Processing Letters, volume 14, 1982, pages 93-96.
14. R. Smolensky, "Algebraic Methods in the Theory of Lower Bounds for Boolean Circuit Complexity", Proc. 19th Annual ACM Symposium on Theory of Computing, 1987, pages 77-82.
