Para-Functional Programming


Transcript of Para-Functional Programming

Page 1: Para-Functional Programming

Domesticating Parallelism

Para-Functional Programming

Paul Hudak, Yale University

This methodology treats a multiprocessor as a single autonomous computer onto which a program is mapped, rather than as a group of independent processors.


The importance of parallel computing hardly needs emphasis. Many physical problems and abstract models are seriously compute-bound, since sequential computer technology now faces seemingly insurmountable physical limitations. It is widely believed that the only feasible path toward higher performance is to consider radically different computer organizations, in particular ones exploiting parallelism. This argument is indeed rather old now, and considerable progress has been made in the construction of highly parallel computers.

One of the simplest and most promising types of parallel machines is the well-known multiprocessor architecture, a collection of autonomous processors with either shared or distributed memory that are interconnected by a homogeneous communications network and usually communicate by sending messages. The interest in machines of this type is not surprising, since not only do they avoid the classic "von Neumann bottleneck" by being effectively decentralized, but they are also extensible and in general quite easy to build. Indeed, more than a dozen commercial multiprocessors either are now or will soon be available.

Although designing and building multiprocessors has proceeded at a dramatic pace, the development of effective ways to program them has generally not. This is an unfortunate state of affairs, since experience with sequential machines tells us that software development, not hardware development, is the most critical element in a system's design. The immense complexity of parallel computation can only increase our dependence on software. Clearly we need effective ways to program the new generation of parallel machines.

In this article I introduce para-functional programming, a methodology for programming multiprocessor computing systems. It is based on a functional programming model augmented with features that allow programs to be mapped to specific multiprocessor topologies. The most significant aspect of the methodology is that it treats the multiprocessor as a single autonomous computer onto which a program is mapped, rather than as a group of independent processors that carry out complex communication and require complex synchronization. In more conventional approaches to parallel programming, the latter method of treatment is often manifested as processes that cooperate by message-passing. However, such notions are absent in para-functional programming; indeed, a single language and evaluation model can be used from


Page 2: Para-Functional Programming

problem inception, to prototypes targeted for uniprocessors, and ultimately to realizations on a parallel machine.

Functional programming and parallel computing

The future of parallel computing depends on the creation of simple but effective parallel-programming models (reflected in appropriate language designs) that make the details of the underlying architecture transparent to the user. Many researchers feel that conventional imperative languages are inadequate for such models, since these languages are intrinsically tied to the "word-at-a-time" von Neumann machine model. 1 Extending such a sequential model to the parallel world is like putting on a shoe that doesn't fit. It makes more sense to use a language with a nonsequential semantic base.

One of the better candidates for parallel computing is the class of functional languages (also known as applicative or dataflow languages). In a functional language, no side effects (such as those caused by an assignment statement) are permitted. The lack of side effects accounts at least partially for the well-known Church-Rosser Property, which essentially states that no matter what order of computation is chosen in executing a program, the program is guaranteed to give the same result (assuming termination). This marvelous determinacy property is invaluable in parallel systems. It means that programs can be written and debugged in a functional language on a sequential machine, and then the same programs can be executed on a parallel machine for improved performance. The key point is that in functional languages the parallelism is implicit and supported by their underlying semantics. There is generally no need for special message-passing constructs or other communications primitives, no need for synchronization primitives, and no need for special "parallel" constructs such as "parbegin...parend."

On the other hand, doing without assignment statements seems rather radical. Yet clearly the assignment statement is an artifact of the von Neumann computer model and is not essential to the most abstract form of computation. In fact, a major goal of high-level language design has been the introduction of expressions, which transfer the burden of generating sequential code involving assignments from programmer to compiler. Functional languages simply carry this goal to the extreme: Everything is an expression. The advantages of the resulting programming style have been well-argued elsewhere, 1,2 and will not be repeated here. However, I wish to emphasize the following point: Although most experienced programmers recognize the importance of minimizing side effects, the importance of doing so in a parallel system is intensified significantly, due to the careful synchronization required to ensure correct behavior when side effects are present. Without side effects, there is no way for concurrent portions of a program to affect one another adversely; this is simply another way of stating the Church-Rosser Property.

The use of functional languages for parallel programming is really nothing new. Such use has its roots in early work on dataflow and reduction machines, in the course of which many functional languages were developed simultaneously with the design of new parallel architectures. Consider, for example, J. B. Dennis's dataflow machine and the language VAL, Arvind's U-interpreter and the language ID, A. L. Davis's dataflow machine DDM1 and the language DDN, and R. M. Keller's reduction machine AMPS and the language FGL. Such work on automatically decomposing a functional program for parallel execution continues today, and includes Rediflow 3 and my own work on serial combinators. 4

The aforementioned systems automatically extract parallelism from a program and dynamically allocate the resultant tasks for parallel execution. But what about a somewhat different scenario, one in which the programmer knows the optimal mapping of his or her program onto a particular multiprocessor? One cannot expect an automated system to determine this optimal mapping for all program-processor combinations, so it is desirable to provide the user with the ability to express the mapping explicitly. (The need for this ability often arises, for example, in scientific computing, where many classic algorithms have been redesigned for optimal performance on particular

ParAlfl provides a mechanism for mapping a program onto an arbitrary multiprocessor.

machines.) As it stands, almost no languages provide this capability.

ParAlfl is a functional language that provides a simple yet powerful mechanism for mapping a program onto an arbitrary multiprocessor. The mapping is accomplished by annotating subexpressions so as to show the processor on which they will be executed. With annotations, the mapping can be done in such a way that the program's functional behavior is not altered; that is, the program itself remains unchanged. The resulting methodology is referred to as para-functional programming, since it provides not only a much-needed tool for expressing parallel computation, but also an operational semantics that is truly "extra," or "beyond," the functional semantics of the program. It is quite powerful, for several reasons:

* It is very flexible. Not only is para-functional programming easily adapted to any functional language, but also any network topology can be captured by the notation, since no a priori assumptions are made about the structure of the physical system. All the benefits of conventional scoping disciplines are available to create modular programs that conform to the topology of a given machine.

* The annotations are natural and concise. There are no special control constructs, no message-passing constructs, and in general no forms of "excess baggage" to express the rather simple notion of where and when to compute things.

* With some minor constraints, if a para-functional program is stripped of its annotations, it is still a perfectly valid functional program. This means that it can be written and debugged on a uniprocessor that ignores the annotations, and


Page 3: Para-Functional Programming

An important semantic feature of ParAlfl is lazy evaluation.

then executed on a parallel processor for increased performance. Portability is enhanced, since only the annotations need to change when one moves from one parallel topology to another (unless the algorithm itself changes). The ability to debug a program independently of the parallel machinery is invaluable.

ParAlfl: a simple para-functional programming language

ParAlfl forms the testbed of Yale's para-functional programming research. It was derived from a functional language called ALFL, 5 which is similar in style to several modern functional languages, including SASL (and its successors KRC and Miranda), 6 FEL, 7 and Lazy ML. 8

The base language. To make ParAlfl accessible to a broader audience, the base language, as shown here, was changed somewhat; for example, the arguments in function calls are "tupled" rather than "curried." The interested reader is referred to Henderson 2 and to Darlington, Henderson, and Turner 9 for a more thorough treatment of the functional programming paradigm. The salient features of the base language are

* Block-structuring is used, and takes the form of an equation group with the following configuration:

[ f1(x1,...,xk1) == exp1;
  f2(x1,...,xk2) == exp2;
  ...
  fn(x1,...,xkn) == expn;
  result exp ]

An equation group is simply a collection of mutually recursive equations (each defining a local identifier) together with a single result clause that expresses the value to which the equation group will evaluate (result is a reserved word). Equation groups are just expressions, and can thus be nested to an arbitrary depth.

* A double equal-sign ("==") is used to distinguish equations from Boolean expressions of the form exp1 = exp2. The argument list is optional, allowing definitions of simple values, such as x == exp. Since the equations are mutually recursive, and since ParAlfl is a lazy functional language, the order of the equations is irrelevant. All values are essentially evaluated "on demand."

* As in Lisp, the list is a fundamental data structure in ParAlfl. The operators ^, ^^, hd, and tl are like cons, append, car, and cdr, respectively, in Lisp. ^ and ^^ build lists: a^l is the list whose first element is a and whose rest is just the list l, and l1^^l2 is the list resulting from appending the lists l1 and l2 together. Hd and tl decompose lists: hd(a^l) returns a, and tl(a^l) returns l. A proper list (one ending in "nil") can be constructed with square brackets, as in [a,b,c], which is equivalent to a^b^c^[] (^ is right associative). Lists are constructed lazily. (A rough Haskell analogue of these primitives, and of the arrays described next, is sketched after this list.)

* ParAlfl has functional arrays. The equation v == mka(d,f) (mka is short for "make array") defines a vector v of d values, indexed from 1 to d, such that the ith element v[i] is the same as f(i). More generally, the equation a == mka(d1,d2,...,dn,f) defines an n-dimensional array a such that a[i1,...,in] = f(i1,...,in). Arrays are constructed lazily, although the elements are computed in parallel. (See the section entitled "Eager expressions," below.) In an earlier article on para-functional programming 10 arrays were defined as being non-lazy, or strict. In reality, both kinds of array construction are provided in ParAlfl.
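Since ParAlfl itself is not widely available, a rough Haskell sketch of these list primitives and of a one-dimensional mka may be helpful; the names and the 1-based indexing below follow the article, not any standard library, and Haskell is chosen only because it is a similarly lazy functional language.

import Data.Array (Array, listArray)

-- Sketches of ParAlfl's list primitives (Haskell lists are lazy,
-- matching the lazy construction described in the article).
cons :: a -> [a] -> [a]
cons = (:)                 -- a ^ l

append :: [a] -> [a] -> [a]
append = (++)              -- l1 ^^ l2

hd :: [a] -> a
hd = head                  -- hd(a ^ l) = a

tl :: [a] -> [a]
tl = tail                  -- tl(a ^ l) = l

-- mka d f builds a d-element vector whose ith element is f i.
-- Boxed Haskell arrays are lazy in their elements.
mka :: Int -> (Int -> a) -> Array Int a
mka d f = listArray (1, d) [f i | i <- [1 .. d]]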

An important semantic feature of ParAlfl is lazy evaluation.* That is, expressions are evaluated on demand instead of according to some syntactic rule, such

*Lazy evaluation is closely related to the call-by-name semantics of Algol, but is different in that once an expression is computed, its value is retained. In function calls, lazy evaluation is sometimes referred to as call-by-need evaluation.

as the order of identifier bindings. For example, one can write

[ a == b*b;
  b == 2;
  result f(a,b);
  f(x,y) == if p then y else x + y ]

Note that a depends on b, yet is defined before b. Indeed, the order of these equations and the result clause is totally irrelevant. Note further that the function f does not use its first argument if p is true. Thus, in the call f(a,b), the argument a is never evaluated (that is, the multiplication b*b never happens) if p is true.

An often highlighted feature of lazy

evaluation is its ability to express unbounded data structures, or infinite lists. For example, an infinite list of the squares of the natural numbers can be defined by

[ result squares(0);
  squares(n) == n*n ^ squares(n+1) ]

However, an important but often overlooked advantage of lazy evaluation is simply that it frees the programmer from extraneous concerns about the order of evaluation of expressions. Being freed from such concerns is very liberating for programming in general, but is especially important in parallel programming because over-specifying the order of evaluation can limit the potential parallelism.
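For readers more familiar with Haskell (another lazy language), the infinite-squares example transliterates directly; only the demanded prefix is ever computed:

-- Infinite list of squares; take selects only what is demanded,
-- so the unbounded definition is harmless.
squares :: [Integer]
squares = go 0
  where go n = n * n : go (n + 1)

-- take 10 squares == [0,1,4,9,16,25,36,49,64,81]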

Mapped expressions. A program can be mapped onto a particular multiprocessor architecture through the use of mapped expressions. These form one of the two classes of extensions (annotations) to the base language. (The other class is made up of eager expressions, which are described below.) Mapped expressions have the simple form

exp $on proc

which declares that exp is to be computed on the processor identified by proc (on proc is prefixed with $ to emphasize that $on proc is an annotation). The expression exp is the body of the mapped expression, which is to say, it represents the value to which the overall expression will evaluate (and thus can be any valid ParAlfl expression, including another mapped expression). The expression proc must evaluate to a processor ID. Without loss of generality, we will assume in all examples below that processor IDs, or pids, are integers and that there is some predefined mapping from those integers to the physical processors they denote. For example, a tree of


Page 4: Para-Functional Programming

Figure 1. Two possible network topologies: infinite binary tree (a), and finite mesh of size m x n (b). Listed with each topology are functions that map pids to neighboring pids.

processors might be numbered as shown in Figure 1(a) and a mesh as shown in Figure 1(b). The advantage of using integers is that the user can manipulate them with conventional arithmetic primitives; for example, Figure 1 also defines functions that map pids to neighboring pids. However, a safer discipline might be to define a pid as a unique data-type, and to provide primitives that enable the user to manipulate values having that type.
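As an illustration, the tree-numbering arithmetic suggested by Figure 1(a) might look as follows in Haskell, assuming the root is pid 1 and the children of pid p are 2p and 2p+1 (the numbering the factorial example uses later); this is a sketch of the idea, not ParAlfl syntax:

type Pid = Int

-- Neighbor functions for an infinite binary tree of pids.
left, right, parent :: Pid -> Pid
left   p = 2 * p
right  p = 2 * p + 1
parent p = p `div` 2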

Simple examples of mapped expressions. Consider the program fragment

f(x) + g(y)

The strict semantics of the + operator allows the two subexpressions to be evaluated in parallel. If we wish to express precisely where the subexpressions are to be evaluated, we can do so by annotating them, as in

(f(x) $on 0) + (g(y) $on 1)

where 0 and 1 are processor IDs.

Of course, this static mapping is not

very interesting. It would be nice, for example, if we were able to refer to a processor with respect to the currently executing one. ParAlfl provides this ability through the reserved identifier $self, which when evaluated returns the pid of the currently executing processor. Using

$self we can be more creative. For example, suppose we have a mesh or tree of processors as shown in Figure 1; we can then write

(f(x) $on left($self)) + (g(y) $on right($self))

to denote the computation of the two subexpressions in parallel on neighboring processors, with the sum being computed on $self.

We can describe the behavior of $self

more precisely as follows: $self is bound implicitly by mapped expressions; thus, in

exp $on pid

$self has the value pid in exp, unless it is further modified by a nested mapped expression. Although $self is a reserved word that cannot be redefined, this implicit binding can be best explained with the following analogy:

exp $on pid

is like

[ $self == pid; result exp ]

However, the most important aspect of

$self is that it is dynamically bound in function calls. Thus, in

[ result (f(a) $on pid1) + (f(b) $on pid2);
  f(x) == x*x ] $on pid3

a*a is computed on processor pid1, b*b on processor pid2, and the sum on processor pid3. As before, an analogy is useful in describing this behavior:

f(x,y,z) == exp;
... f(a,b,c) ...

is like

f(x,y,z,$self) == exp;
... f(a,b,c,$self) ...

In other words, all functions implicitly take an extra formal parameter, $self, and all function calls use the current value of $self as the value for the new actual parameter.
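The analogy can be made concrete in Haskell by threading an explicit Pid parameter through every call; the sketch below is mine (the names on, f, and program are hypothetical), since ParAlfl performs this threading implicitly:

type Pid = Int

-- body $on pid: evaluate body with $self rebound to pid.
on :: (Pid -> a) -> Pid -> a
on body pid = body pid

-- f(x) == x*x, with the implicit $self parameter made explicit.
f :: Int -> Pid -> Int
f x _self = x * x

-- [ result (f(a) $on pid1) + (f(b) $on pid2) ] $on pid3
program :: Pid -> Pid -> Pid -> Int -> Int -> Int
program pid1 pid2 pid3 a b =
  on (\_self -> on (f a) pid1 + on (f b) pid2) pid3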

always needed. Particular cases illustrating this are those in which mappings can be made from composite objects, such as vectors and arrays, to specific multiprocessor configurations. For example, if f is defined by f(i) == i**2 $on i, then the call mka(n,f) will produce a vector of squares, one on each of n processors, such that the ith processor contains the ith element (namely i**2). Further, suppose we have two vectors v and w and we wish to create a third that is the sum of the other two, but distributed over the n processors. This can be done very simply by

mka(n,g);
g(i) == (v[i] + w[i]) $on i

If v and w were already distributed in the same way, this would express the pointwise parallel summation of two vectors on n processors.
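A rough Haskell analogue of this pointwise sum is sketched below, assuming v and w share the same 1-based bounds; the $on i placement annotation has no direct Haskell counterpart and is simply dropped:

import Data.Array (Array, listArray, bounds, (!))

-- Pointwise vector sum: element i of the result is v!i + w!i.
vsum :: Num a => Array Int a -> Array Int a -> Array Int a
vsum v w = listArray (bounds v) [v ! i + w ! i | i <- [lo .. hi]]
  where (lo, hi) = bounds v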


Page 5: Para-Functional Programming

The value of an "eager expression" is that of the expression without the annotation.


A note on lexical scoping and data movement. Consider the following typical situation: A shared value v is to be computed for use in two independent subexpressions, e1 and e2; the values of these subexpressions are then to be combined into a single result. In a conventional language one might express this as something like

begin
  v := code-for-v;
  e1 := code-for-e1;   (Comment: uses v)
  e2 := code-for-e2;   (Comment: uses v)
  result := combine(e1,e2);
end;

and in ParAlfl one might write

[ v == code-for-v;
  e1 == code-for-e1;   (Comment: uses v)
  e2 == code-for-e2;   (Comment: uses v)
  result combine(e1,e2) ]

Both of these programs are very clear and concise.

But now suppose that this same computation is to begin and end on processor p, and the subexpressions e1 and e2 are to be executed in parallel on processors q and r, respectively. In a conventional language augmented with explicit process-creation and message-passing constructs, one might write the following program:

process P0;
  v := code-for-v;
  send(v,P1);
  send(v,P2);
  e1 := receive(P1);
  e2 := receive(P2);
  result := combine(e1,e2)
end-process;

process P1;
  v := receive(P0);
  e1 := code-for-e1;   (Comment: uses v)
  send(e1,P0);
end-process;

process P2;
  v := receive(P0);
  e2 := code-for-e2;   (Comment: uses v)
  send(e2,P0);
end-process;

which is then actually run by executing something like

invoke P0 on processor p;
invoke P1 on processor q;
invoke P2 on processor r;

Note that the structure of the original program has been completely destroyed. Explicit processes and communications between them have been introduced to coordinate the parallel computation. The semantics of both the process-creation and communications constructs need to be carefully defined before the run-time behavior can be understood. This program is no longer as clear nor as concise as the original one.

On the other hand, a ParAlfl program

for this same task is simply

[ v == code-for-v;
  e1 == code-for-e1 $on q;   (Comment: uses v)
  e2 == code-for-e2 $on r;   (Comment: uses v)
  result combine(e1,e2) ] $on p

Note that if the three annotations are removed, the program is identical to the ParAlfl program given earlier! No communications primitives or special synchronization constructs are needed to send the value of v to processors q and r; standard lexical scoping mechanisms accomplish the data movement naturally and concisely. The values of e1 and e2 are sent back to processor p in the same way.
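For comparison, a rough Haskell rendering of the same structure is sketched below, using par and pseq (from the parallel package) in place of the $on annotations; where e1 and e2 run is left to the runtime, so this is weaker than ParAlfl's explicit placement, and codeForV and friends are placeholder definitions:

import Control.Parallel (par, pseq)

-- v is shared by ordinary lexical scoping; e1 and e2 may be
-- evaluated concurrently before being combined.
result :: Double
result = e1 `par` (e2 `pseq` combine e1 e2)
  where
    v  = codeForV
    e1 = codeForE1 v          -- uses v
    e2 = codeForE2 v          -- uses v
    combine = (+)
    codeForV    = 42.0        -- placeholder
    codeForE1 x = x * 2       -- placeholder
    codeForE2 x = x + 1       -- placeholder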

Eager expressions. The second form of annotation, the eager expression, arises out of the occasional need for the programmer to override the lazy-evaluation strategy of ParAlfl, since normally ParAlfl does not evaluate an expression until absolutely necessary. (This second type of annotation is not needed in a functional language with non-lazy semantics, such as pure Lisp, but as mentioned earlier, we prefer the expressiveness afforded by lazy semantics.) An eager expression has the simple form

#exp

which forces the evaluation of exp in parallel with its immediately surrounding syntactic form, as defined below:

If #exp appears as

* an argument to a function (for example, f(x,#y,z)), then it executes in parallel with the function call.

* an arm of a conditional (for example, if p then #x else y), then it executes in parallel with the conditional.

* an operand of an infix operator (for example, x^#y; another example is x and #y), then it executes in parallel with the whole operation.

* an element of a list (for example, [x,#y,z]), then it executes in parallel with the construction of the list.

Thus, for example, in the expression if p then f(#x,y) else z, the evaluation of x begins as soon as p has been determined to be true, and simultaneously the function f is invoked on its two arguments. Note that the evaluation of some subexpression begins when any expression is evaluated, and thus to evaluate that subexpression "eagerly" accomplishes nothing. For example, note the following equivalences:

if #p then x else y ≡ if p then x else y
#x and y ≡ x and y
#x + #y ≡ x + y

A special case of eager computation oc-

curs in the construction of arrays, which are almost always used in a context where the elements are computed in parallel. Because of this, the evaluations of the elements of an array are defined to occur eagerly (and in parallel, of course, if appropriately mapped).

Eager expressions are commonly used within lists. Consider, for example, the expression [x,#y]; normally lists are constructed lazily in ParAlfl, so the values of x and y are not evaluated until selected. But with the annotation shown, y would be evaluated as soon as the list was demanded. As with arrays, however, the expression does not wait for the value of y to return a fully computed value. Instead, it returns a partially constructed list just as it would with lazy evaluation.

The above discussion leads us to an im-

portant point about eager expressions: The value of an eager expression is that of the expression without the annotation. As with mapped expressions, the annotation only adds an operational semantics, and thus the user can invoke a nonterminating subcomputation, yet have the overall pro-


Page 6: Para-Functional Programming

gram terminate. Indeed, in the above example, even should y not terminate, if only the first element of the list is selected for later use, the overall program may still terminate properly. The "runaway process" that computes y is often called an irrelevant task, and there exist strategies for finding and deleting such tasks at run time. Such considerations are beyond the scope of this article, although it should be pointed out that given an automatic task-collection mechanism there are real situations in which one may wish to invoke a nonterminating computation (an example of this is given in Hudak and Smith 10).
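Haskell's par combinator (again from the parallel package) is probably the closest widely available analogue of the # annotation; in the sketch below (the names are mine), y is sparked for evaluation without being waited on, and the value of the expression is unchanged:

import Control.Parallel (par)

-- Roughly [x, #y]: spark y, but the list itself is built lazily,
-- so a consumer that inspects only the first element never waits
-- on y, mirroring the discussion above.
eagerList :: Int -> [Int]
eagerList x = y `par` [x, y]
  where y = sum [1 .. 1000000 :: Int]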

A note on determinacy. All ParAlfl programs possess the following determinacy property:

A ParAlfl program in which (1) the identifier $self appears only in pid expressions, and (2) all pid expressions terminate without error, is functionally equivalent to the same program with all of the annotations removed. That is, both programs return the same value.

(A formal statement and proof of this property depends on a formal denotational semantics for ParAlfl, which is beyond the scope of this article, but such semantics can be found in Hudak and Smith. 10)

The reason for the first constraint is that

if the mapping annotations are removed, all remaining occurrences of $self have the same value, namely the pid of the root processor. Thus, removing the annotations may change the value of the program. The purpose of the second constraint should be obvious: If the system diverges or errs when determining the processor on which to execute the body of a mapped expression, then it will never get around to computing the value of that expression.

Although neither determinacy con-

straint is severe, there are practical reasons for wanting to violate the first one (that is, for wanting to use the value of $self in other than a pid expression). The most typical situation where this arises is in a nonisotropic topology where certain processors form a boundary for the network (for example, the leaf processors in a tree, or the edge processors in a mesh). There are many distributed algorithms whose behavior at such boundaries is different from their behavior at internal nodes. To express this, one needs to know when exe-

cution is occurring at the boundary of the network, which can be conveniently determined by analyzing the value of $self.

Sample application programs

In this section two simple examples are presented that highlight the key aspects of para-functional programming. Space limitations preclude the inclusion of examples that are more complex, but some can be found in Hudak and Smith 10 and Hudak. 11

Parallel factorial. Figure 2 shows a simple parallel factorial program annotated for execution on a finite binary tree of n = 2**d - 1 processors. Although computing factorial, even in parallel, is a rather simple task, the example demonstrates several important ideas, and most other divide-and-conquer algorithms could easily fit into the same framework.

Figure 2. Divide-and-conquer factorial on finite tree.

Figure 3. Dataflow for parallel factorial.

The algorithm is based on splitting the computation into two parts at each iteration and mapping the two subtasks onto the "children" of the current processor. Note that through the normal lexical scoping rules, mid will be computed on the current processor and passed to the child processors as needed (recall the discussion in the section on "Mapped expressions," above). The functions left and right describe the network mapping necessary


[ result pfac(1,k) $on root;
  pfac(lo,hi) ==
    if lo=hi then lo
    else if lo=(hi-1) then lo*hi
    else [ result (pfac(lo,mid) $on left($self)) *
                  (pfac(mid+1,hi) $on right($self));
           mid == (lo+hi)/2 ];
  left(pe)  == if 2*pe>n then pe else 2*pe;
  right(pe) == if 2*pe>n then pe else 2*pe+1;
  root == 1;
]
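As a point of comparison, here is a minimal sketch of the same divide-and-conquer factorial in Haskell, using par and pseq from the parallel package; par only suggests parallel evaluation and cannot express the $on left/right placement, so this is an approximation of the ParAlfl program, not a transliteration:

import Control.Parallel (par, pseq)

-- Divide-and-conquer factorial: split the range, spark the left
-- half, evaluate the right half, then combine.
pfac :: Integer -> Integer -> Integer
pfac lo hi
  | lo == hi     = lo
  | lo == hi - 1 = lo * hi
  | otherwise    = l `par` (r `pseq` (l * r))
  where
    mid = (lo + hi) `div` 2
    l   = pfac lo mid
    r   = pfac (mid + 1) hi

-- pfac 1 k computes k!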

Page 7: Para-Functional Programming

Figure 4. Parallel factorial, with unique behavior at leaves.

Figure 5. ParAlfl program to solve Ux=b.

[ result xvect;
  xvect == mka(n,x);
  x(i) == [ result (b[i] - sum(n,0)) / U[i,i];
            sum(j,acc) == if j<i+1 then acc
                          else sum(j-1, acc + xvect[j]*U[i,j]) ]
]
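A rough Haskell counterpart of Figure 5, assuming unit-size blocks (so U is a plain upper triangular matrix, indexed from 1); as in the ParAlfl program, the solution vector is defined self-referentially, and laziness resolves the data dependencies, since element i consults only elements j > i:

import Data.Array (Array, listArray, (!))

-- Solve Ux = b by back-substitution; x is defined in terms of itself.
solveU :: Int -> Array (Int, Int) Double -> Array Int Double
       -> Array Int Double
solveU n u b = x
  where
    x = listArray (1, n)
          [ (b ! i - sum [x ! j * u ! (i, j) | j <- [i + 1 .. n]])
            / u ! (i, i)
          | i <- [1 .. n] ]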

Figure 6. Dataflow for matrix problem. ("BS" stands for backsubstitution.)

for this topology, and Figure 3 shows the process-mapping and flow of data between processes when k=5.

Note that the program in Figure 2 obeys the constraints required for determinacy, and thus the program returns the same value regardless of the annotations. Note further that with the mapping used, when processing reaches a leaf node all further calls to pfac are executed on the leaf processor. Routing functions of greater complexity could be devised that, for example, would reflect the computation upward once a leaf processor is reached. Alternatively, it might be desirable to use a more efficient factorial algorithm at the leaf nodes. An example of this is given in Figure 4, where the tail-recursive function sfac is invoked at the leaves. Determining that execution has reached a leaf processor requires inspection of $self, and thus the determinacy constraints are violated, yet the program still returns the same value regardless of the annotations. This constancy of values is, of course, often the case, but it cannot be guaranteed in general without the previously discussed constraints.

Solution to upper triangular block matrix. The next example is typical of problems encountered in scientific computing: The problem is to solve for the vector x in the matrix equation Ux=b, where U is an upper triangular block matrix (that is, a matrix whose elements are themselves matrices, and whose elements below the main diagonal contain all zeros). Algorithms using block matrices are especially suited to multiprocessors with nontrivial communications costs, since typically the subcomputations involving the submatrices can be done in parallel with little communication between the processors.

Figure 7. Pipelining data for matrix problem. (Diagram: processor 5 computes x5; each processor i below it back-substitutes x5, x4, ..., xi+1 in turn and then computes xi. Solid arrows mark data (and temporal) dependencies; the remaining arrows mark temporal dependencies imposed by pipelining.)

[ result pfac(1,k) $on root;
  pfac(lo,hi) ==
    if lo=hi then lo
    else if lo=(hi-1) then lo*hi
    else [ result if leaf?($self) then sfac(lo,hi,1)
                  else (pfac(lo,mid) $on left($self)) *
                       (pfac(mid+1,hi) $on right($self));
           mid == (lo+hi)/2 ];
  sfac(lo,hi,acc) == if lo=hi then lo*acc
                     else sfac(lo+1,hi,lo*acc);
  leaf?(pe) == pe >= 2**(d-1);
  left(pe)  == 2*pe;
  right(pe) == 2*pe+1;
  root == 1;
]


Page 8: Para-Functional Programming

If we ignore parallelism at the moment and concentrate instead on a functional specification of this problem, it is easy to see from basic linear algebra that each element xi in the solution vector x (of length n) can be given by the following equation:

xi = (bi - Σ_{j=i+1..n} xj*U[i,j]) / U[i,i]

where we assume for convenience that the submatrices are of unit size (and are thus represented simply as scalar quantities). Given this equation for each element, it is easy to construct the solution vector in ParAlfl, as shown in Figure 5.

This problem, as it is, has plenty of parallelism. To see this, look at Figure 6, a dataflow graph showing the data dependencies when n = 5. Clearly, once an element of the solution is computed, all of the backsubstitutions of it can be done in parallel; that is, each of the horizontal "steps" in Figure 6 can be executed. This parallelism derives solely from the data dependencies inherent in the problem, and is mirrored faithfully in the ParAlfl code. Indeed, if we have n processors sharing a common memory, we can annotate the program in Figure 5 very simply:

allelism. To see this, look at Figure 6, adataflow graph showing the data depen-dencies when n = 5. Clearly, once an ele-ment ofthe solution is computed, all ofthebacksubstitutions of it can be done in par-allel; that is, each of the horizontal"steps" in Figure 6 can be executed. Thisparallelism derives solely from the datadependencies inherent in the problem, andis mirrored faithfully in the ParAlfl code.Indeed, if we have n processors sharing acommon memory, we can annotate theprogram in Figure 5 very simply:

[ result xvect;
  xvect == mka(n,x);
  x(i) == ... $on i ]

where "..." denotes the same expressionused in Figure 5 for x(i). (Recall that theelements of an array are computed in par-allel, and thus do not require eager an-notations.)

But let us consider topologies that are more interesting. Consider, for example, a ring of n processors. Although the topology of a ring is simple, its limited capacity for interprocessor communication makes it difficult to use effectively, and it is thus a challenge for algorithm designers. We will assume that the processors are labeled consecutively around the ring from "1" to "n," and that the ith row of U and ith element of b are on processor i. We wish the solution vector x to be distributed in the same way.

We should first note that the annotated

program two paragraphs above would run perfectly well on such a topology, especially with the given distribution of data. The only data movement, in fact, would be that of each submatrix xi for use on each processor j, j > i. This data movement

[ result xvect;
  xvect == mka(n,x);
  x(i) == [ result (b[i] - sum(n,0)) / U[i,i];
            sum(j,acc) == if j<i+1 then acc
                          else sum(j-1, acc + xpipe[i][n-j+1]*U[i,j])
          ] $on i;
  xpipe == mka(n,xfn);
  xfn(i) == [ result mka(n-i+1, xlocal);
              xlocal(j) == if j=n-i+1 then xvect[i]
                           else xpipe[i+1][j] ] $on i
]

Figure 8. A program for pipelining xi around a ring of processors.

Figure 9. A vector pipeline for x.

would be done transparently by the underlying operating system, and in this case the program would probably perform adequately.

Yet in our dual role of programmer and algorithm designer we may have a particular routing strategy that is provably good and that we wish to express explicitly in the program. For example, one efficient strategy is to "pipeline" the xi around the ring as they are generated. That is, the element xi is passed to processor i-1, used there, passed to processor i-2, used there, and so on, as shown graphically in Figure 7. There are several ways to accomplish this effect in the program, and we shall explore two of them.

The first requires the least change to the

existing program, and is based on shifting the data by creating a partial copy of the solution vector on each processor, as shown in Figure 8. Note that the first four lines of this program are essentially the same as those given earlier. Figure 9 shows the construction of xpipe; note the correspondence between this diagram and the one in Figure 7.

The second way to express the pipelining of data is to interpret the algorithm from the outset as a network of dynamic processes rather than as a static set of vectors and arrays. In particular, we can conjure up the following description of a process running on processor i:

"Process i takes as input a stream ofvalues x,,xn,, ..., xi+,. It passesthis stream of values to process i-lwhile back-substituting each value intobi. When the end of the stream isreached, it computes xi and adds thisto the end of the stream being passed toprocess i-l."

Assuming the same distribution of U and b used earlier, we can represent this process description in ParAlfl as shown in Figure 10. Note that xi is annotated for eager evaluation, to override the lazy evaluation of lists. Also note the correspondence between this program and the last.


Page 9: Para-Functional Programming

[ result process(n,[]);   (Comment: begin on processor n with empty stream)
  process(i,xstr) ==
    [ xi == (b[i] - sum(xstr,n,0)) / U[i,i];
      result if i=1 then addtostr(xstr,#xi)
             else process(i-1, addtostr(xstr,#xi));
      addtostr(old,x) == old ^^ [x];
      sum(str,j,acc) == if str=[] then acc
                        else sum(tl(str), j-1, acc + hd(str)*U[i,j])
    ] $on i
]

Figure 10. Program for Ux=b simulating network of processes.
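For readers without a ParAlfl implementation, the process network of Figure 10 can be approximated with ordinary lazy lists in Haskell; the sketch below assumes accessor functions u and b for the distributed U and b (my names, not ParAlfl's) and omits the $on placement and # eagerness annotations:

-- Process i consumes the stream x_n ... x_{i+1}, back-substitutes
-- each element into b_i, computes x_i, and extends the outgoing
-- stream; the recursive call plays the role of spawning process i-1.
-- The result lists x_n first, as in the ParAlfl program.
network :: (Int -> Int -> Double) -> (Int -> Double) -> Int -> [Double]
network u b n = go n []
  where
    go i xstr
      | i == 1    = xstr'
      | otherwise = go (i - 1) xstr'
      where
        js    = [n, n - 1 .. i + 1]   -- column indices paired with xstr
        xi    = (b i - sum (zipWith (\xj j -> xj * u i j) xstr js))
                / u i i
        xstr' = xstr ++ [xi]          -- addtostr(xstr, xi)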

Figure 11. Program to embed ring in hypercube.

Figure 12. Embedding of ring of size 8 into 3-cube.

The main difference is in the choice of data structure for x: a list is used here, resulting in a recursive structuring of the program, whereas a vector was used previously, resulting in a "flat" program structure. Choices of this kind are in fact typical of any suitably rich programming language, and are equally important in parallel and sequential programming. Different data structures can, of course, be mapped in different ways to machines, but in this example the annotations are essentially the same in both programs.

To carry this example one step further, let us now consider running any of the above ParAlfl programs on a multiprocessor with a hypercube interconnection topology rather than a ring. One way to accomplish this is to simulate a ring in a hypercube by some suitable embedding. Probably the simplest such embedding is the reflected gray-code, captured by the ParAlfl functions shown in Figure 11. In


that figure, log2(i) returns the base-2 logarithm of i, rounded down to the nearest integer (the vector v is used to "cache" values of graycode(i)). For example, Figure 12 shows the embedding of a ring of size 8 into a 3-cube.
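The reflected gray-code is easy to state directly in Haskell; the recursive version below mirrors Figure 11 (without the memoizing vector v), and the standard closed form i xor (i >> 1) is included as a cross-check:

import Data.Bits (shiftR, xor)

-- Standard closed form of the reflected Gray code.
graycode :: Int -> Int
graycode i = i `xor` (i `shiftR` 1)

-- The recursive reflection, as in Figure 11.
graycodeRec :: Int -> Int
graycodeRec i
  | i < 2     = i
  | otherwise = graycodeRec (2 * mid - i - 1) + mid
  where
    mid = last (takeWhile (<= i) (iterate (* 2) 1))   -- 2**log2(i)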

If we then replace the previous annotations "... $on i" with "... $on ringtocube(i)," we arrive at the desired embedding. Note that the code for the algorithm itself did not change at all, just the annotations. Of course, a more efficient algorithm for the hypercube might exist or the initial data distribution might be different, and both cases would naturally require recoding of the main functions.

When viewed in the broad scope of software development methodologies, the use of para-functional programming suggests the following scenario:

1. One first conceives of an algorithm and expresses it cleanly in a functional programming language. This high-level program is likely to be much closer to the problem specifications than conventional language realizations, thus aiding reasoning about the program and facilitating the debugging process.

2. Once the program has been written, it is debugged and tested on either a sequential or parallel computer system. In the latter case, the compiler extracts as much parallelism as it can from the program, but with no intervention or awareness on the part of the user.

3. If the performance achieved in step two does not meet one's needs, the program is refined by affixing annotations that provide more subtle control over the evaluation process. These annotations can be added without affecting the program's functional behavior.

There are two aspects of this methodology that I think significantly facilitate program development: First, the functional aspects of a program are effectively separated from most of the operational aspects. Second, the multiprocessor is viewed as a single autonomous computer onto which a program is mapped, rather than as a group of independent processors that carry out complex communication and require complex synchronization. Together with the clean, high-level programming style afforded by functional languages, these two aspects promise to yield a simple and effective programming methodology for multiprocessor computing systems.

Extensions and implementation issues. In this article I have presented only the fundamental ideas behind para-functional programming. Work continues on several advanced features and alternative annotations that provide even more expressive power. These include: (1) annotations that reference other operational aspects of a processor, such as processing load; (2) mappings to operating system resources, such as disks and I/O devices; (3) introduction of nondeterministic primitives where needed; and (4) annotations to control memory usage. The latter two features are especially important, since they allow one to overcome two traditional objec-


ringtocube(i) == v[i];
v == mka(n,graycode);
graycode(i) == if i<2 then i
               else [ result v[2*mid-i-1] + mid;
                      mid == 2**log2(i) ]


Page 10: Para-Functional Programming

tions to programming in the functional style: the inability to deal with the nondeterminism that is prevalent, for example, in an operating system, and inefficiency in handling large data structures. Space limitations preclude me from delving into such issues, but the reader can find additional details in Hudak and Smith 10 and Hudak. 11

In addition, by concentrating in this article on how to express parallel computation, I have left unanswered many questions about how one can implement a para-functional programming language. In recent years great advances have been made in implementing functional languages for both sequential and parallel machines, and much of that work is applicable here. In particular, graph reduction provides a very natural way to coordinate the parallel evaluation of subexpressions, and solves problems such as how to migrate the values of lexically bound variables from one processor to another. At Yale a virtual parallel graph reducer called Alfalfa is currently being implemented on two commercial hypercube architectures: an Intel iPSC and an NCube hypercube. This graph-reduction engine will be able to support both implicit (dynamic) and explicit (annotated) task allocation. The only difficult language feature to support efficiently in para-functional programming is a mechanism for referencing elements in a distributed array; in most cases this is quite easy, but in certain cases it can be difficult. Although good progress has been made in this area, the work is too premature to report here.

Related work. The work that is most similar in spirit to that presented in this article is E. Shapiro's systolic programming in Concurrent Prolog; 12 the mapping semantics of systolic programming was derived from earlier work on "turtle programs" in Logo. Other related efforts include those of R. M. Keller and G. Lindstrom, 13 who, independent of our research at Yale and in the context of functional databases, suggest the use of annotations similar to mapped expressions; and F. W. Burton's 14 annotations to the lambda calculus to provide control over lazy, eager, and parallel execution. A more recent effort is that of N. S. Sridharan, 15 who suggests a "semi-applicative" programming style to control evaluation order. All in all, these efforts contribute to what I think is a powerful programming paradigm in which operational and functional behavior can coexist with little adverse interaction.

Acknowledgments

I am especially indebted to Lauren Smith (now at Los Alamos National Laboratory), to whom most of these ideas were first presented for critical review. Her efforts at applying the ideas to real problems were especially useful. A special thanks is extended to Yale's Research Center for Scientific Computation (under

References

1. J. Backus, "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs," CACM, Vol. 21, No. 8, Aug. 1978, pp. 613-641.

2. P. Henderson, Functional Programming: Application and Implementation, Prentice-Hall, Englewood Cliffs, N.J., 1980.

3. R. M. Keller and F. C. H. Lin, "Simulated Performance of a Reduction-based Multiprocessor," Computer, Vol. 17, No. 7, July 1984, pp. 70-82.

4. P. Hudak and B. Goldberg, "Distributed Execution of Functional Programs Using Serial Combinators," Proc. 1985 Int'l Conf. on Parallel Processing, Aug. 1985, pp. 831-839; also appeared in IEEE Trans. Computers, Vol. C-34, No. 10, Oct. 1985, pp. 881-891.

5. P. Hudak, "ALFL Reference Manual andProgrammer's Guide," research reportYALEU/DCS/RR-322, 2nd ed., Oct.1984, Yale University, Dept. of ComputerScience, Box 2158 Yale Station, NewHaven, CT 06520.

6. D. A. Turner, "Miranda: A Non-strictFunctional Language with PolymorphicTypes," Functional Programming Lan-guages and ComputerArchitecture, Sept.1985, Springer-Verlag, New York, pp.1-16.

the direction of Martin Schultz), which provided the motivation for much of this work. The research also benefited from useful discussions with Jonathan Young and Adrienne Bloss at Yale, Bob Keller at the University of Utah, and Joe Fasel and Elizabeth Williams at Los Alamos National Laboratory. Finally, I wish to thank the three anonymous reviewers for Computer magazine who worked with my article, as well as Assistant Editor Louise Anderson; their comments helped improve the presentation.

This research was supported in part by NSF Grants DCR-8403304 and DCR-8451415, and a Faculty Development Award from IBM.

7. R. M. Keller, "FEL Programmer's Guide," tech. report No. 7, Mar. 1982, University of Utah, Dept. of Computer Science, Merrill Engineering Bldg., Salt Lake City, UT 84112.

8. L. Augustsson, "A Compiler for Lazy ML," ACM Symp. on LISP and Functional Programming, Aug. 1984, pp. 218-227.

9. J. Darlington, P. Henderson, and D. A. Turner, Functional Programming and Its Applications, Cambridge University Press, Cambridge, UK, 1982.

10. P. Hudak and L. Smith, "Para-Functional Programming: A Paradigm for Programming Multiprocessor Systems," 13th ACM Symp. on Principles of Programming Languages, Jan. 1986, pp. 243-254.

11. P. Hudak, "Exploring Para-FunctionalProgramming," research report YALEU/DCS/RR-467, Apr. 1986, Yale University,Dept. of Computer Science, Box 2158Yale Station, New Haven, CT 06520.

12. E. Shapiro, "Systolic Programming: AParadigm of Parallel Processing," tech.report CS84-21, Aug. 1984, The Weiz-mann Institute of Science, Dept. of Ap-plied Mathematics; appeared in Proc.Int'l Conf. on Fifth-Generation Com-puter Systems, Nov. 6-9, 1984, pp.458-470.


Page 11: Para-Functional Programming


13. R. M. Keller and G. Lindstrom, "Approaching Distributed Database Implementations Through Functional Programming Concepts," Int'l Conf. on Distributed Systems, May 1985.

14. F. W. Burton, "Annotations to Control Parallelism and Reduction Order in the Distributed Evaluation of Functional Programs," ACM Trans. on Programming Languages and Systems, Vol. 6, No. 2, Apr. 1984.

15. N. S. Sridharan, "Semi-applicative Programming: An Example," tech. report, Nov. 1985, BBN Laboratories, Cambridge, Mass.

Paul Hudak received his BS degree in electrical engineering from Vanderbilt University, Nashville, Tenn., in 1973; his MS degree in computer science from the Massachusetts Institute of Technology, Cambridge, Mass., in 1974; and his PhD degree in computer science from the University of Utah, Salt Lake City, Utah, in 1982. From 1974 to 1979 he was a member of the technical staff at Watkins-Johnson Co., Gaithersburg, Md.

Hudak is currently an associate professor in the Programming Languages and Systems Group in the Dept. of Computer Science at Yale University, New Haven, Conn., a position he has held since 1982. His primary research interests are functional and logic programming, parallel computing, and semantic program analysis.

He is a recipient of an IBM Faculty Development Award (1984-85) and an NSF Presidential Young Investigator Award (1985).

Readers may write to Paul Hudak at Yale University, Dept. of Computer Science, Box 2158 Yale Station, New Haven, CT 06520.

