
Implementation and Performance of Munin

John B. Carter, John K. Bennett, and Willy Zwaenepoel
Computer Systems Laboratory
Rice University
Houston, Texas

Abstract

Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, shared program variables are annotated with their expected access pattern, and these annotations are then used by the runtime system to choose a consistency protocol best suited to that access pattern. Release consistency allows Munin to mask network latency and reduce the number of messages required to keep memory consistent. Munin's multi-protocol release consistency is implemented in software using a delayed update queue that buffers and merges pending outgoing writes. A sixteen-processor prototype of Munin is currently operational. We evaluate its implementation and describe the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs. Munin achieves this level of performance with only minor annotations to the shared memory programs.

This research was supported in part by the National Science Foundation under Grants CDA-8619893 and CCR-9010351, by the IBM Corporation under Research Agreement No. 20170041, and by a NASA Graduate Fellowship. Willy Zwaenepoel was on sabbatical at UT-Sydney and INRIA-Rocquencourt while a portion of this research was conducted.

1 Introduction

A distributed shared memory (DSM) system provides the abstraction of a shared address space spanning the processors of a distributed memory multiprocessor. This abstraction simplifies the programming of distributed memory multiprocessors and allows parallel programs written for shared memory machines to be ported easily. The challenge in building a DSM system is to achieve good performance without requiring the programmer to deviate significantly from the conventional shared memory programming model. High memory latency and the high cost of sending messages make this difficult.

To meet this challenge, Munin incorporates two novel features. First, Munin employs multiple consistency protocols. Each shared variable declaration is annotated by its expected access pattern. Munin then chooses a consistency protocol suited to that pattern. Second, Munin is the first software DSM system that provides a release-consistent memory interface [19]. Roughly speaking, release consistency requires memory to be consistent only at specific synchronization points, resulting in a reduction of overhead and number of messages.

The Munin programming interface is the same as that of conventional shared memory parallel programming systems, except that it requires (i) all shared variable declarations to be annotated with their expected access pattern, and (ii) all synchronization to be visible to the runtime system. Other than that, Munin provides thread, synchronization, and data sharing facilities like those found in shared memory parallel programming systems [7].

We report on the performance of two Munin programs: Matrix Multiply and Successive Over-Relaxation (SOR). We have hand-coded the same programs on the same hardware using message passing, taking special care to ensure that the two versions of each program perform identical computations. Comparison between the Munin and the message passing versions has allowed us to assess the overhead associated with our approach. This comparison is encouraging: the performance of the Munin programs differs from that of the hand-coded message passing programs by at most ten percent, for configurations from one to sixteen processors.

The Munin prototype implementation consists of four parts: a simple preprocessor that converts the program annotations into a format suitable for use by the Munin runtime system, a modified linker that creates the shared memory segment, a collection of library routines that are linked into each Munin program, and operating system support for page fault handling and page table manipulation. This separation of functionality has resulted in a system that is largely machine- and language-independent, and we plan to port it to various other platforms and languages. The current prototype is implemented on a workstation-based distributed memory multiprocessor consisting of 16 SUN-3/60s connected by a dedicated 10 Mbps Ethernet. It makes use of a version of the V kernel [11] that allows user threads to handle page faults and to modify page tables.

A preliminary Munin design paper has been published previously [5], as well as some measurements on shared memory programs that corroborate the basic design [6]. This paper presents a refinement of the design, and then concentrates on the implementation of Munin and its performance.

2 Munin Overview

2.1 Munin Programming

Munin programmers write parallel programs using threads, as they would on a shared memory multiprocessor [7]. Munin provides library routines, CreateThread() and DestroyThread(), for this purpose. Any required user initialization is performed by a sequential user_init() routine, in which the programmer may also specify the number of threads and processors to be used. Similarly, there is an optional sequential user_done() routine that is run when the computation has finished. Munin currently does not perform any thread migration or global scheduling. User threads are run in a round robin fashion on the node on which they were created.
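To make this structure concrete, the following is a minimal sketch of how the pieces of a Munin program fit together. The routine names (CreateThread(), CreateBarrier(), WaitAtBarrier(), user_init(), user_done()) are those described in this paper; their exact signatures, the barrier handle type, the worker count, and the name of the user root thread's entry function are assumptions made only for illustration.

    /* Minimal sketch of a Munin program, assuming C-like syntax.  Signatures,
     * barrier handle type, and the user root entry point are assumptions. */

    #define NUM_WORKERS 16

    shared read only    int input[4096];   /* annotated shared data (Section 2.3.2) */
    shared write shared int output[4096];

    barrier_t finish;                       /* hypothetical barrier handle type */

    void worker(long id)
    {
        /* ... each worker computes its section of output from input ... */
        WaitAtBarrier(finish, NUM_WORKERS + 1);  /* all workers plus the root thread */
    }

    void user_init(void)
    {
        /* sequential user initialization: set up input, create synchronization */
        finish = CreateBarrier();
    }

    void user_root(void)                    /* body of the user root thread (name assumed) */
    {
        for (long id = 0; id < NUM_WORKERS; id++)
            CreateThread(worker, id);       /* threads run round-robin on their node */
        WaitAtBarrier(finish, NUM_WORKERS + 1);
    }

    void user_done(void)
    {
        /* optional sequential wrap-up once the computation has finished */
    }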

A Munin shared object corresponds to a single shared variable, although like Emerald [9], the programmer can specify that a collection of variables be treated as a single object or that a large variable be treated as a number of independent objects by the runtime system. By default, variables larger than a virtual memory page are broken into multiple page-sized objects. We use the term "object" to refer to an object as seen by the runtime system, i.e., a program variable, an 8-kilobyte (page-sized) region of a variable, or a collection of variables that share an 8-kilobyte page. Currently, Munin only supports statically allocated shared variables, although this limitation can be removed by a minor modification to the memory allocator. The programmer annotates the declaration of shared variables with a sharing pattern to specify the way in which the variable is expected to be accessed. These annotations indicate to the Munin runtime system what combination of protocols to use to keep shared objects consistent (see Section 2.3).

Synchronization is supported by library routines for the manipulation of locks and barriers. These library routines include CreateLock(), AcquireLock(), ReleaseLock(), CreateBarrier(), and WaitAtBarrier(). All synchronization operations must be explicitly visible to the runtime system (i.e., must use the Munin synchronization facilities). This restriction is necessary for release consistency to operate correctly (see Section 2.2).

2.2 Software Release Consistency

Release consistency was introduced by the DASH system. A detailed discussion and a formal definition can be found in the papers describing DASH [19, 23]. We summarize the essential aspects of that discussion.

Release consistency requires that each shared memory access be classified either as a synchronization access or an ordinary access.(1) Furthermore, each synchronization access must be classified as either a release or an acquire. Intuitively, release consistency requires the system to recognize synchronization accesses as special events, enforce normal synchronization ordering requirements,(2) and guarantee that the results of all writes performed by a processor prior to a release be propagated before a remote processor acquires the lock that was released.

(1) We ignore chaotic data [19] in this presentation.
(2) For example, only one thread can acquire a lock at a time, and a thread attempting to acquire a lock must block until the acquire is successful.

More formally, the following conditions are required for ensuring release consistency:

1. Before an ordinary load or store is allowed to perform with respect to any other processor, all previous acquires must be performed.

2. Before a release is allowed to perform with respect to any other processor, all previous ordinary loads and stores must be performed.

3. Before an acquire is allowed to perform with respect to any other processor, all previous releases must be performed. Before a release is allowed to perform with respect to any other processor, all previous acquires and releases must be performed.

The term "all previous accesses" refers to all accesses by the same thread that precede the current access in program order. A load is said to have "performed with respect to another processor" when a subsequent store on that processor cannot affect the value returned by the load. A store is said to have "performed with respect to another processor" when a subsequent load by that processor will return the value stored (or the value stored in a later store). A load or a store is said to have "performed" when it has performed with respect to all other processors.
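In practice, a conventionally written critical section already supplies the required classification: the lock operations are the acquire and the release, and everything in between is an ordinary access. The fragment below is a minimal sketch assuming the lock routines listed in Section 2.1; the shared record, its annotation, and the lock handle type are assumptions chosen for illustration.

    shared migratory struct { int hits; int misses; } stats;  /* accessed only inside a critical section */
    lock_t stats_lock;                                        /* hypothetical handle type, made with CreateLock() */

    void record(int hit)
    {
        AcquireLock(stats_lock);   /* acquire: must perform before the ordinary
                                      accesses below may perform (condition 1)      */
        if (hit) stats.hits++;     /* ordinary loads and stores                      */
        else     stats.misses++;
        ReleaseLock(stats_lock);   /* release: may not perform until both stores have
                                      performed (condition 2); the next acquirer of
                                      stats_lock therefore sees the updated counters */
    }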

Previous DSM systems [3, 10, 17, 25, 27] are based on sequential consistency [22]. Sequential consistency requires, roughly speaking, that each modification to a shared object become visible immediately to the other processors in the system. Release consistency postpones until the next release the time at which updates must become visible. This allows updates to be buffered until that time, and avoids having to block a thread until it is guaranteed that its current update has become visible everywhere. Furthermore, if multiple updates need to go to the same destination, they can be coalesced into a single message. The use of release consistency thus allows Munin to mask memory access latency and to reduce the number of messages required to keep memory consistent. This is important on a distributed memory multiprocessor where remote memory access latency is significant, and the cost of sending a message is high.

To implement release consistency, Munin requires that all synchronization be done through system-supplied synchronization routines. We believe this is not a major constraint, as many shared memory parallel programming environments already provide efficient synchronization packages. There is therefore little incentive for programmers to implement separate mechanisms. Unlike DASH, Munin does not require that each individual shared memory access be marked.

Gharachorloo et al. [16, 19] have shown that a large class of programs, essentially programs with "enough" synchronization, produce the same results on a release-consistent memory as on a sequentially-consistent memory. Munin's multiple consistency protocols obey the ordering requirements imposed by release consistency, so, like DASH programs, Munin programs with "enough" synchronization produce the same results under Munin as under a sequentially-consistent memory. The experience with DASH and Munin indicates that almost all shared memory parallel programs satisfy this criterion. No modifications are necessary to these programs, other than making all synchronization operations utilize the Munin synchronization facilities.

2.3 Multiple Consistency Protocols

Several studies of shared memory parallel programs have indicated that no single consistency protocol is best suited for all parallel programs [6, 14, 15].

Furthermore, within a single program, different shared variables are accessed in different ways, and a particular variable's access pattern can change during execution [6].

Munin allows a separate consistency protocol for each shared variable, tuned to the access pattern of that particular variable. Moreover, the protocol for a variable can be changed over the course of the execution of the program. Munin uses program annotations, currently provided by the programmer, to choose a consistency protocol suited to the expected access pattern of each shared variable.

The implementation of multiple protocols is divided into two parts: high-level sharing pattern annotations and low-level protocol parameters. The high-level annotations are specified as part of a shared variable declaration. These annotations correspond to the expected sharing pattern for the variable. The current prototype supports a small collection of these annotations that closely correspond to the sharing patterns observed in our earlier study of shared memory access patterns [6]. The low-level protocol parameters control specific aspects of the individual protocols, such as whether an object may be replicated or whether to use invalidation or update to maintain consistency. In Section 2.3.1, we discuss the low-level protocol parameters that can be varied under Munin, and in Section 2.3.2 we discuss the high-level sharing patterns supported in the current Munin prototype.

2.3.1 Protocol Parameters

Munin's consistency protocols are derived by varying eight basic protocol parameters:

- Invalidate or Update? (I) This parameter specifies whether changes to an object should be propagated by invalidating or by updating remote copies.

- Replicas allowed? (R) This parameter specifies whether more than one copy of an object can exist in the system.

- Delayed operations allowed? (D) This parameter specifies whether or not the system may delay updates or invalidations when the object is modified.

- Fixed owner? (FO) This parameter directs Munin not to propagate ownership of the object. The object may be replicated on reads, but all writes must be sent to the owner, from where they may be propagated to other nodes.

- Multiple writers allowed? (M) This parameter specifies that multiple threads may concurrently modify the object, with or without intervening synchronization.

- Stable sharing pattern? (S) This parameter indicates that the object has a stable access pattern, i.e., the same threads access the object in the same way during the entire execution of the program. If a different thread attempts to access the object, Munin generates a runtime error. For stable sharing patterns, Munin always sends updates to the same nodes. This allows updates to be propagated to nodes prior to these nodes requesting the data.

- Flush changes to owner? (Fl) This parameter directs Munin to send changes only to the object's owner and to invalidate the local copy whenever the local thread propagates changes.

- Writable? (W) This parameter specifies whether the shared object can be modified. If a write is attempted to a non-writable object, Munin generates a runtime error.
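Inside the runtime system these parameters amount to a handful of per-object flags. The paper only says that they are stored as "protocol parameter bits" in each object directory entry (Section 3.2), so the struct layout and field names in the sketch below are assumptions, not Munin's actual representation.

    /* One possible in-memory representation of the eight protocol parameters.
     * Layout and names are assumptions; Munin stores "protocol parameter bits". */
    typedef struct {
        unsigned invalidate       : 1;  /* I  - propagate changes by invalidation (vs. update)  */
        unsigned replicas         : 1;  /* R  - more than one copy may exist                     */
        unsigned delayed_ops      : 1;  /* D  - updates/invalidations may be delayed (DUQ)       */
        unsigned fixed_owner      : 1;  /* FO - ownership is never propagated                    */
        unsigned multiple_writers : 1;  /* M  - concurrent writers are allowed                   */
        unsigned stable_sharing   : 1;  /* S  - same threads access the object in the same way   */
        unsigned flush_to_owner   : 1;  /* Fl - send changes only to the owner, then invalidate  */
        unsigned writable         : 1;  /* W  - writes to the object are legal                   */
    } protocol_params_t;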

2.3.2 Sharing Annotations

Sharing annotations are added to each shared variable declaration, to guide Munin in its selection of the parameters of the protocol used to keep each object consistent. While these annotations are syntactically part of the variable's declaration, they are not programming language types, and as such they do not nest or cause compile-time errors if misused. Incorrect annotations may result in inefficient performance or in runtime errors that are detected by the Munin runtime system.

Read-only objects are the simplest form of shared data. Once they have been initialized, no further updates occur. Thus, the consistency protocol simply consists of replication on demand. A runtime error is generated if a thread attempts to write to a read-only object.

For migratory objects, a single thread performs multiple accesses to the object, including one or more writes, before another thread accesses the object [29]. Such an access pattern is typical of shared objects that are accessed only inside a critical section. The consistency protocol for migratory objects is to migrate the object to the new thread, provide it with read and write access (even if the first access is a read), and invalidate the original copy. This protocol avoids a write miss and a message to invalidate the old copy when the new thread first modifies the object.

Write-shared objects are concurrently written by multiple threads, without the writes being synchronized, because the programmer knows that the updates modify separate words of the object. However, because of the way that objects are laid out in memory, there may be false sharing. False sharing occurs when two shared variables reside in the same consistency unit, such as a cache block or a virtual memory page.

In systems that do not support multiple writers to an object, the consistency unit may be exchanged between processors even though the processors are accessing different objects.

Producer-consumer objects are written (produced) by one thread and read (consumed) by one or more other threads. The producer-consumer consistency protocol is to replicate the object, and to update, rather than invalidate, the consumer's copies of the object when the object is modified by the producer. This eliminates read misses by the consumer threads. Release consistency allows the producer's updates to be buffered until the producer releases the lock that protects the objects. At that point, all of the changes can be passed to the consumer threads in a single message. Furthermore, producer-consumer objects have stable sharing relationships, so the system can determine once which nodes need to receive updates of an object, and use that information thereafter. If the sharing pattern changes unexpectedly, a runtime error is generated.

Reduction objects are accessed via Fetch_and_Φ operations. Such operations are equivalent to a lock acquisition, a read followed by a write of the object, and a lock release. An example of a reduction object is the global minimum in a parallel minimum path algorithm, which would be maintained via a Fetch_and_min. Reduction objects are implemented using a fixed-owner protocol.

Result objects are accessed in phases. They are alternately modified in parallel by multiple threads, followed by a phase in which a single thread accesses them exclusively. The problem with treating these objects as standard write-shared objects is that when the multiple threads complete execution, they unnecessarily update the other copies. Instead, updates to result objects are sent back only to the single thread that requires exclusive access.

Conventional objects are replicated on demand and are kept consistent by requiring a writer to be the sole owner before it can modify the object. Upon a write miss, an invalidation message is transmitted to all other replicas. The thread that generated the miss blocks until it has the only copy in the system [24]. A shared object is considered conventional if no annotation is provided by the programmer.

The combination of protocol parameter settings for each annotation is summarized in Table 1.

    Annotation           I  R  D  FO  M  S  Fl  W
    Read-only            N  Y  -  -   -  -  -   N
    Migratory            Y  N  -  N   N  -  N   Y
    Write-shared         N  Y  Y  N   Y  N  N   Y
    Producer-Consumer    N  Y  Y  N   Y  Y  N   Y
    Reduction            N  Y  N  Y   N  -  N   Y
    Result               N  Y  Y  Y   Y  -  Y   Y
    Conventional         Y  Y  N  N   N  -  N   Y

    Table 1: Munin Annotations and Corresponding Protocol Parameters

New sharing annotations can be added easily by modifying the preprocessor that parses the Munin program annotations. For instance, we have considered supporting an invalidation-based protocol with delayed invalidations and multiple writers, essentially invalidation-based write-shared objects, but we have chosen not to implement such a protocol until we encounter a need for it.
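As a concrete illustration, shared variable declarations carrying these annotations look roughly as follows. The declaration style mirrors the examples in Section 4 ("shared read only ...", "shared producer consumer ..."); the variable names and the exact spelling of the remaining annotation keywords are assumptions.

    /* Illustrative declarations, one per sharing annotation (keywords assumed
     * to follow the style of the examples in Section 4). */

    shared read only         int config[64];         /* initialized once, then only read        */
    shared migratory         int job_counter;        /* accessed only inside a critical section */
    shared write shared      int histogram[4096];    /* concurrent writers touch disjoint words */
    shared producer consumer int boundary_row[512];  /* written by one thread, read by others   */
    shared reduction         int global_min;         /* updated via Fetch_and_min               */
    shared result            int partial_sums[16];   /* written in parallel, read by one thread */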

2.4 Advanced Programming

For Matrix Multiply and Successive Over-Relaxation, the two Munin programs discussed in this paper, simply annotating each shared variable's declaration with a sharing pattern is sufficient to achieve performance comparable to a hand-coded message passing version. Munin also provides a small collection of library routines that allow the programmer to fine-tune various aspects of Munin's operation. These "hints" are optional performance optimizations.

In Munin, the programmer can specify the logical connections between shared variables and the synchronization objects that protect them [27]. Currently, this information is provided by the user using an AssociateDataAndSynch() call. If Munin knows which objects are protected by a particular lock, the required consistency information is included in the message that passes lock ownership. For example, if access to a particular object is protected by a particular lock, such as an object accessed only inside a critical section, Munin sends the new value of the object in the message that is used to pass lock ownership. This avoids one or more access misses when the new lock owner first accesses the protected data.

The PhaseChange() library routine purges the accumulated sharing relationship information (i.e., what threads are accessing what objects in a producer-consumer situation). This call is meant to support adaptive grid or sparse matrix programs in which the sharing relationships are stable for long periods of time between problem re-distribution phases. The shared matrices can be declared producer-consumer, which requires that the sharing behavior be stable, and PhaseChange() can then be invoked whenever the sharing relationships change.

ChangeAnnotation() modifies the expected sharing pattern of a variable and hence the protocol used to keep it consistent. This lets the system adapt to dynamic changes in the way a particular object is accessed. Since the sharing pattern of an object is an indication to the system of the consistency protocol that should be used to maintain consistency, the invocation of ChangeAnnotation() may require the system to perform some immediate work to bring the current state of the object up-to-date with its new sharing pattern.

Invalidate() deletes the local copy of an object, and migrates it elsewhere if it is the sole copy, or updates remote copies with any changes that may have occurred.

Flush() advises Munin to flush any buffered writes immediately rather than waiting for a release.

SingleObject() advises Munin to treat a large (multi-page) variable as a single object rather than breaking it into smaller page-sized objects.

Finally, PreAcquire() advises Munin to acquire a local copy of a particular object in anticipation of future use, thus avoiding the latency caused by subsequent read misses.
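A sketch of how some of these hints might appear in an adaptive grid code follows. The routine names are those listed above, but their argument lists are not given in the paper, so the signatures, the annotation constant, and the variables they operate on are all assumptions.

    /* Hypothetical use of Munin's tuning hints (signatures assumed). */

    shared producer consumer double grid[1024][1024];
    lock_t grid_lock;                      /* hypothetical lock handle type */

    void setup(void)
    {
        grid_lock = CreateLock();
        /* Tell Munin that grid is protected by grid_lock, so updated values can
         * ride along in the message that passes lock ownership. */
        AssociateDataAndSynch(grid, grid_lock);
    }

    void redistribute(void)
    {
        /* The producer-consumer sharing relationships are about to change:
         * discard the accumulated copyset information before repartitioning. */
        PhaseChange(grid);

        /* If the access pattern itself changes, switch protocols as well
         * (annotation constant name assumed). */
        ChangeAnnotation(grid, WRITE_SHARED);
    }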

3 Implementation

3.1 Overview

Munin executes a distributed directory-based cache consistency protocol [1] in software, in which each directory entry corresponds to a single object. Munin also implements locks and barriers, using a distributed queue-based synchronization protocol [20, 26].

During compilation, the sharing annotations are read by the Munin preprocessor, and an auxiliary file is created for each input file. These auxiliary files are used by the linker to create a shared data segment and a shared data description table, which are appended to the Munin executable file. The program is then linked with the Munin runtime library.

When the application program is invoked, the Munin root thread starts running. It initializes the shared data segment, creates Munin worker threads to handle consistency and synchronization functions, and registers itself with the kernel as the address space's page fault handler (as is done by Mach's external pagers [28]). It then executes the user initialization routine user_init(), spawns the number of remote copies of the program specified by user_init(), and initializes the remote shared data segments. Finally, the Munin root thread creates and runs the user root thread. The user root thread in turn creates user threads on the remote nodes.

Whenever a user thread has an access miss or executes a synchronization operation, the Munin root thread is invoked. The Munin root thread may call on one of the local Munin worker threads or a remote Munin root thread to perform the necessary operations. Afterwards, it resumes the user thread.

3.2 Data Object Directory

The data object directory within each Munin node maintains information about the state of the global shared memory. This directory is a hash table that maps an address in the shared address space to the entry that describes the object located at that address. The data object directory on the Munin root node is initialized from the shared data description table found in the executable file, whereas the data object directory on the other nodes is initially empty. When Munin cannot find an object directory entry in the local hash table, it requests a copy from the object's home node, which for statically defined objects is the root node. Object directory entries contain the following fields:

- Start address and Size: used as the key for looking up the object's directory entry in a hash table, given an address within the object.

- Protocol parameter bits: represent the protocol parameters described in Section 2.3.1.

- Object state bits: characterize the dynamic state of the object, e.g., whether the local copy is valid, writable, or modified since the last flush, and whether a remote copy of the object exists.

- Copyset: used to specify which remote processors have copies of the object that must be updated or invalidated. For a small system, such as our prototype, a bitmap of the remote processors is sufficient.(3)

- Synchq (optional): a pointer to the synchronization object that controls access to the object (see Section 2.4).

- Probable owner (optional): used as a "best guess" to reduce the overhead of determining the identity of the Munin node that currently owns the object [24]. The identity of the owner node is used by the ownership-based protocols (migratory, conventional, and reduction), and is also used when an object is locked in place (reduction) or when the changes to the object should be flushed only to its owner (result).

- Home node (optional): the node at which the object was created. It is used for a few record keeping functions and as the node of last resort if the system ever attempts to invalidate all remote copies of an object.

- Access control semaphore: provides mutually exclusive access to the object's directory entry.

- Links: used for hashing and enqueueing the object's directory entry.

(3) This approach does not scale well to larger systems, but an earlier study of parallel programs suggests that a processor list is often quite short [29]. The common exception to this rule occurs when an object is shared by every processor, and a special All_Nodes value can be used to indicate this case.
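Collected into a single declaration, a directory entry might look like the sketch below. The field set follows the list above, but the concrete types, the 16-bit copyset (one bit per node in the prototype), and all names are assumptions rather than Munin's actual definitions.

    #include <stddef.h>
    #include <stdint.h>

    struct synch_object;                     /* synchronization object (Section 3.4), left opaque */
    typedef int semaphore_t;                 /* placeholder for the kernel semaphore type         */

    typedef struct object_entry {
        void                *start;          /* start address: with size, the hash lookup key      */
        size_t               size;
        uint16_t             param_bits;     /* protocol parameter bits (Section 2.3.1)            */
        uint16_t             state_bits;     /* valid, writable, modified-since-flush, remote copy */
        uint16_t             copyset;        /* bitmap of remote nodes holding a copy              */
        struct synch_object *synchq;         /* optional: synchronization object guarding the data */
        int16_t              probable_owner; /* optional: best guess at the current owner node     */
        int16_t              home_node;      /* optional: node where the object was created        */
        semaphore_t          entry_sem;      /* mutual exclusion on this directory entry           */
        struct object_entry *hash_link;      /* links for hashing and enqueueing (e.g., on the DUQ)*/
        struct object_entry *queue_link;
    } object_entry_t;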

3.3 Delayed Update Queue

The delayed update queue (DUQ) is used to buffer pending outgoing write operations as part of Munin's software implementation of release consistency. A write to an object that allows delayed updates, as specified by the protocol parameter bits, is stored in the DUQ. The DUQ is flushed whenever a local thread releases a lock or arrives at a barrier.

Munin uses the virtual memory hardware to detect and enqueue changes to objects. Initially, and after each time that the DUQ is flushed, the shared objects handled by the DUQ are write-protected using the virtual memory hardware. When a thread first attempts to write to such an object, the resulting protection fault invokes Munin. The object's directory entry is put on the DUQ, write-protection is removed so that subsequent writes do not experience consistency overhead, and the faulting thread is resumed. If multiple writers are allowed on the object, a copy (twin) of the object is also made. This twin is used to determine which words within the object have been modified since the last update.

When a thread releases a lock or reaches a barrier, the modifications to the objects enqueued on the DUQ are propagated to their remote copies.(4) The set of remote copies is either immediately available in the Copyset in the data object directory, or it must be dynamically determined. The algorithm that we currently use to dynamically determine the Copyset is somewhat inefficient. We have devised, but not yet implemented, an improved algorithm that uses the owner node to collect Copyset information. Currently, a message indicating which objects have been modified locally is sent to all other nodes. Each node replies with a message indicating the subset of these objects for which it has a copy. If the protocol parameters indicate that the sharing relationship is stable, this determination is performed only once.

(4) This is a conservative implementation of release consistency, because the updates are propagated at the time of the release, rather than being delayed until the release is performed (see Section 2.2).

If an enqueued object does not have a twin (i.e., multiple writers are not allowed), Munin transmits updates or invalidations to nodes with remote copies, as indicated by the invalidate protocol parameter bit in the object's directory entry. If the object does have a twin, Munin performs a word-by-word comparison of the object and its twin, and run-length encodes the results of this "diff" into the space allocated for the twin. Each run consists of a count of identical words, the number of differing words that follow, and the data associated with those differing words. The encoded object is sent to the nodes that require updates, where the object is decoded and the changes merged into the original object. If a Munin node with a dirty copy of an object receives an update for that object, it incorporates the changes immediately. If a Munin node with a dirty copy of an object receives an invalidation request for that object and multiple writers are allowed, any pending local updates are propagated. Otherwise, a runtime error is generated.
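The encoding step can be sketched as follows. This is a minimal illustration of the run format described above (a count of identical words, a count of differing words, then the differing data); the function name and the separate output buffer are assumptions, since the real implementation encodes into the space occupied by the twin.

    #include <stddef.h>
    #include <stdint.h>

    /* Run-length encode the difference between an object and its twin.
     * Each run is: count of identical words, count of differing words,
     * then the differing words.  Returns the number of words written. */
    size_t encode_diff(const uint32_t *object, const uint32_t *twin,
                       size_t nwords, uint32_t *out)
    {
        size_t i = 0, o = 0;
        while (i < nwords) {
            size_t same = 0, diff = 0;
            while (i + same < nwords && object[i + same] == twin[i + same])
                same++;                              /* run of identical words */
            while (i + same + diff < nwords &&
                   object[i + same + diff] != twin[i + same + diff])
                diff++;                              /* run of modified words  */
            out[o++] = (uint32_t)same;
            out[o++] = (uint32_t)diff;
            for (size_t k = 0; k < diff; k++)        /* copy the modified data */
                out[o++] = object[i + same + k];
            i += same + diff;
        }
        return o;
    }

On the receiving side, a decoder walks the same run structure, skipping over the identical-word counts and copying the differing words into its copy of the object, which is how the changes are merged.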

This approach works well when there are multiple writes to an object between DUQ flushes, which allows the expense of the copy and subsequent comparison to be amortized over a large number of write accesses. Table 2 breaks down the time to handle updates to an 8-kilobyte object through the DUQ. This includes the time to handle a fault (including resuming the thread), make a copy of the object, encode changes to the object, transmit them to a single remote node, decode them remotely, and reply to the original sender. We present the results for three different modification patterns. In the first pattern, a single word within the object has changed. In the second, every word in the object has changed. In the third, every other word has changed, which is the worst case for our run-length encoding scheme because there are a maximum number of minimum-length runs.

    Component         One Word   All Words   Alternate Words
    Handle Fault        2.01        2.01         2.01
    Copy object         1.15        1.15         1.15
    Encode object       3.07        4.79         6.57
    Transmit object     1.72       12.47        12.47
    Decode object       3.12        4.86         6.68
    Reply               2.27        2.27         2.27
    Total              13.34       27.55        31.15

    Table 2: Time to Handle an 8-kilobyte Object through the DUQ (msec.)

We considered and rejected two other approaches for implementing release consistency in software:

1. Force the thread to page fault on every write to a replicated object so that the modified words can be queued as they are accessed.

2. Have the compiler add code to log writes to replicated objects as part of the write.

The first approach works well if an object is only modified a small number of times between DUQ flushes, or if the page fault handling code can be made extremely fast. Since it is quite common for an object to be updated multiple times between DUQ flushes [6], the added overhead of handling multiple page faults makes this approach generally unacceptable. The second approach was used successfully by the Emerald system [9]. We chose not to explore this approach in the prototype because we have a relatively fast page fault handler, and we did not want to modify the compiler. This approach is an attractive alternative for systems that do not support fast page fault handling or modification of virtual memory mappings, such as the iPSC-i860 hypercube [12]. However, if the number of writes to a particular object between DUQ flushes is high, as is often the case [6], this approach will perform relatively poorly because each write to a shared object will be slowed. We intend to study this approach more closely in future system implementations.

3.4 Synchronization Support

Synchronization objects are accessed in a fundamentally different way than data objects [6], so Munin does not provide synchronization through shared memory. Rather, each Munin node interacts with the other nodes to provide a high-level synchronization service. Munin provides support for distributed locks and barriers. More elaborate synchronization objects, such as monitors and atomic integers, can be built using these basic mechanisms.
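For example, an atomic integer can be layered on the lock primitives roughly as shown below. The read-modify-write-under-a-lock structure mirrors the description of reduction objects in Section 2.3.2; the lock handle type, the annotation choice, and the helper name are assumptions.

    shared reduction int counter;     /* read-modify-write under a lock */
    lock_t counter_lock;              /* hypothetical handle, created once with CreateLock() */

    int fetch_and_add(int delta)
    {
        AcquireLock(counter_lock);    /* acquire */
        int old = counter;            /* ordinary read  */
        counter = old + delta;        /* ordinary write */
        ReleaseLock(counter_lock);    /* release: the update propagates no later than here */
        return old;
    }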

Munin employs a queue-based implementation of locks, which allows a thread to request ownership of a lock and then to block awaiting a reply without repeated queries. Munin uses a synchronization object directory, analogous to the data object directory, to maintain information about the state of the synchronization objects. For each lock, a queue identifies the user threads waiting for the lock, so a release-acquire pair can be performed with a single message exchange if the acquire is pending when the release occurs. The queue itself is distributed to improve scalability.

When a thread wants to acquire a lock, it calls AcquireLock(), which invokes Munin to find the lock in the synchronization object directory. If the lock is local and free, the thread immediately acquires the lock and continues executing. If the lock is not free, Munin sends a request to the probable lock owner to find the actual owner, possibly requiring the request to be forwarded multiple times. When the request arrives at the owner node, ownership is forwarded directly to the requester if the lock is free. Otherwise, the owner forwards the request to the thread at the end of the queue, which puts the requesting thread on the lock's queue. Each enqueued thread knows only the identity of the thread that follows it on the queue. When a thread performs an Unlock() and the associated queue is non-empty, lock ownership is forwarded directly to the thread at the head of the queue.

To wait at a barrier, a thread calls WaitAtBarrier(), which causes a message to be sent to the owner node, and the thread to be blocked awaiting a reply. When the Munin root thread on the owner node has received messages from the specified number of threads, it replies to the blocked threads, causing them to be resumed. For future implementations on larger systems, we envision the use of barrier trees and other more scalable schemes [21].

4 Performance

We have measured the performance of two Munin programs, Matrix Multiply and Successive Over-Relaxation (SOR). We have also hand-coded the same programs on the same hardware using the underlying message passing primitives. We have taken special care to ensure that the actual computational components of both versions of each program are identical. This section describes in detail the actions of the Munin runtime system during the execution of these two programs, and reports the performance of both versions of these programs. Both programs make use of the DUQ to mitigate the effects of false sharing and thus improve performance. They also exploit Munin's multiple consistency protocols to reduce the consistency maintenance overhead.

4.1 Matrix Multiply

The shared variables in Matrix Multiply are declared as follows:

    shared read only int input1[N][N];
    shared read only int input2[N][N];
    shared result    int output[N][N];

The user_init() routine initializes the input matrices and creates a barrier via a call to CreateBarrier(). After creating worker threads, the user root thread waits on the barrier by calling WaitAtBarrier(). Each worker thread computes a portion of the output matrix using a standard (sub)matrix multiply routine. When a worker thread completes its computation, it performs a WaitAtBarrier() call. After all workers reach the barrier, the program terminates. The program does not utilize any of the optimizations described in Section 2.4.
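The worker routine itself is an ordinary submatrix multiply; one possible version is sketched below. Only the shared declarations above come from the paper; the row partitioning, the worker signature, the barrier handle, and the worker count are assumptions.

    /* One possible worker for Matrix Multiply: each worker computes a band of
     * rows of output (partitioning, signature, and barrier arguments assumed). */
    void worker(long id)                       /* id in [0, NUM_WORKERS) */
    {
        long rows  = N / NUM_WORKERS;          /* assume N is divisible by NUM_WORKERS */
        long first = id * rows;

        for (long i = first; i < first + rows; i++)
            for (long j = 0; j < N; j++) {
                int sum = 0;
                for (long k = 0; k < N; k++)
                    sum += input1[i][k] * input2[k][j];
                output[i][j] = sum;            /* writes buffered in the DUQ (result object) */
            }

        WaitAtBarrier(done, NUM_WORKERS + 1);  /* workers plus the user root thread */
    }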

On the root node, the input matrices are mapped as read-only, based on the read-only annotation, and the output matrix is mapped as read-write, based on the result annotation. When a worker thread first accesses an input matrix, the resulting page fault is handled by the Munin root thread on that node. It acquires a copy of the object from the root node, maps it as read-only, and resumes the faulted thread. Similarly, when a remote worker thread first writes to the output matrix, a page fault occurs. A copy of that page of the output matrix is then obtained from the root node, and the copy is mapped as read-write. Since the output matrix is a result object, there may be multiple writers, and updates may be delayed. Thus, Munin makes a twin, and inserts the output matrix's object descriptor in the DUQ. When a worker thread completes its computation and performs a WaitAtBarrier() call, Munin flushes the DUQ. Since the output matrix is a result object, Munin sends the modifications only to the owner (the node where the root thread is executing), and invalidates the local copy.

Table 3 gives the execution times of both the Munin and the message passing implementations for multiplying two 400 × 400 matrices. The System time represents the time spent executing Munin code on the root node, while the User time is that spent executing user code. In all cases, the performance of the Munin version was within 10% of that of the hand-coded message passing version. Program execution times for representative numbers of processors are shown. The program behaved similarly for all numbers of processors from one to sixteen.

    # of Procs   DM Total   Munin Total   System    User    % Diff
         1        753.15        --           --       --      --
         2        378.74      382.15        1.21    376.24    0.9
         4        192.21      196.92        2.45    191.26    2.5
         8        101.57      105.73        5.82     97.31    4.1
        16         66.31       72.41        8.51     54.19    9.2

    Table 3: Performance of Matrix Multiply (sec.)

Since different portions of the output matrix are modified concurrently by different worker threads, there is false sharing of the output matrix. Munin's provision for multiple writers reduces the adverse effects of this false sharing. As a result, the data motion exhibited by the Munin version of Matrix Multiply is nearly identical to that exhibited by the message passing version. In the Munin version, after the workers have acquired their input data, they execute independently, without communication, as in the message passing version. Furthermore, the various parts of the output matrix are sent from the node where they are computed to the root node, again as in the message passing version. The only difference between the two versions is that in Munin the appropriate parts of the input matrices are paged in, while in the message passing version they are sent during initialization.

The additional overhead present in the Munin version comes from the page fault handling and the copying, encoding, and decoding of the output matrix. In a DSM system that does not support multiple writers to an object, portions of the output matrix could "ping-pong" between worker threads.

The performance of Matrix Multiply can be optimized by utilizing one of the performance optimizations discussed in Section 2.4. If Munin is told to treat the first input array as a single object rather than breaking it into smaller page-sized objects, via a call to SingleObject(), the entire input array is transmitted to each worker thread when the array is first accessed. Overhead is lowered by reducing the number of access misses. This improves the performance of the Munin version to within 2% of the hand-coded message passing version. Execution times reflecting this optimization are shown in Table 4.

    # of Procs   DM Total   Munin Total   System    User    % Diff
         1        753.15        --           --       --      --
         2        378.74      380.51        0.34    376.16    0.5
         4        192.21      194.27        0.57    190.15    1.1
         8        101.57      102.84        0.87     97.21    1.3
        16         66.31       67.21        1.26     54.18    1.4

    Table 4: Performance of Optimized Matrix Multiply (sec.)

4.2 Successive Over-Relaxation

SOR is used to model natural phenomena. An example of an SOR application is determining the temperature gradients over a square area, given the temperature values at the area boundaries. The basic SOR algorithm is iterative. The area is divided into a grid of points, represented by a matrix, at the desired level of granularity. Each matrix element corresponds to a grid point. During each iteration, each matrix element is updated to be the average of its four nearest neighbors. To avoid overwriting the old value of a matrix element before it is used, the program can either use a scratch array or only compute every other element per iteration (so-called "red-black" SOR). Both techniques work equally well under Munin. Our example employs the scratch array approach.

SOR is parallelized by dividing the area into sections and having a worker thread compute the values for each section. Newly computed values at the section boundaries must be exchanged with adjacent sections at the end of each iteration. This exchange engenders a producer-consumer relationship between grid points at the boundaries of adjacent sections.
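One iteration of the scratch-array variant can be sketched as follows. The loop structure follows the description above (average the four nearest neighbors into a scratch array, then copy back); the array names, the section bounds, the matrix dimension M, and the barrier arguments are assumptions.

    /* One SOR iteration for one worker's section, using the scratch-array
     * approach (names, bounds, and barrier arguments assumed). */
    void sor_iteration(int first_row, int last_row)
    {
        /* compute phase: average of the four nearest neighbors into scratch */
        for (int i = first_row; i <= last_row; i++)
            for (int j = 1; j < M - 1; j++)
                scratch[i][j] = (matrix[i-1][j] + matrix[i+1][j] +
                                 matrix[i][j-1] + matrix[i][j+1]) / 4;

        /* copy phase: copy the new values back into the shared matrix; writes
           to shared pages are caught once, twinned, and buffered in the DUQ */
        for (int i = first_row; i <= last_row; i++)
            for (int j = 1; j < M - 1; j++)
                matrix[i][j] = scratch[i][j];

        WaitAtBarrier(iteration_barrier, NUM_WORKERS + 1);  /* barrier name assumed */
    }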

In the Munin version of SOR, the user root thread creates a worker thread for each section. The matrix representing the grid is annotated as

    shared producer consumer int matrix[...]

The programmer is not required to specify the data partitioning to the runtime system. After each iteration, worker threads synchronize by waiting at a barrier. After all workers have completed all iterations, the program terminates. The Munin version of SOR did not utilize any of the optimizations described in Section 2.4.

A detailed analysis of the execution, which exemplifies how producer-consumer sharing is currently handled by Munin, follows:

- During the first compute phase, when the new average of the neighbors is computed in the scratch array, the nodes read-fault in copies of the pages of the matrix as needed.

- During the first copy phase, when the newly computed values are copied to the matrix, nodes write-fault, enqueue the appropriate pages on the DUQ, create twins of these pages, make the originals read-write, and resume.

- When the first copy phase ends and the worker thread waits at the barrier, the sharing relationships between producer and consumer are determined as described in Section 3.3. Afterwards, any pages that have an empty Copyset, and are therefore private, are made locally writable, their twins are deleted, and they do not generate further access faults. In our SOR example, all non-edge elements of each section are handled in this manner.

- Since the sharing relationships of producer-consumer objects are stable, after all subsequent copy phases, updates to shared portions of the matrix (the edge elements of each section) are propagated only to those nodes that require the updated data (those nodes handling adjacent sections). At each subsequent synchronization point, the update mechanism automatically combines the elements destined for the same node into a single message.

Table 5 gives the execution times of both the Munin and the message passing implementation of 100 iterations of SOR on a 512 × 512 matrix, for representative numbers of processors. In all cases, the performance of the Munin version was within 10% of that of the hand-coded message passing version. Again, the program behaved similarly for all numbers of processors from one to sixteen.

    # of Procs   DM Total   Munin Total   System    User    % Diff
         1        122.62        --           --       --      --
         2         63.68       66.87        2.27     61.83    5.5
         4         36.46       39.70        3.58     31.77    8.9
         8         26.72       28.26        5.17     16.57    5.8
        16         25.62       27.64        8.32      8.51    7.9

    Table 5: Performance of SOR (sec.)

Since the matrix elements that each thread accesses overlap with the elements that its neighboring threads access, the sharing is very fine-grained and there is considerable false sharing. After the first pass, which involves an extra phase to determine the sharing relationships, the data motion in the Munin version of SOR is essentially identical to the message passing implementation. The only extra overhead comes from the fault handling and from the copying, coding, and decoding of the shared portions of the matrix.

4.3 Effect of Multiple Protocols

We studied the importance of having multiple protocols by comparing the performance of the multi-protocol implementation with the performance of an implementation using only conventional or only write-shared objects. Conventional objects result in an ownership-based write-invalidate protocol being used, similar to the one implemented in Ivy [24]. We also chose write-shared because it supports multiple writers and fine-grained sharing.

The execution times for the unoptimized version of Matrix Multiply (see Table 4) and SOR, for the previous problem sizes and for 16 processors, are presented in Table 6.

    Protocol        Matrix Multiply    SOR
    Multiple             72.41        27.64
    Write-shared         75.59        64.48
    Conventional         75.85        67.64

    Table 6: Effect of Multiple Protocols (sec.)

For Matrix Multiply, the use of result and read-only sped up the time required to load the input matrices and later purge the output matrix back to the root node, and resulted in a 4.4% performance improvement over write-shared and a 4.8% performance improvement over conventional. For SOR, the use of producer-consumer reduced the consistency overhead by removing the phase in which sharing relationships are determined for all but the first iteration.

The resulting execution time was less than half that of the implementations using only conventional or write-shared. The execution time for SOR using write-shared can be improved by using a better algorithm for determining the Copyset (see Section 3.3).

4.4 Summary

For Matrix Multiply, after initialization, each worker thread transmits only a single result message back to the root node, which is the same communication pattern found in a hand-coded message passing version of the program. For SOR, there is only one message exchange between adjacent sections per iteration (after the first iteration), again, just as in the message passing version.

The common problem of false sharing of large objects (or pages), which can hamper the performance of DSM systems, is relatively benign under Munin because we do not enforce a single-writer restriction on objects that do not require it. Thus, intertwined access regions and non-page-aligned data are less of a problem in Munin than with other DSM systems. The overhead introduced by Munin in both Matrix Multiply and SOR, other than the determination of the sharing relationships after the first iteration of SOR, comes from the expense of encoding and decoding modified objects.

By adding only minor annotations to the shared memory programs, the resulting Munin programs executed almost as efficiently as the corresponding message passing versions. In fact, during our initial testing, the performance of the Munin programs was better than the performance of the message passing versions. Only after careful tuning of the message passing versions were we able to generate message passing programs that resulted in the performance data presented here. This anecdote emphasizes the difficulty of writing efficient message passing programs, and serves to emphasize the value of a DSM system like Munin.

5 Related Work

A number of software DSM systems have been developed [3, 8, 10, 17, 25, 27]. All, except Midway [8], use sequential consistency [22]. Munin's use of release consistency only requires consistency to be enforced at specific synchronization points, with the resulting reduction in latency and number of messages exchanged.

Ivy uses a single-writer, write-invalidate protocol, with virtual memory pages as the units of consistency [25]. The large size of the consistency unit makes the system prone to false sharing. In addition, the single-writer nature of the protocol can cause a "ping-pong" behavior between multiple writers of a shared page. It is then up to the programmer or the compiler to lay out the program data structures in the shared address space such that false sharing is reduced.

Clouds performs consistency management on a per-object basis, or in Clouds terminology, on a per-segment basis [27]. Clouds allows a segment to be locked by a processor, to avoid the "ping-pong" effects that may result from false sharing. Mirage also attempts to avoid these effects by locking a page with a certain processor for a certain Δ time window [17]. Munin's use of multiple-writer protocols avoids the adverse effects of false sharing, without introducing the delays caused by locking a segment to a processor.

Orca is also an object-oriented DSM system, but its consistency management is based on an efficient reliable ordered broadcast protocol [3]. For reasons of scalability, Munin does not rely on broadcast. In Orca, both invalidate and update protocols can be used. Munin also supports a wider variety of protocols.

Unlike the designs discussed above, in Amber the programmer is responsible for the distribution of data among processors [10]. The system does not attempt to automatically move or replicate data. Good speedups are reported for SOR running on Amber. Munin automates many aspects of data distribution, and still remains efficient by asking the programmer to specify the expected access patterns for shared data variables.

Linda provides a different abstraction for distributed memory programming: all shared variables reside in a tuple space, and the only operations allowed are atomic insertion, removal, and reading of objects from the tuple space [18]. Munin stays closer to the more familiar shared memory programming model, hopefully improving its acceptance with parallel programmers.

Midway [8] proposes a DSM system with entry consistency, a memory consistency model weaker than release consistency. The goal of Midway is to minimize communications costs by aggressively exploiting the relationship between shared variables and the synchronization objects that protect them.

Recently, designs for hardware distributed shared memory machines have been published [2, 23]. Our work is most related to the DASH project [23], from which we adapt the concept of release consistency. Unlike Munin, though, DASH uses a write-invalidate protocol for all consistency maintenance. Munin uses the flexibility of its software implementation to also attack the problem of read misses by allowing multiple writers to a single shared object and by using update protocols (producer-consumer, write-shared, result) and pre-invalidation (migratory) when appropriate. The APRIL machine takes a different approach in combatting the latency problem on distributed shared memory machines [2]. APRIL provides sequential consistency, but relies on extremely fast processor switching to overlap memory latency with computation.

A technique similar to the delayed update queue was used by the Myrias SPS multiprocessor [13]. It performed the copy-on-write and diff in hardware, but required a restricted form of parallelism to ensure correctness.

Munin's implementation of locks is similar to existing implementations on shared memory multiprocessors [20, 26].

An alternative approach for parallel processing on distributed memory machines is to have the compiler produce a message passing program starting from a sequential program, annotated by the programmer with data partitions [4, 30]. Given the static nature of compile-time analysis, these techniques appear to be restricted to numerical computations with statically defined shared memory access patterns.

6 Conclusions and Future Work

The objective of the Munin project is to build a DSM system in which shared memory parallel programs execute on a distributed memory machine and achieve good performance without the programmer having to make extensive modifications to the shared memory program. Munin's shared memory is different from "real" shared memory only in that it provides a release-consistent memory interface, and in that the shared variables are annotated with their expected access patterns. In the applications that we have programmed in Munin so far, the release-consistent memory interface has required no changes, while the annotations have proved to be only a minor chore. Munin programming has been easier than message passing programming. Nevertheless, we have achieved performance within 5-10 percent of message passing implementations of the same applications. We argue that this cost in performance is a small price to pay for the resulting reduction in program complexity.

Further work on Munin will continue to examine the tradeoff between performance and programming complexity. We are interested in examining whether memory consistency can be relaxed further, without necessitating more program modifications than release consistency. We are also considering more aggressive implementation techniques, such as the use of a pending updates queue to hold incoming updates, a dual to the delayed update queue already in use. We also wish to design higher-level interfaces to distributed shared memory in which the access patterns will be determined without user annotation. Another important issue is Munin's scalability in terms of processor speed, interconnect bandwidth, and number of processors. To explore this issue, we intend to implement Munin on suitable hardware platforms such as a Touchstone-class machine or a high-speed network of supercomputer workstations.

In this vein, we are also studying hardware support for selected features of Munin.

Acknowledgements

Dave Johnson provided important guidance during the prototype implementation and his feedback throughout the project was very useful. Peter Ostrin developed and timed the distributed memory programs reported in this paper.

The authors wish to thank the anonymous referees for their many helpful comments. Brian Bershad, Elmootazbellah Elnozahy, and Matt Zekauskas also provided many suggestions for improving the paper.

References

[1] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz. An evaluation of directory schemes for cache coherence. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 280-289, June 1988.

[2] Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 104-114, May 1990.

[3] Henri E. Bal and Andrew S. Tanenbaum. Distributed programming with shared data. In Proceedings of the IEEE CS 1988 International Conference on Computer Languages, pages 82-91, October 1988.

[4] Vasanth Balasundaram, Geoffrey Fox, Ken Kennedy, and Uli Kremer. An interactive environment for data partitioning and distribution. In Proceedings of the 5th Distributed Memory Computing Conference, Charleston, South Carolina, April 1990.

[5] John K. Bennett, John B. Carter, and Willy Zwaenepoel. Munin: Distributed shared memory based on type-specific memory coherence. In Proceedings of the 1990 Conference on the Principles and Practice of Parallel Programming, March 1990.

[6] John K. Bennett, John B. Carter, and Willy Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 125-134, May 1990.

[7] Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. PRESTO: A system for object-oriented parallel programming. Software: Practice and Experience, 18(8):713-732, August 1988.

[8] Brian N. Bershad and Matthew J. Zekauskas. Shared memory parallel programming with entry consistency for distributed memory multiprocessors. Technical Report CMU-CS-91-170, Carnegie-Mellon University, September 1991.

[9] Andrew Black, Norman Hutchinson, Eric Jul, and Henry Levy. Object structure in the Emerald system. In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages and Applications, pages 78-86, October 1986. Special Issue of SIGPLAN Notices, Volume 21, Number 11, November 1986.

[10] Jeffrey S. Chase, Franz G. Amador, Edward D. Lazowska, Henry M. Levy, and Richard J. Littlefield. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 147-158, December 1989.

[11] David R. Cheriton and Willy Zwaenepoel. The distributed V kernel and its performance for diskless workstations. In Proceedings of the 9th ACM Symposium on Operating Systems Principles, pages 129-140, October 1983.

[12] Intel Corporation. i860 64-bit Microprocessor Programmer's Manual. Santa Clara, California, 1990.

[13] Myrias Corporation. System Overview. Edmonton, Alberta, 1990.

[14] Susan J. Eggers and Randy H. Katz. A characterization of sharing in parallel programs and its application to coherency protocol evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 373-383, May 1988.

[15] Susan J. Eggers and Randy H. Katz. The effect of sharing on the cache and bus performance of parallel programs. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Systems, pages 257-270, April 1989.

[16] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Performance evaluations of memory consistency models for shared-memory multiprocessors. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Systems, April 1991.

[17] Brett D. Fleisch and Gerald J. Popek. Mirage: A coherent distributed shared memory design. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 211-223, December 1989.

[18] David Gelernter. Generative communication in Linda. ACM Transactions on Programming Languages and Systems, 7(1):80-112, January 1985.

[19] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.

[20] James R. Goodman, Mary K. Vernon, and Philip J. Woest. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Systems, pages 64-75, April 1989.

[21] Debra Hensgen, Raphael Finkel, and Udi Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, 17(1):1-17, January 1988.

[22] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.

[23] Dan Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148-159, May 1990.

[24] Kai Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Yale University, September 1986.

[25] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[26] John M. Mellor-Crummey and Michael L. Scott. Synchronization without contention. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Systems, pages 269-278, April 1991.

[27] Umakishore Ramachandran, Mustaque Ahamad, and M. Yousef A. Khalidi. Coherence of distributed shared memory: Unifying synchronization and data transfer. In Proceedings of the 1989 Conference on Parallel Processing, pages II-160 to II-169, June 1989.

[28] Richard Rashid, Avadis Tevanian, Jr., Michael Young, David Golub, Robert Baron, David Black, William Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. IEEE Transactions on Computers, 37(8):896-908, August 1988.

[29] Wolf-Dietrich Weber and Anoop Gupta. Analysis of cache invalidation patterns in multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Systems, pages 243-256, April 1989.

[30] Hans P. Zima, Heinz J. Bast, and Michael Gerndt. Superb: A tool for semi-automatic SIMD/MIMD parallelization. Parallel Computing, 6:1-18, 1988.