Making Distributed Shared Memory Simple, Yet Efficient

Mark Swanson

Leigh Stoller

John Carter

Computer Systems Laboratory

University of Utah

Abstract

Recent research on distributed shared memory (DSM) has focused on improving performance by reducing the communication overhead of DSM. Features added include lazy release consistency-based coherence protocols and new interfaces that give programmers the ability to hand-tune communication. These features have increased DSM performance at the expense of requiring increasingly complex DSM systems or increasingly cumbersome programming. They have also increased the computation overhead of DSM, which has partially offset the communication-related performance gains. We chose to implement a simple DSM system, Quarks, with an eye towards hiding most computation overhead while using a very low latency transport layer to reduce the effect of communication overhead. The resulting performance is comparable to that of far more complex DSM systems, such as TreadMarks and Cashmere.

1 Introduction

Distributed shared memory (DSM) is a software abstraction of shared memory on a distributed memory multiprocessor or workstation cluster. The first DSM system, Ivy [18], was fairly simple: data was transparently managed in page-grained chunks, normal data and synchronization were treated identically, and all data was kept sequentially consistent [16]. Although Ivy performed well on some applications, it performed poorly for programs with significant degrees of sharing, which limited its applicability. Since that time, the driving goal of DSM research has been to improve performance by any means necessary [1, 2, 4, 7, 9, 11, 13, 14, 21].

The performance of DSM systems is dominated by two issues: (i) the communication overhead required to maintain consistency and (ii) the computation overhead required to service remote requests and compute the information needed to maintain consistency. Over the years, there has been extensive research aimed at dramatically reducing the communication overhead required to keep DSM-managed data consistent. These efforts have included:

• exploiting object encapsulation to give programmers the ability to precisely specify the coherence unit [1, 2, 7, 9],

• allowing programmers to explicitly lock data to specific nodes and thereby control at a fine grain when data is communicated [9, 11],

• supporting multiple programmer-selected coherence protocols [4],

• supporting multiple concurrent writers to eliminate false sharing [4, 14], and

• using various forms of relaxed consistency models, thereby increasing the restrictions imposed on the programmer but increasing the flexibility of the underlying DSM system in when it must communicate [2, 4, 14].

These innovations improved DSM performance to varying degrees, sometimes to the point where software DSM systems can rival the performance of hardware DSM systems for moderately-grained applications [8]. However, these efforts to reduce communication overhead often increase the system complexity and computation overhead. Modern DSM systems have added other overheads as well. Many modern systems require programmers to write their programs in very constrained ways [1, 2, 7, 13], in effect making the DSM programmer perform much of the work required of message passing programmers in terms of specifying what data should be shipped when. Other DSM systems require complex runtime systems or large amounts of DRAM for storing DSM state [14, 20].

To attack communication overhead, we implemented a very low overhead communication subsystem using the Direct Deposit Protocol [23] developed as part of the Avalanche multiprocessor project. In this paper we explore the extent to which DSM performance can be improved by attacking the other source of DSM overhead: the computation overhead. The availability of a low latency, high bandwidth communication layer allowed us to discard the traditional assumption that the frequency and volume of communication is the overriding factor determining performance, to be avoided at the cost of additional complexity in the basic protocols. Thus, rather than make heroic efforts to reduce the number of DSM operations performed, we decided to keep the core DSM system simple but implement it in a way that hides most of the computation overhead. We did so by eliminating many high overhead system calls and overlapping computation and communication whenever possible. The result is the Quarks DSM system [15].

DSM computation overhead comes in many forms in different systems, including:

• page fault and signal handling,

• system call overheads to protect and unprotect memory,

• thread context switch overheads,

• the time spent handling incoming coherence requests instead of useful user computation,

• the time spent calculating what pieces of shared data were modified (so-called diff computation),

• copying data to/from communication buffers, and

• time spent blocked on synchronous I/O.

Many DSM systems have attacked individual components of this overhead. For example, CRL [13], Midway [2], and Shasta [20] eliminate memory protection and page fault handling overheads by detecting memory references via annotated load and store operations. Unfortunately, these efforts turned out to increase system overhead [20] or add cumbersome programming requirements [2, 13].

Among the optimizations we applied to Quarks to reduce the impact of DSM computation overhead are the following:

• We employed a release-consistent write-update protocol, similar to that employed by Munin [4], which allowed diff creation to be performed in parallel with communication.

• Local diffs are applied to the clean twins as they are computed, rather than copying a new twin on the next write to a page.

• Incoming messages may be handled asynchronously, when the local node is idle or at a convenient blocking point, rather than invoking a high overhead signal.

• Outgoing messages are sent using non-blocking I/O operations so that we can overlap diff computation with the transmission of diffs.

• We avoided using multithreading within the Quarks runtime system to support DSM operations, because we found that the impact of context switching exceeded the benefits provided.

• Twin creation and memory protection-related system calls are performed while other pages are being requested (i.e., we use split-phase RPC to do useful work while loading a remote page).

• Diffs are computed and sent before acquiring locks when the local node is not the current owner (i.e., we overlap synchronization and DSM overhead).

The rest of this paper describes the detailed implementation of Quarks and the DDP communication layer (Section 2), discusses Quarks' performance (Section 3), compares our work with related systems (Section 4), and draws conclusions (Section 5).

2 Design of Quarks

A software DSM system must provide a small number of basic mechanisms: (i) shared memory creation and naming, (ii) shared memory consistency maintenance, and (iii) synchronization. Quarks' basic abstraction is that of shared regions, which are variable-length, page-aligned byte ranges. Quarks supports two consistency protocols: a release-consistent write update protocol and a sequentially-consistent write invalidate protocol. For synchronization, Quarks supports both locks and barriers, where locks are distributed while barriers are centralized. Many of Quarks' core mechanisms are derived from its predecessor, Munin [4].

Conventional wisdom holds that minimizing communication overhead is the key factor required to make DSM systems efficient. This has led to a wide variety of optimizations that reduce the number of messages required to maintain consistency, as described in Section 1. Recently, DSM systems have emerged that exploit highly specialized communication hardware to attack the problem of communication overhead head-on [17, 21]. In these systems, low message latency and high bandwidth communication translates directly into improved performance. Along these lines, when designing Quarks we exploited a very low latency messaging layer called Direct Deposit (DD) that was developed as part of the Avalanche multiprocessor project. The details of our implementation of DD can be found in Section 2.1. Although the results presented in Section 3 assume a DD communication layer with limited hardware support, we expect that the use of other modern lightweight protocols, such as Fast Messages [19], and commodity high-speed interconnects, such as Myrinet [3], would produce similar results.

The availability of a low latency, high bandwidth communication layer led us to focus on building simple, easy to implement, computationally cheap, and portable mechanisms wherever possible, even if they required additional messages. The only times messaging determined the design was when it affected synchronization between processes. In Section 2.2 we discuss Quarks' design, followed by a description of the computation-reducing optimizations that we explored, both successful (Section 2.3) and unsuccessful (Section 2.4).

2.1 Direct Deposit Based Communication

The Quarks implementation described herein uses the Direct Deposit Protocol [23] as its transport layer. DD is an instance of a sender-based protocol [24] (SBP). The key concept of SBPs is a connection-based mechanism that enables the sender to manage a reserved receive buffer within the receiving process' address space, obtained when the connection is established. The sender directs placement of messages within that buffer via an offset carried within the message header. Coupled with an appropriate network interface, an SBP can achieve 0-copy message reception directly to the receiver's virtual address space.

DD uses a system call-based interface for sends to provide safety and flexibility at modest overhead and latency costs. The semantics of DD allow for asynchronous sends, i.e., the send call can simply request transmission of the message and return immediately. Given a network interface with DMA capability, the transmission actually occurs in parallel with continued computation within the user process. Quarks makes heavy use of this.

DD supports completely user mode message reception. On message arrival, a notification object, or note, is written into a circular queue specified by the connection. This queue can be in kernel or user memory, and can be shared by several connections within a single process, at the discretion of the user. A user level receive consists simply of checking the valid flag of the next note. Polling for incoming messages is thus an extremely inexpensive operation.

When a note's valid bit is set, the user extracts the message address, size, and connection identifier from the note. As the message buffers can be in user virtual memory, this yields a completely user-mode receive capability. A system call variant of receive is provided to allow a process to sleep and wait for the arrival of a message notification on a specified queue.

DD is sufficiently integrated into the operating system that one can optionally request that a process be signalled when an incoming message is available. Quarks makes use of this to provide asynchronous handling of incoming requests, an alternative we chose to provide to explicit polling within the application code. With DD as its transport, Quarks has a very inexpensive polling capability. The choice of polling versus asynchronous, signal-driven message handling is left to the user; the best choice is application-dependent.
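To make the receive path concrete, the following C fragment sketches a user-level poll over a DD-style note queue. The structure and field names (note_t, valid, and so on) are our own illustrations; the paper does not specify DD's actual interface.

```c
/* Sketch of user-level message reception over a DD-style note queue.
 * All names (note_t, note_queue_t, etc.) are hypothetical. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    volatile uint32_t valid;  /* set by the interface on message arrival */
    uint32_t conn_id;         /* connection the message arrived on */
    void    *msg_addr;        /* where the sender deposited the message */
    size_t   msg_len;
} note_t;

typedef struct {
    note_t  *notes;           /* circular queue, shared with the interface */
    unsigned head;            /* next note to examine */
    unsigned size;            /* queue length */
} note_queue_t;

/* Non-blocking poll: returns the next note if one has arrived, else NULL.
 * This is the "extremely inexpensive" receive: a single flag check. */
static note_t *poll_note(note_queue_t *q)
{
    note_t *n = &q->notes[q->head];
    if (!n->valid)
        return NULL;          /* nothing has arrived */
    n->valid = 0;             /* consume the note */
    q->head = (q->head + 1) % q->size;
    return n;                 /* caller reads msg_addr/msg_len/conn_id */
}
```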

2.2 Consistency Management

In Quarks, consistency protocols are assigned on a per-region basis. Currently two protocols are supported: sequentially consistent write invalidate and release consistent write update. As in most DSM systems, the Unix mprotect() system call is used to modify the virtual memory protection of pages so as to detect accesses to invalid (not locally mapped) pages and writes to read-only pages. The implementations of Quarks' two consistency protocols are detailed in this section.
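As a rough illustration of this mechanism, the sketch below installs a SIGSEGV handler and dispatches faults to protocol handlers. The handler names and the page-state test are hypothetical placeholders, not Quarks' actual code.

```c
/* Minimal sketch of page-fault-driven access detection, the mechanism
 * both Quarks protocols rely on. handle_read_miss()/handle_write_miss()
 * stand in for the protocol-specific handlers. */
#include <signal.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define PAGE_BASE(a) ((void *)((uintptr_t)(a) & ~(uintptr_t)(PAGE_SIZE - 1)))

extern int  page_is_unmapped(void *page);   /* hypothetical state table */
extern void handle_read_miss(void *page);   /* fetch page, map read-only */
extern void handle_write_miss(void *page);  /* make twin, map writable */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = PAGE_BASE(si->si_addr);
    /* A real system decodes the faulting instruction (or tracks page
     * state) to tell reads from writes; here we assume a state table. */
    if (page_is_unmapped(page))
        handle_read_miss(page);
    else
        handle_write_miss(page);
    /* returning from the handler re-executes the faulting instruction */
}

void install_dsm_fault_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```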

Sequentially Consistent Write Invalidate

Quarks' write invalidate protocol is the same basic single-writer protocol that originated in Ivy [18]. Any number of nodes can have read-only copies of a given page. When any node attempts to modify such a shared region, the attempted store instruction will trigger a page fault, which is caught by the Quarks runtime system. In response, the runtime system will acquire an exclusive copy of the region, invalidate all other shared copies on other nodes, remap the region to read-write mode, and re-execute the failed store instruction. Reads from regions that have never been loaded or have been invalidated also cause page faults. In response to these faults, the Quarks runtime system locates a copy of the region, loads it, maps it read-only, and (if necessary) remaps any exclusive (read-write) copies that exist on other nodes before re-executing the failed load.

Release Consistent Write Update

Quarks' write update protocol was derived from Munin, and supports multiple concurrent writers to different portions of a shared region. It exploits the release consistency memory model [12], which (informally) allows processors to independently modify regions of shared data between synchronization points. When a process performs a release operation, it is required to ensure that any changes to shared data that have occurred on this node are flushed to remote nodes before they can perform the corresponding acquire¹. To avoid such complexities as vector timestamps, indefinite buffering of changes, and distributed garbage collection [14], Quarks uses an eager implementation of release consistency, where changes to shared data are flushed to remote nodes immediately when a release operation is performed.

This design requires that Quarks track the specific bytes within each shared page that each writer modifies between flushes; writes to exclusive pages are not tracked. It does so using a delayed update queue and a simple diffing mechanism, as illustrated in Figures 1 and 2. Whenever a process writes to a read-only page, the page is added to a delayed update queue (DUQ), its access is changed to writable, and a twin (copy) of the page is made. When a process reaches a release point, all modifications to shared pages are made visible to other processes by the purge DUQ routine. This is done by iterating over the pages contained on the DUQ, computing diffs and sending updates to other processes that share each page. Diffs are computed by comparing the current contents of the page to the contents of the twin made before the page was first modified.

¹ For Quarks, lock acquisition and release are acquire and release operations for the sake of release consistency, while barrier waits are effectively a release-acquire pair.

[Figure 1: Tracking Changes Using the DUQ. A write to page X copies X to a twin, makes the original writable, and places X on the delayed update queue.]

[Figure 2: Sending Out Diffs. The page is compared against its twin and the changes are encoded as a diff, which is sent to update the replicas; the page is then write-protected again if it is replicated.]

The resulting update message is sent to each member of the page's copyset, the set of other nodes that a node believes have a copy of a given page. On any given node, copysets may not be complete. However, the owner process always has a complete copyset and is always part of the copyset. The header of an update message contains the sender's idea of the current copyset. To ensure that updates are propagated to all copyholders, the owner does three things when it receives an update:

1. It forwards the update to nodes that are in the actual copyset, but which are not contained in the sender's idea of the copyset.

2. It returns the correct copyset to the sender in its acknowledgement.

3. It returns a count to the sender of the number of nodes to which it has forwarded the update.

As each node must know when its updates have been observed, it keeps a count of the number sent plus the number forwarded by owners on its behalf. The purge DUQ routine does not return until it has received a number of acknowledgements equal to that count.
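To make the diffing step concrete, here is a minimal sketch of diff creation, assuming an illustrative (offset, length, bytes) run encoding and a hypothetical emit_run() helper; the paper does not give Quarks' actual diff format.

```c
/* Sketch of diff creation: compare a dirty page against its twin and
 * encode modified ranges as (offset, length, bytes) runs. The run
 * format and emit_run() are illustrative, not Quarks' actual code. */
#include <stddef.h>

#define PAGE_SIZE 4096

/* Appends one modified run to the outgoing update message. */
extern void emit_run(void *msg, size_t off, size_t len, const char *data);

size_t make_diff(const char *page, const char *twin, void *msg)
{
    size_t i = 0, nruns = 0;
    while (i < PAGE_SIZE) {
        if (page[i] == twin[i]) { i++; continue; }
        size_t start = i;                       /* first modified byte */
        while (i < PAGE_SIZE && page[i] != twin[i])
            i++;
        emit_run(msg, start, i - start, page + start);
        nruns++;
    }
    return nruns;
}
```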

2.3 Useful Optimizations

Hiding Page Acquisition Stalls

Whenever a process accesses shared memory that is not valid, it cannot proceed until the page is supplied by the owner. However, the time spent waiting for the page need not be wasted. In Quarks, a number of operations can be, and are, performed after the page request is sent:

• The protection of the target page is changed to writable, on the assumption that the returned page will need to be written into.

• If the failing access was a write, then a twin is allocated and made writable, and the page is added to the delayed update queue.

When the page arrives, it is copied into the target page. It is also copied into the twin if the failing access was a write. If the access was a read, the target page is then downgraded to read-only.

On the owner side, the node requesting the page is added to the page's copyset. The owner then sends two reply messages to the requester: a header message containing the current copyset of the page and a second containing the data. Two messages are sent to avoid copying the page data into a buffer with the header. With the DD transport, it is much cheaper to send an additional message than to copy the page, both in the raw latency seen by the requestor and in the overhead experienced by the owner. Since the only copying required is done by the network interface's DMA engine, the owner spends only a minimal amount of time servicing a page request. The impact of servicing remote page requests is a major factor in system performance, because it often interrupts useful work, so keeping it small can dramatically improve performance.
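The split-phase structure can be sketched as follows, with hypothetical helper names; the point is that the twin allocation and protection changes happen while the reply is in flight.

```c
/* Sketch of hiding page-acquisition stalls: the request goes out first,
 * and the local bookkeeping (protection change, twin creation) is done
 * while the reply is in flight. All helper names are hypothetical. */
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

extern void send_page_request(void *page);
extern void wait_for_page(void *page, void *buf);  /* blocks or polls */
extern void add_to_duq(void *page);
extern void record_twin(void *page, char *twin);   /* kept for later diffing */

void fetch_page(void *page, int write_fault)
{
    char reply[PAGE_SIZE];
    char *twin = NULL;

    send_page_request(page);                   /* overlap begins */

    /* Work done while the owner services the request: */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    if (write_fault) {
        twin = malloc(PAGE_SIZE);
        record_twin(page, twin);
        add_to_duq(page);
    }

    wait_for_page(page, reply);                /* overlap ends */
    memcpy(page, reply, PAGE_SIZE);
    if (write_fault)
        memcpy(twin, reply, PAGE_SIZE);        /* twin starts out clean */
    else
        mprotect(page, PAGE_SIZE, PROT_READ);  /* downgrade for reads */
}
```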

Hiding Diff Creation

Calculating diffs is one of the most time consuming activities in Quarks. Fortunately, it is one activity that can frequently be overlapped with communication. Purge DUQ will send out as many updates as it can, overlapping their calculation with their actual transmission. As each page in the DUQ is processed, it is removed from the DUQ and its protection is changed to read-only, to facilitate capturing future changes to the page. In practice, the protection changes are all performed together, interleaved with the reception of acknowledgements, at the end of purge DUQ (see the sketch after this list), for two reasons:

1. Other nodes may concurrently send updates for the same pages; it would then be necessary to change the page protection to write and back to read-only again for any such updates.

2. Doing so allows the process to get its updates out to other processes faster, so they can acknowledge them that much sooner.
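Putting these pieces together, purge DUQ's pipelining might look like the following sketch (rendered purge_duq; helper names and limits are hypothetical), reusing the make_diff() sketch from Section 2.2.

```c
/* Sketch of purge_DUQ's pipelining: the diff for page i+1 is computed
 * while the update for page i is still being transmitted, and the
 * protection downgrades are batched at the end, interleaved with ack
 * reception. All helper names and limits are hypothetical. */
#include <sys/mman.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define MAX_DUQ   64             /* illustrative bound on DUQ length */

extern int    duq_length(void);
extern void  *duq_remove_head(void);
extern char  *twin_of(void *page);
extern void  *alloc_msg_buffer(void);   /* stays live until transmitted */
extern size_t make_diff(const char *page, const char *twin, void *msg);
extern int    send_update_async(void *page, void *msg); /* acks expected */
extern void   drain_acks_and_poll(int expected); /* also services requests */

void purge_duq(void)
{
    void *pages[MAX_DUQ];
    int   npages = 0, expected_acks = 0;

    while (duq_length() > 0) {
        void *page = duq_remove_head();
        void *msg  = alloc_msg_buffer();
        /* diff computation overlaps the previous non-blocking sends */
        make_diff(page, twin_of(page), msg);
        expected_acks += send_update_async(page, msg);
        pages[npages++] = page;
    }

    /* Downgrade protections together at the end (see reasons above). */
    for (int i = 0; i < npages; i++)
        mprotect(pages[i], PAGE_SIZE, PROT_READ);

    drain_acks_and_poll(expected_acks);
}
```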

No-Copy Twin Creation

We found that twin creation, which involves a 4-kilobyte bcopy(), adds significant overhead. This same observation is what drove Midway to detect writes explicitly rather than via diffs [2]. We reduced the impact of twin creation in the common case as follows. Originally, whenever a clean page was dirtied, a twin was saved to facilitate diff creation at the next release point. When a page's diffs are propagated, its twin becomes stale. As an optimization, we update the twins to match the base page as part of the diff creation process. The benefits of this optimization are two-fold: (i) updating the twin occurs while both the real page and the twin are in the cache as a result of the diff calculation and thus is essentially free, and (ii) only those bytes that have actually changed need be written to the twin, rather than the entire 4 kilobytes.

Two factors might make this optimization less useful: (i) the page may not be modified again, so no twin would have been required, and (ii) updates for the page from other processes must be applied to both the page and the twin to keep them in sync. Depending on the number and size of these other updates, the simple copying approach might have been cheaper. Our experience showed that, while eager updating of twins sometimes lengthens the update propagation time, it has overall positive effects on runtime.
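In code, the no-copy twin refresh is a one-line change to the diff scan sketched in Section 2.2: changed bytes are copied back into the twin as they are encoded (again, a sketch rather than Quarks' actual code).

```c
/* make_diff variant that refreshes the twin during the scan: changed
 * bytes are written back to the twin as they are encoded, so the twin
 * is clean again when the scan finishes. Names as in the earlier sketch. */
#include <stddef.h>

#define PAGE_SIZE 4096

extern void emit_run(void *msg, size_t off, size_t len, const char *data);

size_t make_diff_refresh_twin(const char *page, char *twin, void *msg)
{
    size_t i = 0, nruns = 0;
    while (i < PAGE_SIZE) {
        if (page[i] == twin[i]) { i++; continue; }
        size_t start = i;
        while (i < PAGE_SIZE && page[i] != twin[i]) {
            twin[i] = page[i];   /* bring the twin back in sync, byte by byte */
            i++;
        }
        emit_run(msg, start, i - start, page + start);
        nruns++;
    }
    return nruns;
}
```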

Reducing Synchronization Stalls

When a lock acquire() occurs, the lock may either already reside in the locking process, or it may reside in another process. In the latter case, the DUQ is processed before the lock is acquired, even though release consistency only requires that changes to shared data be flushed before a lock is released. This optimization results in two update phases, one prior to acquiring the lock and one prior to releasing the lock, which might seem counterproductive at first glance. The logic behind the optimization is as follows. If many processes are contending for a lock, processes will spend significant time blocked prior to successfully acquiring the lock. In this case, it is better to perform most of the updates before blocking on the lock than it is to wait until the lock is released, because the flushing process would not be able to make useful progress anyway. Waiting to flush all data until the release point can serialize the update activity, which increases the time processes spend holding exclusive locks, since the DUQ must be processed prior to releasing the lock. This effect reduces performance. In the case where the lock is not heavily contended for, our approach has little or no negative impact.
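A minimal sketch of the resulting acquire and release paths, with hypothetical names:

```c
/* Sketch of flushing updates before a remote lock acquire: the DUQ is
 * purged while the process would otherwise just be blocked waiting for
 * the lock, leaving only a small residual flush at release time.
 * All names are hypothetical. */
extern int  lock_is_local(int lock_id);
extern void request_lock(int lock_id);   /* blocks until granted */
extern void purge_duq(void);

void quarks_lock_acquire(int lock_id)
{
    if (!lock_is_local(lock_id)) {
        purge_duq();          /* flush updates before blocking */
        request_lock(lock_id);
    }
    /* lock now held locally */
}

void quarks_lock_release(int lock_id)
{
    purge_duq();              /* release consistency requires this flush */
    /* ... pass lock ownership to the next waiter ... */
}
```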

Sending Uninitialized Data

We optimize the case of accessing uninitialized data. If a page has not been initialized by the owner when a page request is received, the owner simply sets a special flag in the response message to indicate a zero page, and no data message is sent. In this case, ownership is passed to the node performing the page request.
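Combining this with the two-message reply described earlier, the owner's request service can be sketched as follows (all names and message fields are hypothetical):

```c
/* Sketch of the owner's page-request service with the zero-page
 * shortcut: uninitialized pages transfer ownership with a flag instead
 * of a 4 KB data message. Names are hypothetical. */
extern int  page_initialized(void *page);
extern void add_to_copyset(void *page, int requester);
extern void send_header(int requester, void *page, int zero_flag);
extern void send_data(int requester, void *page);   /* DMA, no copy */
extern void pass_ownership(void *page, int requester);

void service_page_request(void *page, int requester)
{
    add_to_copyset(page, requester);
    if (!page_initialized(page)) {
        send_header(requester, page, 1);  /* zero page: no data message */
        pass_ownership(page, requester);
    } else {
        send_header(requester, page, 0);  /* header carries the copyset */
        send_data(requester, page);       /* second message: the data   */
    }
}
```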

2.4 Failed Optimizations

Two possible optimizations were attempted and subsequently rejected: (i) early propagation of updates and (ii) self-invalidation of unused pages. This should not be viewed as a general result showing them to be of no use, but rather as a caution that they do not represent low-hanging fruit on the optimization tree.

Early Propagation of Updates

One problem with release consistent protocols is the potentially large number of update messages that each process must send at a release point. This can result in increased synchronization times or network congestion as many or all nodes simultaneously flood the network with update messages. We attempted to minimize these effects by setting a target length for the DUQ. If adding a new page to the DUQ would cause it to exceed this target length, the oldest entry would be removed and processed first.

We experienced two problems with this attempted optimization. First, for certain programs, the modification lifetimes of all the shared pages being modified overlapped, i.e., all of the modified pages had to be writable at the same time. A fix was added to allow such programs to avoid the livelock situation that occurred. The fix effectively turned the optimization off (dynamically) once the behavior was recognized.

The second problem was that the optimization slowed programs down by interleaving the propagation of updates with the real work of the program. Update propagation overhead that might have been at least partially hidden in otherwise idle time near a synchronization point was no longer hidden. In general, any DSM operation that interferes with useful user computation directly impacts performance and should be avoided.

Self-invalidation of Unused Pages

In self-invalidation [4], a process decides that certain shared pages are no longer of interest, and it unilaterally drops them from its working set. The advantage in doing this is that updates for those pages no longer need to be applied by the process, saving it time. Other processes, when they find out about the diminished copyset, can save time by not having to send an update to the self-invalidated process, nor wait for an acknowledgement from it.

The drawbacks to this optimization are the cost of its implementation and the possibility of decreasing performance by self-invalidating pages that are subsequently referenced. The mechanism for selecting a page for self-invalidation is to remove read access to it and wait to see (by the occurrence of a read fault) if it is referenced. Together, the removal of read access and the handling of a signal, including unprotecting the page, represent a heavy cost for pages that are still part of the active working set. There is also a significant cost to any false positives identified and then self-invalidated. The subsequent access, which defines the case as a false positive, causes a page reload since the process no longer has a valid copy of the data.

We used self-invalidation to support a competitive update scheme [4, 10]. However, we were unable to achieve any performance improvement using this method over experiments using a variety of threshold values.

3 Experimental Evaluation

In this section, we evaluate the performance of the software DSM system we have developed. First, the experimental setup is described. Next, Quarks is characterized in terms of basic operation costs. Finally, results for the benchmark programs are presented.

3.1 Experimental Setup

The architectural support for DD is still under construction by the Avalanche project, so development and evaluation of Quarks was performed using that project's simulation tools.

The simulation results reported here were obtained from execution-driven simulations using the Paint simulator [22]. Paint simulates the HP PA-RISC 1.1 architecture and includes an instruction set interpreter and detailed cycle-level models of a first level cache, system bus, and memory system similar to those found in HP J-class systems. In addition, it contains a detailed cycle-level model of the network interface and a simple Myrinet simulator that models contention only at node inputs. For these particular experiments, a simple processor is modeled at 120 MHz, with non-blocking loads and stores. Most instructions (except floating point) take one cycle. The cache is direct mapped, virtually indexed, 256 KB with 32 byte lines. It provides single-cycle loads and stores, a one-cycle load-use penalty, and data streaming. The Runway bus is also modeled at 120 MHz. The memory system contains 4 banks and a 17 entry write buffer. The Myrinet is modeled at 160 MHz, giving nearly 160 MB/second bandwidth; fall-through time for the single switch is 27 processor cycles (approximately 200 ns), and propagation delays are 1 cycle on each side of the switch. In the experiments, cache misses to main memory had an average latency of 30 cycles. The simulation environment includes a small BSD-based kernel, which provides device drivers and system calls to set up connections, perform message send operations, support virtual memory (e.g., mprotect()), and handle signals.

3.2 Results

3.2.1 Basic Operations

To understand whole program performance, it is first necessary to know the cost of the basic operations underlying that performance. Figure 3 reports those costs for Quarks. Each time is the average obtained over several hundred iterations, so cold miss cache effects should be minimal. Page copy is an optimized version that assumes proper alignment of the data, which is the case in Quarks. The times for signal delivery and return from a signal handler are simply that: entry and exit costs. Any work performed to actually deal with the signal entails additional overhead.

    Operation          Cycles    µsecs
    Signal Delivery     1997     16.6
    Signal Return       1474     12.3
    Page Protect         941      7.8
    Page Zero           1329     11.1
    Page Copy           2576     21.5

Figure 3: Cost of basic OS operations.

It is interesting to compare those times with the basic messaging overheads and latencies given in Figure 4. The first two rows of Figure 4 represent the directly observable overhead. The next two rows present latency; a process with other work to do can overlap most of this time with useful work.

    Operation              Cycles    µsecs
    Poll/Receive               70     0.6
    Msg Transmit              160     1.33
    1-way latency, 64 B       508     4.23
    1-way latency, 4 KB      4160    34.66

Figure 4: Overheads and Latencies of Basic Messaging Operations.

The contribution of signal handling and data moving operations is easily equivalent to message latency. The effect of these same operations on overhead is an order of magnitude greater than that of messaging. This supports our claim that it can be appropriate to trade more messages for reduced complexity and reduced overhead.

Figure 5 reports times for a variety of basic Quarks costs. Once again, each time is the average obtained over several hundred iterations, so cold miss cache effects should be minimal. Barrier time is for 2 and 16 nodes; the latter is in parentheses. Getpage times are for non-zero pages; "R" means the page is initially in read mode, and "W" means it is in write mode and a twin is made immediately. The getpage times with "Z" are for all-zero (not yet initialized) pages. Note that the bulk of the time in getpage is spent in local, non-message related activities: signal entry/exit, copying the page to its final destination, and changing the protection of the page. Also note the seemingly anomalous larger cost of getting a read-only page.

    Operation        Cycles          µsecs
    Lock Acquire      1328           11.1
    Barrier           2007 (13726)   16.7 (114.4)
    Getpage (R)      16317          136.0
    Getpage (W)      12775          106.5
    Getpage (Z/R)     7152           59.6
    Getpage (Z/W)     8003           66.7

Figure 5: Cost of Basic Quarks Operations.

    Program   Problem Size   Total (ms)   Compute (ms)
    Radix     1024K keys        3368          1261
    FFT       256K points       5737          3564
    LU        1024x1024        42672         35313
    Water     1331 mols        34621         34411
    SOR       4096x4096        10129         10125
    Gauss     768x768          20233         19579

Figure 6: Data set sizes and sequential execution times.

This results from two factors. First, there is less work (relative to the "W" case) to overlap with the RPC to get the page from the owner, so the receive operation actually blocks in the kernel. Second, read-only pages suffer a final setting of the page protection to read-only, which is not required for a writable page and which cannot be hidden.

3.2.2 Whole Program Results

In this section we report results using six programs. Three (FFT, LU, and Water) are from the SPLASH-2 suite [25]. SOR and Gauss are from a study by Chandra et al. [6]. Radix is from the SPLASH-1 suite. Figure 6 gives the problem size for each application and the runtime of the uniprocessor version.

Figure 7 displays the speedups in the parallel runtime phases for the six benchmarks. Note that the scale of the x axis (node count) is not linear.

[Figure 7: Speedup curves from the parallel phase: compute-time speedup (0 to 16) versus number of nodes (1, 4, 8, 16) for Radix, FFT, LU, Water, SOR, and Gauss.]

The total times include all initialization of data structures, which is generally performed by the master process and thus represents a serial computation.

In considering performance, we note first off that Radix performs dismally for all configurations, due to its pathological sharing patterns. All processes write to all shared pages, resulting in all-to-all update propagation of a large number of pages at each synchronization. We include it only for completeness and as an indication of robustness under near-worst-case sharing behavior.

On the other end of the spectrum is SOR, which only uses shared data for distributed computation of a global norm (basically a reduction operation). As a result, the near-linear speedup even out to 16 nodes is not surprising.

LU also does quite well, with a speedup of 9.5 at 16 nodes. The sharing pattern in LU is such that each portion of the array is only written once, and then only by one process. More importantly, the writes occur before there are other sharers of the row, so almost no updates need be propagated. LU does represent a good test of getpage performance, since processes continually read additional pages throughout the computation.

Gauss displays a speedup of 8 at 16 nodes. It has two primary uses of write-shared data. First, it performs a distributed reduction operation to select the next pivot row. Second, it uses shared memory to pass the selected row between the owner and the other processes. This approach is used, rather than having the other processes access the row directly, so that rows do not become widely shared, which would induce significant amounts of useless update traffic. Such sharing is useless since, other than the step at which a row is the pivot, only its owner needs to access it, i.e., sharing is one-time only.

FFT performance, like that of LU, is primarily determined by getpage performance, for similar reasons. The speedups tail off quickly, which we attribute primarily to the small size of the problem that we were able to simulate.

Water achieves a speedup of about 8 at 16 nodes. It is by far the most challenging of the programs (other than Radix), as measured by the amount of actively shared data. The run shown performs 16 synchronizations, sending out an average of more than 6000 updates per node (in a 16 node run).

In summary, the performance of Quarks is very good for one program (SOR) and reasonable for a number of others that vary in the emphasis of their sharing behavior on getpage (producer-consumer) or active write sharing. We expect that speedup would scale better with problem sizes that are currently beyond our ability to simulate. It is interesting to note that aggressive, complex optimization of updates would do little to help the getpage-dependent applications, which are sensitive to reaction time on the part of the page's owner. A fine-grained implicit polling mechanism backed by a low cost polling primitive such as Direct Deposit's would be much more useful.

4 Related Work

This section compares our work with a number of existing software DSM systems, focusing on the mechanisms used by these other systems to reduce the amount of communication necessary to provide shared memory. We limit our discussion to those systems that are most related to the work presented in this paper.

    Program   1 Node   4 Nodes   8 Nodes   16 Nodes
    Radix       1261      917       873       1539
    FFT         3564     1466       756        614
    LU         35313    10056      5760       3690
    Water      34411    11656      7095       4128
    SOR        10125     2732      1350        716
    Gauss      19579     6637      3210       1964

Figure 8: Parallel phase compute times for 1, 4, 8, and 16 node runs. Times are in milliseconds.

Ivy was the first software DSM system [18]. It uses a single-writer, write-invalidate protocol for all data, with virtual memory pages as the units of consistency. The large size of the consistency unit and the single-writer protocol make Ivy prone to large amounts of communication due to false sharing.

Lazy release consistency [14], as used in TreadMarks [8], is an algorithm for implementing release consistency different from the one presented in this paper. Instead of updating every cached copy of a data item whenever the modifying thread performs a release operation, only the cached copies on the processor that next acquires the released lock are updated. Lazy release consistency reduces the number of messages required to maintain consistency, but the implementation is more expensive in terms of computation and memory overhead [14].

Midway [2] proposes a DSM system with entry consistency, a memory consistency model weaker than release consistency. The goal of Midway is to minimize communication costs by aggressively exploiting the relationship between shared variables and the synchronization objects that protect them. Entry consistency only guarantees the consistency of a data item when the lock associated with it is acquired. To exploit the power of entry consistency, the programmer must associate each individual unit of shared data with a single lock. For some programs, making this association is easy. However, for programs that use nested data structures or arrays, it is not clear if making a one-to-one association is feasible without forcing programmers to completely rewrite their programs.

Shasta [20] uses binary rewriting techniques to support the transparent interposition of software DSM on unmodified binaries. Load and store instructions are instrumented to test whether the shared data being accessed is locally present and, for store instructions, writable. Shasta was able to support a version of Oracle Parallel Server on a workstation cluster, but its performance is currently poor due to the high overhead associated with checking every load and store to potentially shared data.

Cashmere [21] exploits DEC's Memory Channel hardware, which supports very high performance writes to remote memory, to reduce communication overhead, avoid the need for global directory locking, and avoid expensive TLB shootdowns. The resulting performance for Cashmere was quite good, ranging from 8 to 32 on 32 nodes, which emphasizes the importance of exploiting low latency networking technology whenever possible.

5 Conclusions

In this paper we have explored the notion that DSM systems can be made efficient without requiring heroic efforts to reduce the amount of communication induced by coherence management. We described the Quarks DSM system [15], which lacks many of the complex communication-reducing features of other modern DSM systems (e.g., lazy release consistency-based coherence protocols, cumbersome user-directed coherence, strict object semantics, etc.). Despite, or perhaps because of, its simplicity, Quarks performs comparably to these more sophisticated systems.

The source of Quarks' high performance is our attention to attacking the computation overhead of DSM. The time spent performing operations such as page fault handling, diff creation, synchronous I/O, memory protection manipulation, and distributed garbage collection can seriously impact the rate at which useful user computation is performed. By carefully implementing Quarks in a way that avoids many of these sources of DSM overhead, and overlapping much of the rest with communication or synchronization stalls, we achieve high performance. As network performance continues to increase in dramatic fashion, we believe that attention to DSM computation overhead will become increasingly important.

As a side effect of its simplicity, Quarks is highly portable. It currently runs on five platforms (HP BSD, SunOS, Solaris, Linux, and DEC Unix), and ports are underway to other platforms. Source to Quarks can be obtained from http://www.cs.utah.edu/projects/quarks.

In the future, we plan to address three questions:

1. What communication optimizations continue to make sense in an environment with rapidly improving network performance? Although we have demonstrated that attacking communication overhead is only half the battle when implementing a high performance DSM system, we believe that a limited number of optimizations geared towards reducing the number of I/O operations required to maintain consistency are warranted.

2. How can DSM technologies be employed to improve the performance, portability, and availability of distributed applications and services, such as distributed (clustered) file systems, distributed databases, and directory services [5]?

3. To what extent can highly tailored software DSM systems be made to perform comparably to dedicated hardware DSM machines, given modest amounts of hardware support? As part of the Avalanche multiprocessor project, we support a mixed message passing and DSM model as part of the base architecture. However, unlike dedicated DSM machines, we are implementing the DSM portion of the system as a combination of system-level software and specialized firmware in the communication infrastructure. Further work is necessary to evaluate how well this approach performs compared to both software DSM and fully custom hardware DSM.

References

[1] H. Bal, M. Kaashoek, and A. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, pages 190-205, Mar. 1992.

[2] B. Bershad, M. Zekauskas, and W. Sawdon. The Midway distributed shared memory system. In COMPCON '93, pages 528-537, Feb. 1993.

[3] N. Boden et al. Myrinet: A gigabit-per-second local-area network. IEEE Micro, 15(1):29-36, Feb. 1995.

[4] J. Carter, J. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related communication in distributed shared memory systems. ACM Transactions on Computer Systems, 13(3):205-243, Aug. 1995.

[5] J. Carter, A. Ranganathan, and S. Susarla. Building clustered services and applications using a global memory system. Submitted to the Eighteenth International Conference on Distributed Computing Systems, 1998.

[6] S. Chandra, J. Larus, and A. Rogers. Where is time spent in message-passing and shared-memory programs? In Proceedings of the 6th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 61-73, Oct. 1994.

[7] J. Chase, F. Amador, E. Lazowska, H. Levy, and R. Littlefield. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 147-158, Dec. 1989.

[8] A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. Software versus hardware shared-memory implementation: A case study. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 106-117, May 1994.

[9] P. Dasgupta et al. The design and implementation of the Clouds distributed operating system. Computing Systems Journal, 3, Winter 1990.

[10] M. Dubois, L. Barroso, J. Wang, and Y. Chen. Delayed consistency and its effects on the miss rate of parallel programs. In Proceedings of Supercomputing '91, pages 197-206, 1991.

[11] B. Fleisch and G. Popek. Mirage: A coherent distributed shared memory design. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 211-223, Dec. 1989.

[12] K. Gharachorloo, D. Lenoski, et al. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, Seattle, Washington, May 1990.

[13] K. Johnson, M. Kaashoek, and D. Wallach. CRL: High-performance all-software distributed shared memory. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, 1995.

[14] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, 1992.

[15] D. Khandekar. Quarks: Distributed shared memory as a basic building block for complex parallel and distributed systems. Master's thesis, University of Utah, March 1996.

[16] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, Sept. 1979.

[17] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA highly scalable server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241-251, May 1997.

[18] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, Nov. 1989.

[19] S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing '95, 1995.

[20] D. Scales and K. Gharachorloo. Towards transparent and efficient software distributed shared memory. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, Oct. 1997.

[21] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. Cashmere-2L: Software coherent shared memory on a clustered remote-write network. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, Oct. 1997.

[22] L. Stoller, R. Kuramkote, and M. Swanson. PAINT: PA instruction set interpreter. Technical Report UUCS-96-009, University of Utah Computer Science Department, Sept. 1996.

[23] L. Stoller and M. Swanson. Direct Deposit: A basic user-level protocol for carpet clusters. Technical Report UUCS-95-003, University of Utah Computer Science Department, March 1995.

[24] J. Wilkes. Hamlyn: An interface for sender-based communication. Technical Report HPL-OSR-92-13, Hewlett-Packard Research Laboratory, Nov. 1992.

[25] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, June 1995.