
Predicting the performance of distributed virtual shared-memory applications

by E. W. Parsons, M. Brorsson, and K. C. Sevcik

The use of networks of workstations for parallel computing is becoming increasingly common. Networks of workstations are attractive for a large class of parallel applications that can tolerate the higher network latencies and lower bandwidth associated with commodity networks. Several software packages, such as TreadMarks™, have been developed to provide a common view of global memory, allowing many shared-memory parallel applications to be easily ported to networks of workstations. This paper investigates in detail the performance of several TreadMarks-based shared-memory applications on a modern network of workstations, identifying the extent to which different system components affect the efficiency of these applications. Then a performance model for such applications is developed and used to evaluate the impact future changes in technology are likely to have on performance. The results of the model indicate that current systems are limited in their performance by communications and software overhead for supporting the distributed virtual shared memory, rather than hardware delays.

Even though most parallel computing today is based on a message-passing programming model, it is easier to develop parallel programs using a shared-memory image for interprocess communication. In fact, some algorithms are extremely difficult to parallelize by hand using explicit message passing.1 As a result, much research has been devoted to presenting a shared-memory image to programs running on distributed-memory systems, an approach termed distributed virtual shared memory (DVSM). This is accomplished using a combination of virtual memory protection mechanisms to detect accesses to specific portions of memory, and software exception handlers to ensure that the memory images on different processors are kept consistent. Examples of such software-based DVSM systems are TreadMarks**, CVM, Munin, and Ivy.

It is becoming increasingly common for networks of workstations (NOWs), instead of special-purpose multiprocessors, to be used for parallel computation, primarily because of the cost-effectiveness of using commodity components. Also, the latest processor technology often appears in workstations before it can be incorporated into large-scale multiprocessors, which allows NOWs to attain higher aggregate performance relative to a large-scale multiprocessor for the same investment. This fact can be seen in that several recent winners of the Gordon Bell Prize in the best cost/performance category have used a NOW as their computing platform.2-4

©Copyright 1997 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

IBM SYSTEMS JOURNAL, VOL 36, NO 4, 1997   0018-8670/97/$5.00 © 1997 IBM   PARSONS, BRORSSON, AND SEVCIK

In most such systems, the granularity for memory protection is relatively coarse, namely the page size. With straightforward implementations of DVSM, this granularity leads to false sharing. This effect occurs when one processor updates a data item in a memory page, causing update or invalidation messages to be sent to all other processors that have copies of that page, even though those processors never access the particular data item just updated. With use of relaxed forms of memory consistency, it is possible to allow several processes to write to a particular page simultaneously, without causing excessive exchanges of messages. With such forms of consistency, reasonably good speedups can be obtained for parallel applications, even on a NOW using off-the-shelf LAN (local area network) technology.5
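To make the effect concrete, the following sketch (illustrative code only, not part of any DVSM implementation) counts the invalidation traffic a naive page-granularity protocol would generate when one processor repeatedly writes a variable that merely shares a 4 KB page with a variable cached by another processor:

```python
PAGE_SIZE = 4096  # bytes; the typical granularity of memory protection

def page_of(addr):
    """Map a byte address to its page number."""
    return addr // PAGE_SIZE

def count_invalidations(writes, cached_addrs):
    """Count invalidation messages under a naive page-granularity
    protocol: each write invalidates every other processor's copy of
    the page, even on processors that never touch the written item.

    writes: list of (writer, addr) pairs.
    cached_addrs: dict mapping addr -> set of processors caching the
    page that contains addr."""
    holders = {}  # page number -> processors holding a copy
    for addr, procs in cached_addrs.items():
        holders.setdefault(page_of(addr), set()).update(procs)
    msgs = 0
    for writer, addr in writes:
        msgs += len(holders.get(page_of(addr), set()) - {writer})
    return msgs

# Two unrelated variables that happen to lie on the same page:
x, y = 0, 8                  # byte offsets within one 4 KB page
caches = {x: {1}, y: {2}}    # P1 caches x; P2 caches y
# P1 writes x ten times; P2 never reads x, yet its copy of the page
# is invalidated on every write -- this is false sharing.
assert count_invalidations([(1, x)] * 10, caches) == 10
```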

However, given that processor performance is increasing very rapidly relative to commodity network performance, it is not clear how well such DVSM systems will perform in the future. Whereas significant speedups (e.g., about three for the Barnes application, defined later) were reported for certain applications on eight processors under the TreadMarks system,6 those same applications exhibited about half the speedup running on more recent processors connected by the same network.

In this paper, we describe the way in which we developed a model to predict the performance of parallel DVSM-based applications, using TreadMarks as a case study. The purpose of this model is to allow us to determine the effects of changes in technology on the performance of applications. As an abstraction of a system, a model does not include complete details about every aspect of the system (e.g., the memory system in our case). Despite this, a model can often offer good approximations to performance given certain types of changes. For example, we use this model to investigate the impact of processor speed, network latency and bandwidth, and software overheads on performance. Although this study is based on applications using TreadMarks, we believe the approach we describe is generally applicable to the modelling of applications based on other DVSM systems as well.

The results show, not surprisingly, that consistency actions and lock acquisitions are the limiting factors on performance with current technology. When we break down the costs associated with these operations into software delays (DVSM and communication protocol processing time) and hardware delays (network time and adapter latency), we find that the hardware delays do not constrain the performance as much as the software delays. However, hardware delays will become increasingly significant as processing speeds increase. The results also indicate that aggressive network hardware technologies by themselves are not enough to achieve the same performance for systems with processors eight times faster than today's; in particular, latency-hiding techniques (e.g., prefetching) will be required to maintain levels of processor efficiency comparable to those achieved now.

The next section discusses a specific DVSM system, TreadMarks, and the technology parameters that affect the performance of applications that run under it. The succeeding section presents performance results for a number of applications running on our experimental system using TreadMarks. Then the model that evaluates the performance of these applications is presented, along with a validation of the execution times predicted by our model relative to those actually observed. The model is then used to predict future performance limitations. Related work and conclusions are described in the last two sections, respectively.

Technology parameters affecting performance of DVSM

There have been several experimental implementations of DVSM systems, all of which face the same fundamental challenge of maintaining consistency of shared data without encountering an unacceptable amount of synchronisation and communication overhead. The specific DVSM system we have chosen to study is the TreadMarks system. Developed for a network of workstations, this system allows parallel programs to interact transparently, using a shared-memory model, even though the processors themselves do not physically share memory. TreadMarks, like its similar predecessors, does this by relying on the memory protection mechanism provided by the hardware and operating system to detect accesses to specific regions of memory in each processor, and then invoking necessary actions to maintain memory consistency. These actions involve exchanging messages between processors to update the contents of memory.

Lazy release consistency. In a strong memory consistency model, a programmer can assume that modifications made to memory in a shared-memory segment are automatically and immediately reflected in the memories of other processors. This model would correspond to the behaviour of multiple processes sharing a memory segment on a single workstation (possibly having multiple processors). This model, however, tends to require a large number of message exchanges when implemented on distributed-memory machines.

As a result, a number of weak consistency memory models have been proposed that greatly reduce the amount of interprocess communication. They are based on the observation that most parallel programs serialize modifications to a given data item from multiple processors through the use of locks. In such programs, it is possible to reflect modifications made to memory only when a lock is released while preserving the correctness of the algorithm (since no other process should be reading from or writing to this memory item while the lock is held). In fact, in this model, a processor only needs to be informed of modifications to a data item when it tries to acquire the lock protecting the data. This condition allows different processors to simultaneously modify different variables that lie on the same page, as long as no two processors modify the same variable at the same time.

The lazy release consistency (LRC) protocol7 used in TreadMarks is very similar to the basic release consistency model just described, except that data are simply marked as having been modified upon lock acquisition, delaying the application of the modifications until the data are actually accessed. As this protocol has been described numerous times in the past, we do not repeat this description here. Instead, we provide an example of the protocol in the Appendix, where a number of terms are described in detail. Briefly, a fault occurs when a processor accesses a page that requires some type of consistency action, resulting from the processor having received prior write notices for the page. To make the page consistent, the processor requests diffs from other processors that have modified the page.
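The twin/diff idea behind these consistency actions can be sketched in a few lines. On the first write to a page, a clean copy (the "twin") is saved; a diff is later produced by comparing the modified page against the twin, and a faulting processor patches its stale copy by applying the fetched diffs. This is only an illustration of the concept; the actual TreadMarks encoding is more compact (run-length encoded):

```python
def make_diff(twin, page):
    """Record only the bytes that differ between a modified page and
    its unmodified twin, as (offset, new_byte) pairs."""
    return [(i, b) for i, (a, b) in enumerate(zip(twin, page)) if a != b]

def apply_diff(stale, diff):
    """Patch a stale copy of the page by applying a diff to it."""
    patched = bytearray(stale)
    for i, b in diff:
        patched[i] = b
    return bytes(patched)

twin = bytes(4096)                # clean copy saved at the first write fault
page = bytearray(twin)
page[100], page[3000] = 7, 9      # local writes to two distinct variables
diff = make_diff(twin, bytes(page))
assert diff == [(100, 7), (3000, 9)]
# A faulting processor fetches the diff and patches its stale copy:
assert apply_diff(twin, diff) == bytes(page)
```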

Performance factors. The original TreadMarks study reported reasonable speedups over a range of applications on a network of eight DECstation**-5000/240 workstations interconnected by a Fore ATM (asynchronous transfer mode) switch.6 Since that study was done, processor performance has increased dramatically, while the ATM network technology is still considered to be current. In the future, the performance of DVSM-based applications will be strongly influenced by relative changes in the performance of these two components, as well as by developments in system software and compiler technology. Within this subsection some of the factors that might influence the performance of DVSM applications in the future are discussed.

Processor performance. Processor performance continues to increase roughly by a factor of two every 18 months. If this rate of increasing performance is not matched in other system components, it will be difficult to maintain current levels of performance for parallel applications, as measured by speedup or, equivalently, efficiency. (Speedup is the ratio of execution time on a single processor over that on multiple processors, and efficiency is the speedup divided by the number of processors.) The reason is that, as the computation time decreases, the overheads due to processor interactions will represent a larger fraction of the overall execution time.
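A worked example of these definitions (with made-up numbers): suppose each of eight processors spends 80 seconds computing and 20 seconds in DVSM overhead. If a twice-as-fast processor halves the computation but the overhead stays fixed, efficiency drops:

```python
def speedup(t1, tp):
    """Speedup: execution time on one processor over time on p processors."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t1, tp) / p

compute, overhead, p = 80.0, 20.0, 8
t1 = compute * p                          # serial time for the same work
assert speedup(t1, compute + overhead) == 6.4
assert efficiency(t1, compute + overhead, p) == 0.8
# Processor 2x faster: computation halves, overhead unchanged.
t1_fast, tp_fast = (compute / 2) * p, compute / 2 + overhead
assert round(efficiency(t1_fast, tp_fast, p), 3) == 0.667
```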

If we consider the system used in the study by Keleher6 and Amza et al.,5 the processors represent technology that is now five years old. If other parameters were to exhibit the same rate of improvement, commodity network bandwidth would have to increase from 155 Mbps (megabits per second) to about 1 Gbps (gigabit per second), and network latency would have to drop from about 500 μs (microseconds) to 100 μs for a one-way message. Even though there are no technical reasons preventing network components from keeping pace with advances in processor technology, development efforts for commodity network components tend to focus on increasing bandwidth rather than on reducing latency, the latter being most important to DVSM systems.

Network performance. Two types of off-the-shelf LAN technologies that are of interest in a NOW are Ethernet and ATM. The 10-Mbps Ethernet, which has been very common up to now, is quickly being displaced by 100-Mbps Fast Ethernet, as the latter is standard in many workstations sold today. This technology offers much greater bandwidth, resulting in lower contention and shorter delivery time for large messages, relative to its predecessor. The ATM networks used today, in contrast, typically operate at 155 Mbps, but these will soon be displaced by 622-Mbps networks now emerging on the market. Because it uses a switch, an ATM network will incur less contention than a Fast Ethernet network for independent communications, at the expense of added latency through the switch. This difference disappears, however, if one considers using switched Fast Ethernet hubs, which are becoming increasingly popular.

Network adapters are becoming increasingly capable in terms of the services they can provide. An example is the Cheetah ATM network adapter, developed by IBM.8 This adapter has the ability to read from and write to user-specified data buffers, thus avoiding having to copy messages to and from the kernel as has traditionally been the case. Experiments with these adapters on our system have shown that latencies of 140 μs for small messages can be achieved (as compared to 250 μs for standard Transmission Control Protocol/Internet Protocol, or TCP/IP, stacks). Others have reported even lower latencies using experimental interfaces or protocols (e.g., References 9 and 10).

System software. TreadMarks is but one example of a DVSM system, albeit the one that is most common. As we gain more experience with such software, the overheads associated with DVSM will decrease as more efficient algorithms and data structures are developed. Also, since communication latency due to protocol processing is becoming more significant, future systems are likely to provide lightweight protocols for the types of applications examined in this paper.

Compiler technology. One of the primary performance limitations for DVSM-based applications is the latency of consistency actions and lock acquisitions. Research in compiler technology has devised techniques to hide the latency of cache misses, and these same techniques have been shown to be highly applicable to paging for out-of-core computations.11 Using such techniques to prefetch diffs for pages that will be accessed shortly12 or to acquire locks in advance of when they are needed13 may greatly improve the performance of applications. Hiding latencies will likely become more important in the future, as it is generally easier to increase bandwidth than to reduce latency.

Performance analysis of TreadMarks on a network of workstations

This section describes our experimental system and the performance results for some applications that run on it using TreadMarks.

Experimental platform. Our experimental platform consists of eight IBM RISC System/6000* 43P (133 MHz) workstations, each having 64 MB (megabytes) of main memory, and connected by two commodity networks, a 100-Mbps Fast Ethernet and a 155-Mbps ATM. The Fast Ethernet comprises 3Com 3C595 EtherLink III** PCI (peripheral component interconnect) adapters and a 16-port Cisco hub, whereas Fore PCA-200E ATM adapters and a Fore ASX-200/WG ATM switch comprise the ATM network.

Choice of applications. We have chosen six applications to analyze for this study, including ones that do numerical computation, image analysis, and physical systems modelling. These applications are:

SOR - This kernel performs a typical red-black successive over-relaxation (SOR) on a two-dimensional grid, which involves iteratively updating each element of the grid based on the values of its neighboring elements. The grid is partitioned across processors in contiguous areas, so communication only occurs when a processor must read an element across a boundary; synchronisation is based on a barrier at the end of each iteration.
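For reference, a sequential sketch of the red-black kernel follows (an illustration of the algorithm only, not the benchmark source; grid size, boundary values, and relaxation factor are arbitrary):

```python
def red_black_sor(grid, iters, omega=1.5):
    """Red-black SOR: each interior point is relaxed toward the average
    of its four neighbours. 'Red' points (i+j even) and 'black' points
    (i+j odd) are updated in two half-sweeps, so every half-sweep reads
    only values produced by the previous one."""
    n = len(grid)
    for _ in range(iters):
        for colour in (0, 1):  # red half-sweep, then black
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == colour:
                        avg = (grid[i-1][j] + grid[i+1][j] +
                               grid[i][j-1] + grid[i][j+1]) / 4.0
                        grid[i][j] += omega * (avg - grid[i][j])
    return grid

# Laplace problem: boundary fixed at 1.0 on the top row, 0.0 elsewhere.
n = 8
g = [[1.0 if i == 0 else 0.0 for _ in range(n)] for i in range(n)]
g = red_black_sor(g, iters=200)
# Interior values settle strictly between the two boundary values.
assert all(0.0 < g[i][j] < 1.0
           for i in range(1, n - 1) for j in range(1, n - 1))
```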

IS - This kernel sorts 2^N integers in the range from zero to 2^B - 1, using a bucket sort algorithm. Each iteration in the algorithm consists of two steps. In the first step, the data-sharing pattern is migratory, whereas in the second, it is primarily read-only.
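A simplified stand-in for such a kernel is sketched below; the real IS benchmark ranks keys using per-bucket counts, but the bucket-then-place structure is similar:

```python
import random

def bucket_sort(keys, key_max, nbuckets):
    """Sort small integers by (1) scattering keys into fixed-width
    buckets, then (2) emitting each bucket in order. In a parallel
    version, step 1's bucket data migrate between processors while
    step 2 reads them mostly read-only."""
    width = (key_max + nbuckets) // nbuckets   # keys per bucket
    buckets = [[] for _ in range(nbuckets)]
    for k in keys:
        buckets[k // width].append(k)          # step 1: scatter
    out = []
    for b in buckets:
        out.extend(sorted(b))                  # step 2: place in order
    return out

random.seed(1)
data = [random.randrange(256) for _ in range(1 << 10)]
assert bucket_sort(data, 255, 16) == sorted(data)
```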

Barnes - This application simulates the evolution of a system of bodies under the influence of gravitational forces (e.g., a system of galaxies). The iterative algorithm consists of phases with a producer-consumer data-sharing pattern between processors in different phases. This sharing is relatively fine-grained, causing considerable false sharing and, hence, high fault-handling overhead.

Sphere - The shallow water equations for a spherical surface are solved by this application. The algorithm used in Sphere is typical of those used in climate and weather modelling software.14 Almost all shared-data structures are initialised at the beginning and then accessed in a read-only fashion throughout the remainder of the execution. The one exception is a solution vector whose elements can be updated by any processor while the equations are being solved.

Table 1 Summary of application characteristics

Water - This application simulates a system of water molecules in a liquid state. The main shared-data structure is an array that is accessed by processors in a partitioned manner. The primary sharing pattern is fine-grained and migratory, as intermolecular forces are calculated, but some false sharing occurs across the boundaries of the array.

Raytrace - This application renders a two-dimensional graphical image created from a three-dimensional scene, using a raytrace algorithm. Almost all data are shared read-only or exclusively updated by one processor, as each processor operates on its own partition of the frame buffer used to store the resulting image. However, for load-balancing reasons, processors may steal work from one another, causing some degree of false sharing.

Table 1 lists the parameters used for each application and summarizes their execution-time performance on our experimental platform, based on the average of five trials for each case. The data sets that we have used are in many cases representative of production runs of the programs. However, many of the applications use iterative algorithms, and, in order to reduce the length of experiments, we have limited the number of iterations to only a few. We therefore deliberately analysed only the parallel component of each application, since the serial initialisation would otherwise have had too great an impact on our results. For Barnes and Sphere, two different problem sizes, determined by "particles" and "pitch," respectively, are included in our experiments. (In the case of Barnes, the tolerance differs between the two problem sizes, as recommended by Singh et al.15 in order to achieve the same relative error.)

Execution time overhead. We instrumented TreadMarks to measure the amount of time spent in different components, as follows:

Busy time - Busy time is the time spent by the application on actual computation.

Fault handling - This time is consumed by handling read or write faults of the application to maintain memory consistency across processors. The most significant component of fault handling is the sending of requests to remote processors for (possibly multiple) diffs, and waiting for their replies.

Empty time - This time is consumed in handling first-time misses for pages accessed by each processor. Before consistency action can be taken, a full copy of the page must be obtained from another processor (known as an empty miss). Although empty misses occur as a result of a consistency fault, they occur with varying frequencies in different applications, so they are modelled separately.

Figure 1 The percentage of execution time spent in executing useful instructions and various overhead components

GC time - This time is consumed by garbage collection, which is initiated at the end of a barrier if the amount of memory consumed by TreadMarks data structures exceeds a predefined threshold. Garbage collection results in essentially the same operations as a consistency fault, requesting and receiving diffs, except that it does so for all pages being actively shared. As such, it is an expensive operation, because two processors may exchange diffs during the GC phase even if their sharing patterns do not require such exchanges.

Lock acquisition and release time - This time is consumed in acquiring and releasing locks, which are typically used to serialize access to shared data. Acquiring a lock involves sending a request to the lock manager, which forwards the request to the current holder of the lock. Releasing a lock typically does not require any messaging, unless another processor is already waiting for the lock.

Sigio time - This time is consumed by the handling of asynchronous I/O requests (i.e., SIGIO in UNIX**) from remote processors resulting from DVSM activities (i.e., barriers, faults, or locks). This component differs from the previous ones in that it is not initiated by the local processor, but is rather an overhead imposed by remote processors.

Barrier time - This time is consumed by barrier operations, which are used to synchronise all processors involved in a computation at the same point. Apart from the delay waiting for all processors to arrive at the barrier, the principal cost of a barrier is all the memory protection system calls that must be made to the kernel to invalidate pages as a result of the combined write notices from all processors.

The results for each application are shown in Figure 1. As can be seen, DVSM overheads can be quite significant, accounting for up to 60 percent of the overall execution time in the case of Water. In general, the high cost observed for barriers is not due to messaging, but rather due to imbalance between processors, which causes significant wait times. In many cases, this imbalance arises from different processors having different amounts of DVSM overhead, rather than from imbalance internal to the algorithm. IS, Sphere, and Water all have a large amount of locking overhead, whereas Barnes suffers mostly from garbage collection and fault-handling overhead.


Performance model of applications

In the previous section, we presented the time spent in different components for each of our six applications. Next, we describe a model that we used to study the effects of technology changes on the magnitude of each of these components. Our basic approach to developing this model is to break down the cost of each significant operation in each component. In particular, we separate the time spent in local and remote computation and in all major parts of the communication paths. We also take into account the contention that occurs at local and remote processors resulting from DVSM activities.

To obtain our measurements, we used a lightweight timing facility in the Advanced Interactive Executive* (AIX*) to measure elapsed time between various points in the TreadMarks software. Since we ran all our experiments while the system was quiescent, the effect of daemon activity and interrupts unrelated to the applications is negligible (as is supported by the repeatability of our experiments). All applications were small enough to fit into memory, so there was very little paging activity.

One assumption in our model is that network contention at a global level is negligible, because either a switched network is being used (e.g., an ATM switch) or the available network bandwidth is very high. In particular, we found that network contention was quite low even for our nonswitched Fast Ethernet network. We do, however, take into account the contention that arises between a processor and the network, a problem that can increase the cost of certain DVSM operations.

Given the nature of our model, we do not attempt to characterize memory system performance, and instead assume that memory access costs are in most cases unaffected by the changes in technology that we examine in the next section. It is possible, however, to adjust the performance of the memory system at a global level, as we have done for the case of increasing processor speeds in the second subsection of the next section. Assessing the effects of memory at a more detailed level can only be fully achieved using low-level simulation techniques.

DVSM overheads. Ideally, the time spent in each of the execution time components (as described previously) would be evenly balanced across all processors. As mentioned above, we have found that considerable imbalance can exist in several of these, which can potentially lead to a large modelling error if not taken into account. For example, in Barnes, there are a significant number of faults, but the source of these faults alternates between processor zero and all other processors from one phase of the computation to the next. Averaging the faults across the entire computation would lead to large inaccuracies in the model.

To model the execution time of an application, we consider separately each phase of the computation, as defined by the barriers. For each phase, we first determine the total amount of time spent in each component (busy time or DVSM overhead), whose sum represents the nonidle time for the processor; the time required for the phase is simply the maximum of these sums. (As Figure 2 illustrates, some processors may be idle at the end of a phase, waiting for the slowest to arrive at the barrier. Note that each processor may be in a particular component many times during each phase, which is represented by a single "segment" in the graph.) The total execution time for the application is given by the sum of these maximums over all phases of the computation.
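The estimate can be expressed compactly: given per-component times measured for each processor in each barrier-delimited phase, the predicted execution time is the sum over phases of the slowest processor's nonidle time. The numbers below are invented purely for illustration:

```python
def predict_time(phases):
    """Estimate total execution time as the sum, over barrier-delimited
    phases, of the slowest processor's nonidle time in that phase.

    phases: list of phases; each phase maps a processor id to a dict
    of {component: seconds} (busy time plus DVSM overheads)."""
    total = 0.0
    for phase in phases:
        total += max(sum(comps.values()) for comps in phase.values())
    return total

# Hypothetical two-phase run on three processors (times in seconds):
phases = [
    {0: {"busy": 1.0, "fault": 0.3},
     1: {"busy": 1.1},
     2: {"busy": 0.9, "lock": 0.1}},   # slowest: P0 at 1.3 s
    {0: {"busy": 0.5},
     1: {"busy": 0.5, "gc": 0.4},
     2: {"busy": 0.6}},                # slowest: P1 at 0.9 s
]
assert predict_time(phases) == 1.3 + 0.9
```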

To determine the time required by each component, we instrumented TreadMarks to collect, for each type of operation and each phase of the computation, (1) the number of operations occurring on each processor, (2) the average time spent in software, both on local and remote processors, and (3) the sizes of average messages involved in any interactions. In addition, we observed that requests interrupting remote processors can be delayed for several reasons:

• I/O interrupts may be temporarily disabled if the remote processor is already busy with some other TreadMarks operation, causing the request to be delayed until interrupts are re-enabled.

• The remote processor may have several outstanding requests, behind which the current request must wait.

• The remote processor may be operating in kernel mode, either because it experienced a page fault (possibly requiring a lengthy disk access) or because it made a system call.

Since these delays can be quite significant, we also measure the average time remote operations have to wait as a result of the first two types of delay; we were unable to measure the last because that would have required access to kernel source code.


Figure 2 Components of parallel execution time between barriers. (To estimate the time for a phase, we sum the estimated times required for each component on each processor and take the maximum value.)

We then input these data into a spreadsheet that computes, based on the measurements, the time spent by each processor in each component during each phase, using these to estimate the overall computation time for the application. By breaking down the cost of each operation (as illustrated later), we can then predict the effects changes in technology will have on the performance of the application.

Next, we provide more detail about the way faults and barriers are treated, as these represent the more complex of the components.

Fault time. The major cost of handling a fault is communicating with remote processors to obtain any necessary diffs for a page. Modelling a fault is the most difficult aspect of our model, particularly when diffs must be obtained from several processors.

To illustrate this complexity, consider the case where a processor (P1) must acquire diffs from two other processors (P0 and P2) as a result of a fault. (A similar but extended example illustrating the need for multipart diff requests is presented in the Appendix.) The details of such an operation are illustrated in Figure 3. When the fault occurs, a diff request message is first sent to the two remote processors, each of which computes and returns a diff in parallel. Since


requests are very small (12 bytes), there is no significant wire time in the sending of requests, but it is quite possible for responses to be as large as 4 KB in size, corresponding to wire times of about 330 μs on a Fast Ethernet.

Essentially, a multipart diff request is a pipelined operation where (1) the sending of the requests, (2) the receipt of the replies, and (3) the time spent on the wire must be serialized. In our model, we thus estimate the cost of each of these three components and choose the longest in estimating the overall cost of a multipart diff request.
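This max-of-three-stages rule can be expressed directly (a hedged sketch; the per-part times below are hypothetical, though the ~330 μs wire time for a 4 KB reply comes from the text):

```python
# A multipart diff request is modelled as a three-stage pipeline: the
# sends, the receipt of replies, and the wire transfers each serialize,
# so the longest stage bounds the cost of the whole operation.

def multipart_diff_cost(n_parts, send_us, recv_us, wire_us):
    """All per-part times in microseconds; returns the estimated cost."""
    return max(n_parts * send_us,   # source must issue every request
               n_parts * recv_us,   # source must process every reply
               n_parts * wire_us)   # replies share the same wire

# Two 4 KB replies: the ~330 us per-reply wire time on Fast Ethernet
# dominates the (assumed) software send/receive costs.
print(multipart_diff_cost(2, send_us=90.0, recv_us=150.0, wire_us=330.0))  # 660.0
```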

Barrier time. The barrier time for slave processors (i.e., processors other than the master of the barrier) consists of three phases. First, write notices are created. Second, a synchronous request is then sent to the master, blocking the slave until the master releases the barrier. Finally, memory pages are invalidated according to the write notices that are sent by the master. For the master processor, further processing occurs in receiving barrier request messages from slave processors, merging write notices, and sending release messages back to the slave processors. In general, the computation time of the master processor is greater than that of any other processor.

IBM SYSTEMS JOURNAL, VOL 36, NO 4, 1997

Figure 3 Detailed breakdown of a two-part diff request (units are approximately 100 μs)


In the best case, the master processor arrives at the barrier first, so that the computation is not delayed by the receipt of barrier messages from the slaves. Thus, we model the time for a barrier as the sum of four components: (1) the time for the last processor to send a message to the master, (2) the time to merge write notices, (3) the time for the master to send a message to all the slaves, and (4) the average post-barrier computation time.
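The four-term barrier model is a straightforward sum (an illustrative sketch; all numbers below are hypothetical, not measurements from the paper):

```python
# Barrier cost under the best-case assumption that the master arrives
# first: last arrival's message to the master, write-notice merging at
# the master, the release messages to the slaves, and the average
# post-barrier work (e.g., invalidations).

def barrier_time(last_msg_us, merge_us, release_us, post_barrier_us):
    return last_msg_us + merge_us + release_us + post_barrier_us

print(barrier_time(last_msg_us=200.0, merge_us=150.0,
                   release_us=400.0, post_barrier_us=120.0))  # 870.0
```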

Networking overheads. In order to make it possible to change network parameters, we also break down the cost of sending and receiving messages. For this purpose, we used simple User Datagram Protocol/Internet Protocol (UDP/IP) benchmarks, but since the actual separation between hardware and software costs is difficult to determine without detailed system information, we rely on prior studies16 to choose appropriate values for hardware latencies.

We divide send-side communication as follows: (1) software overhead to transfer a message to the kernel, to go down the protocol stack, and to set up the network adaptor to transfer the message, and (2) latency for the network controller to begin sending the message (assuming a direct memory access, or DMA, model). Costs for the receiver are measured similarly, but we have found in practice that they are close to those of the sender, and so we do not distinguish the two.

More importantly, the time to make a send or receive system call varies considerably from one invocation to the next, presumably because of caching or buffer allocation issues. For example, if we measure the cost of repeatedly sending a one-kilobyte message to a remote processor, in one case flushing the cache between system calls and in the other not, then the send() system call time is 527 μs in the first case and 91 μs in the second for the Fast Ethernet on our system. As a result, we use a microbenchmark suite17 to first estimate the basic hardware and software cost breakdown of a messaging operation, and then use the actual send() and receive() measurements from each application to adjust for the cache and operating system effects just mentioned.

Wire time for UDP/IP messages over Fast Ethernet is computed as the message size plus protocol overheads (46 bytes) divided by the bandwidth; for UDP/IP messages over OC-3-based ATM, we use the established effective bit rate of 135 Mbps.
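The wire-time rule works out as follows (a direct transcription of the formula above, assuming the nominal 100 Mbps for Fast Ethernet):

```python
# Wire time = (message size + 46 bytes of protocol overhead) / bandwidth.
# A bandwidth of B Mbps transfers B bits per microsecond.

def wire_time_us(payload_bytes, mbps):
    bits = (payload_bytes + 46) * 8   # add per-message protocol bytes
    return bits / mbps                # Mbps == bits per microsecond

print(round(wire_time_us(4096, 100.0), 1))  # 4 KB page, Fast Ethernet: 331.4
print(round(wire_time_us(4096, 135.0), 1))  # same page, OC-3 ATM: 245.5
```

The first result matches the "about 330 μs" figure quoted earlier for a 4 KB response on Fast Ethernet.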

The parameters for our two networks are shown in Table 2. Although the software protocol times are


Table 2 Network cost parameters for each of the two networks

Table 3 Analysis of empty time test application (all times in μs)

generally linear in the size of the message, deviations of up to 40 percent can occur across message-size boundaries that are powers of two. For example, the protocol times for 1024-byte and 1025-byte messages are 142.5 μs and 110.5 μs, respectively. It is impossible to deduce the sources of these deviations without detailed information from the vendors, and so, for simplicity, we chose to ignore these fluctuations, averaging out the results in the ranges shown.

Model validation for simple tests. Even though we collect a large amount of data for each application run, these are in the form of averages (e.g., average message size, average number of diff requests per fault, average computation times). Combined with our simple network models, the use of such averages will introduce some error in our estimates of execution time. In this subsection, we explore the accuracy of our model for faults and locks using the test applications provided by TreadMarks.

For each test that follows, including the ones described in the next section, we ran each application


five times on a quiescent system and obtained both the means and variances for all measured parameters. In every case, we found that variances were quite small, with all measurements of a particular parameter differing by only a few percent.

Empty time. In the first test, we examine the cost of empty page faults, as would occur for cold page misses. The test program uses two processors, the second of which faults on 1024 pages managed by the first. Requests for page faults are 12 bytes in size; responses are 4 KB in size (corresponding to the page size). Signals have been measured on our system to take 40 μs. The results are shown in Table 3.

In the last row of the table, we show the estimated time for the operation according to our model, and relate this time to the total amount of time measured in the application for empty faults. (We do not show the actual measured time since it would be redundant.) Clearly, the cost of empty page faults is dominated by software protocol handling, which represents 60 percent and 63 percent of the total time for Fast Ethernet and ATM, respectively. As can be seen, the accuracy of the model is quite high at 96 percent for both tests.
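An empty-fault estimate is assembled by summing the component costs. In the sketch below, only the 40 μs signal cost, the 12-byte/4 KB message sizes, and the wire-time formula come from the text; the software protocol figures are placeholders, not values from Table 3:

```python
# Hedged reconstruction of how an empty-fault cost is composed: request
# software, signal delivery at the remote processor, response software,
# and the wire time for the 4 KB page.

signal_us  = 40.0    # measured signal-delivery cost (from the text)
req_sw_us  = 80.0    # hypothetical: send/receive software for 12-byte request
resp_sw_us = 250.0   # hypothetical: software protocol for the 4 KB reply
wire_us    = (4096 + 46) * 8 / 100.0   # 4 KB page on 100 Mbps Fast Ethernet

empty_fault_us = req_sw_us + signal_us + resp_sw_us + wire_us
print(round(empty_fault_us, 1))  # 701.4
```

With these placeholder values, software accounts for well over half of the total, consistent with the 60-63 percent figure reported above.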

Fault time. In our second test, we examine the cost of making diff requests. The test program has three phases. In the first phase, processor 0 obtains small 20-byte diffs from all other processors; in the second, all processors but the first exchange 20-byte diffs among themselves; and in the third phase, processor 0 obtains large 4120-byte diffs from all other processors. In the following discussion, the source refers to the processor requesting diffs and the destination(s) to the processor(s) from which diffs are being requested. The breakdown for each of


Table 4 Analysis of fault time test application (four processors)

these phases for a four-processor experiment is shown in Table 4.

As described earlier, a multipart diff request is a pipelined operation where either the sending, receiving, or wire time limits the performance. This operation is shown in the table as an entry multiplied by the factor corresponding to the number of processors involved in the multipart diff request. (The times for sending and receiving are broken down in the first two rows to allow the appropriate maximum value to be chosen.) Once again, the software protocol handling time represents a significant fraction of the total time, but DVSM software handling time and remote delays caused by message-handling contention can also be important.

We also show in Table 4 the maximum signal-handling time, which includes send or receive system call time, handler compute time, and delay in processing, at each of the destination processors, taking into account the times at which each request was sent. When the maximum time is large, it indicates that there is significant imbalance among the destination processors; in this case, using this value in the calculation can lead to more accurate results. Showing this value only serves to point out that imbalance exists; our model in the next section does not use this value because of the complexity of trying to do so.


Table 5 shows similar results for an eight-processor experiment.

Lock acquisition. We now examine the cost breakdown for lock acquisition. In the first test, two processors interact in such a way that the first makes repeated lock requests to the second. In the second test, three processors interact in such a way that the first makes repeated requests for locks managed by the second, but most recently accessed by the third. In all cases, the message sizes were 24 bytes for the requests, and four bytes for the replies. The results are shown in Table 6.

Discussion of errors. The previous section compared the model prediction of execution time components to measured values for simple test applications provided by TreadMarks. The accuracy of these models is affected by numerous factors, notably:

• Our model of the network is only approximate, as it assumes that costs grow linearly with the packet size, with no unusual variations; also, in the case of Fast Ethernet, we do not model the effects of packet collisions, which sometimes arise with larger numbers of nodes.

• Measurements of the delay at remote processes do not include delays arising from the remote process running in the kernel; in particular, if two


Table 5 Analysis of fault time test application (eight processors)

Table 6 Analysis of lock time test application

packets arrive nearly simultaneously, the delay on the second imposed by the interrupt-level processing of the first will not be captured.

• Finally, all our measurements represent only averages; this is particularly a problem with operations that involve multiple remote processors, in that it assumes that there is no imbalance, which is not usually the case (even if the operations at the remote processors are identical).

Despite these sources of errors, our model correlates well with actual measurements. The only significant exception is for the case of diff requests involving many parts, primarily resulting from our optimistic model of interprocessor interactions. Since diff requests in practice have few parts, however, we do not expect these errors to be significant.

Model validation for full applications. The test applications demonstrate the basic accuracy of the model by performing exactly the same DVSM operation a large number of times. Real applications do not possess this uniformity and may be more greatly affected by the averaging that we perform. In particular, responses to diff requests may vary greatly, both in size and in the time required to compute and apply the diff, from one request to the next and from one processor to the next. In order to assess these effects, we now examine the accuracy of the model for the full applications summarized in Table 1.

Figure 4 A comparison of the execution time, subdivided into its components, predicted by the model and normalized to the actual execution time

To this end, Figure 4 presents the same information as in Figure 1, with the addition of a breakdown corresponding to the model prediction. As can be seen, we still exceed 90 percent accuracy in the prediction of the total execution time for all applications. Some overheads, such as operating system overhead and network contention, are not part of the model, and therefore the model typically underestimates some of the overhead components. (The biggest discrepancy is in barrier time in IS, but in this case we are dealing with a small absolute error of about 90 milliseconds.) Even so, the model accurately predicts the relative importance of the various execution time components.

Performance trends anticipated for future technologies

In our model, we have carefully broken down the different costs associated with each significant DVSM


operation. This breakdown now allows us to vary different technology parameters and estimate their effects on the execution time of each application. Next, we explore the effects of improvements in DVSM software, networking, and compiler technology on our current system. Then, we consider the effects of faster processors, given both current and future software, networking, and compiler technology. Finally, we consider as a case study a specific next-generation system.

Effects of DVSM improvements. There are two main causes for poor performance in applications running under TreadMarks: lock acquisition and memory consistency (which consists of empty, fault, and garbage collection times). The relative magnitude of these overheads, however, is highly dependent on the relative changes in different technology parameters. Figure 5 shows the overhead reduction when some of the more important parameters are changed. These parameters are (1) the number of faults, (2) lock acquisition costs, (3) protocol costs, and (4) network bandwidth and interface latency, each of which is described in detail in the following subsections.

Figure 5 Overhead reduction for changes in some technology parameters for current-generation system

Fault reduction. The fault-handling time is, of course, most effectively reduced if the application can be restructured to minimize its need for communication. In the context of this study, however, we consider prefetching as a latency-hiding technique to overlap communication with computation, an approach that has already been shown to be effective in both shared-memory multiprocessors19,20 and in networks of workstations.12

Although prefetching may reduce the latency associated with faults, it does not reduce the overhead due to requesting, computing, and applying diffs, and will likely increase traffic on the interconnection network as a result of mispredicted prefetches. In our model, the cost of a prefetch is the sum of software costs that are normally incurred at the source processor for a fault, but without the penalty of waiting for replies. (For the purpose of this analysis, we assume that no faults are mispredicted, to show an upper bound on performance improvements. It is relatively straightforward to augment the model to include mispredicted prefetches.)
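A minimal sketch of this prefetch assumption (the split into source-software and waiting components, and all numbers, are hypothetical):

```python
# A prefetched fault still pays the source-side software cost but hides
# the wait for replies; prefetching a fraction f of faults therefore
# removes only the waiting component for that fraction.

def fault_overhead_us(n_faults, source_sw_us, wait_us, prefetch_fraction):
    hidden = n_faults * prefetch_fraction * wait_us
    return n_faults * (source_sw_us + wait_us) - hidden

# 1000 faults, 100 us source software, 400 us waiting, half prefetched:
print(fault_overhead_us(1000, 100.0, 400.0, 0.5))  # 300000.0
```

Note that with these numbers, prefetching half the faults removes 40 percent of the fault overhead, not 50 percent, because the software cost is not hidden.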

Applications whose overhead is dominated by fault-handling time benefit the most from fault time reduction. SOR is one such application, in which the overhead times can be reduced by 13 percent if half of the faults can be prefetched (see Figure 5). However, the overhead is already relatively small for this application, and the resulting absolute performance improvement is limited.

Other applications benefit less, since the relative importance of the fault-handling overhead time is smaller compared to SOR. Applications that have large-sized diffs, such as Barnes (over 3-Kbyte diffs on average), cannot fully benefit from prefetching, since the software costs at the source and destinations are quite high; moreover, when several diffs must be prefetched, all sends contribute to overhead in the source processor, rather than just the first as for a regular fault. (The remaining sends occur when the processor would otherwise be idle waiting for the response.)

The only application that is almost unaffected by fault reduction is Sphere. The reason is that there is very little read and write sharing in Sphere; most data structures are initialised at the beginning of the execution and then subsequently only read. Thus far, we have only been considering dynamic prefetching based on reference history, but if explicit prefetch operations could be inserted based on compiler analysis, it would be possible for faults due to cold misses (i.e., empty time) also to be reduced.


Lock acquisition time reduction. Prefetching can also be used to hide the latency to acquire locks (by initiating the lock request somewhat before it is needed). Although locks are used by release consistency protocols to maintain consistency, we find that there is little contention for locks in the applications we studied, so the potential increase in lock holding time would not inflate contention unacceptably. It has previously been shown that lock prefetching requests can be generated automatically by a compiler algorithm.

We include the effect of lock prefetching in the model in a manner similar to the case of data prefetching; the major difference in characteristics is that locks require a single send operation from the source and incur much smaller software overheads than faults. As shown in Figure 5, a 50 percent reduction in lock acquisition time translates to reductions in overhead ranging from 8 percent to 20 percent for IS, Sphere, Water, and Raytrace (i.e., all applications having locks).

Protocol reduction factor. The protocol reduction factor reduces all overheads in the model that are related to protocol execution. We find that applications with a high degree of locking overhead or memory consistency overhead (e.g., Barnes and Water) benefit greatly (a 15 percent reduction in overhead for a 50 percent protocol cost) from lower-latency communication facilities (relative to UDP/IP).

We believe that there is potential to reduce the protocol overhead significantly in future systems. This reduction can be achieved by using lighter-weight protocols or by simply using lower levels in the protocol stack (e.g., the AAL, or ATM adaptation layer, in the ATM interface). Furthermore, newer ATM interfaces may perform some of the protocol functions in hardware, which would also reduce latencies.21

Network interface latency and other parameters. Finally, removing the network interface latency has a moderate effect on performance across all applications (except Water). The reason is, of course, that with current processor technology, the majority of the costs lie within software.

Other parameters that we studied did not have an appreciable effect on the execution time of applications. In particular, changes in the network bandwidth do not appear to be important, given our current processor speeds.


Effects of faster processors. Next, we consider the impact of processor performance, both on the efficiency of applications and on the benefit of the types of technology changes described in the previous section. As mentioned earlier, the performance of processors is increasing by a factor of two every 18 months, so in five years, processors may be eight times faster than the ones used today.

To model faster processors, we decrease the time spent in various components according to the increase in processing speed; the only components that are not modified are the network interface latency and the wire delay (arising from bandwidth limitations). In general, however, the benefits of faster processors cannot be fully realized in system software, because such software tends to have poor cache locality for both data and instructions.22,23 For example, the extra system call time that we have included in our model arises because protocol code and data are often not found in cache. As another example, the minimum remote lock acquisition time reported by Cox et al. on 40-MHz processors24 is actually lower than on our system. As a result, we consider two cases: one where all software gets the full benefit of faster processors, and another where the system software gets only a 50 percent benefit. (In particular, we model a system software benefit of x by increasing the processor speed by a factor of x every time the actual processor speed doubles, i.e., increases by 100 percent.) Although this value is only a rough estimate of what may occur in practice, it illustrates how such behaviour can affect performance.
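The partial-benefit rule above can be written as a closed form: if system software speeds up by a factor of x per hardware doubling, its overall speedup is x raised to log2 of the hardware speedup. A small sketch (our reading of the rule, with x = 1.5 for the "50 percent benefit" case):

```python
import math

# System software speedup when hardware speeds up by hw_speedup and
# software gains a factor of x per hardware doubling.
def software_speedup(hw_speedup, x):
    return x ** math.log2(hw_speedup)

print(round(software_speedup(8.0, 1.5), 2))  # 50% benefit, 8x hardware: 3.38
print(software_speedup(8.0, 2.0))            # full benefit: 8.0
```

Thus, an eight-times-faster processor speeds up system software only about 3.4 times under the 50 percent assumption, which is why protocol software remains a bottleneck in that scenario.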

Figure 6 shows the model's prediction of the efficiency of each of our applications as processor performance increases. We show the range of relative processor speeds, from 0.25 (roughly 1992) to 16.0 (roughly 2000). As expected, the efficiency decreases with increasing processor speed because hardware delays become relatively more important. Some applications are affected less than others by this. Sphere, for instance, retains much of its efficiency because most of the overhead is in lock acquisition, which is highly software intensive in both DVSM and communication protocol code. For many of the applications, performance is relatively constant over a range of processor speeds if system software benefits fully from faster processors. If we consider the case where only 50 percent of the benefit can be achieved (Part B of Figure 6), efficiency varies more dramatically; in this case, all applications exhibit rapidly decreasing efficiency, with Barnes and Sphere being most severely affected.


Figure 6 The efficiency as a function of the processor performance (1 represents an IBM 43P-133 MHz)

(Axes: percent efficiency versus CPU performance, 0.25 to 16; curves for SOR, IS, small and big Barnes, small and big Sphere, Water, and Raytrace.)

Table 7 Parameters of a future-generation system

In Figure 7, we show the effects of the same overhead reductions considered in the previous section (and Figure 5), but this time using processors eight times more powerful. In this case, using prefetching for data appears to have the greatest impact on performance, ranging from a 15 percent to a 56 percent reduction in overhead for SOR, IS, Barnes, and Water. If system software cannot benefit fully from increases in processor speeds, however, then the benefit of prefetching will be reduced. Finally, factors that were insignificant with the slower processor, namely network interface latency and network bandwidth, are now much more important.

A representative future system. In five years' time, processor performance will be about eight times that of today's high-performance microprocessors. If other technology parameters stay the same, we can see from Figure 6 that the processor efficiency will decrease. In order to anticipate the future situation, we have applied the model with a set of technology parameters that we believe will be realized within five years (or possibly sooner). These parameters are summarized in Table 7. Current VLSI (very large-scale integration) technology is sufficiently advanced to accommodate 10 Gbps networks and interfaces.21

The signal latency time is likely to decrease with faster processors, and operating systems are likely to provide mechanisms for more effectively handling remote procedure calls.16

Part A of Figure 8 compares the breakdown of execution time of this future system with that of current technology, assuming system software can fully benefit from improvements in processor performance. As can be seen, it is not only possible to maintain the same level of efficiency as today, but also to actually improve it in all cases. The largest improvements are in memory consistency and lock acquisition operations, both overheads that are substantially


Figure 7 Overhead reduction for technology parameter changes when the processor performance is eight times current technology

(Panel A: full system benefit; Panel B: 50 percent system benefit. Bars per application, for SOR, IS, small and big Barnes, small and big Sphere, Water, and Raytrace, give the percent overhead reduction for each parameter change: fault reduction, lock acquisition reduction, reduced or zero interface latency, 50 and 100 percent protocol time reduction, and network bandwidths of 622 Mbit/s and 10 Gbit/s.)

diminished by the reduction in protocol software costs.

Part B of Figure 8 shows the same comparison for the case where system software can only partially benefit from improvements in processor performance. In this case, the efficiency of current systems will not be sustained for most applications, even with the aggressive latency-hiding techniques that might

be used. This reemphasises the fact that the single most important system component to optimize, even in future systems, is the communication protocol software.

Related work

There are many studies of the performance of various DVSM systems, but only a few of them present


models that can be used to predict the performance of future systems.

Figure 8 Execution time breakdown for a future-generation system compared to today's technology


Karlsson and Stenstrom25 studied TreadMarks-based applications running on a system consisting


of a number of small-scale shared-memory multiprocessors connected by a standard OC-3 ATM switch. They used program-driven simulation to study the performance effects of faster processor, network, and interface technologies. Their study confirms our findings that the interface latency is the performance bottleneck, but in contrast to our study, and due to the particulars of their simulation model, they cannot separate communication software cost from the communication hardware overhead in the network interface. Also, because of the limited resources available when simulating a multiprocessor system, they are unable to run realistically sized workloads.

Dwarkadas et al.26 studied the performance of different release consistency options for four applications, using predictions for the cost of various DVSM overheads. In their study, they examined the effects of varying software overheads and of higher-bandwidth networks on the performance of applications; they show how performance for sharing-intensive applications is most significantly affected by the software overheads. Later, Cox et al. used a similar simulation approach (except using better estimates of software overheads) to study the performance of several DVSM applications in (what was then) a next-generation system; in this study, they also considered the effects of reductions in software overheads. Since these studies also used simulation, they used small data sets in their experiments.

Our work differs from these in that we (1) analyse a wider range of applications and larger problem sizes (since we could run on real hardware), (2) present detailed cost breakdowns for several DVSM operations, showing the high software costs actually incurred in modern systems, (3) develop a model of DVSM applications based on these breakdowns, and (4) use this model in conjunction with present-day measurements to predict the effect that a variety of technological changes may have on the performance of these applications. Most significant is that we present a way to analytically model the performance of applications.

Apart from TreadMarks, many other DVSM systems have been proposed and evaluated in the literature. We have chosen to mention a few of the more relevant ones:

• SoftFLASH27 implements a clustered system similar to the one studied by Karlsson et al.25 A major difference from TreadMarks is that SoftFLASH is implemented at kernel rather than at user level.


• CVM28 is a new DVSM system based on the TreadMarks experience. It is designed in C++ with support for multiple consistency models. In his study, Keleher argues that it is actually not necessary to support multiple writers to a shared page, but rather that lazy release consistency is most important.

• Finally, Blizzard-S,29 Shasta,30 and Aurora31 represent DVSM systems that are based on shared objects rather than pages, as in the case of the previously mentioned systems. Since they cannot rely on hardware mechanisms to detect shared-memory accesses, they modify the executable code to check the state of shared objects before accessing them and to invoke the protocol software if needed. None of these studies, however, examines the performance effect of future processors and communications technology.

The modelling technique developed and used in this paper can also be applied to all the above-mentioned systems in order to study the performance effects of changing technologies. Software probes would need to be inserted into the protocol software. This is probably very easy in the systems that run entirely in user mode, whereas in the case of SoftFLASH, access to the kernel would be required. The actual model in this paper is, of course, targeted to TreadMarks and would therefore need to be changed to suit the different protocols accordingly.

Conclusions

A network of workstations is an attractive platform for parallel computing because of its potential to deliver high performance at a relatively low cost. It has previously been shown that it is even possible to achieve reasonable performance for a shared-memory programming model if the consistency model is relaxed.5

Given that processor performance has been increasing much more rapidly than network performance since the earlier measurements of DVSM performance were done, it is expected that application efficiency would be lower for the same problem size on current systems, as we have observed. In this paper, we found that a distributed virtual shared-memory system on a network of workstations can indeed still deliver cost-effective performance, even when using present-day commodity network technology. It is, however, limited to a class of applications that has a sufficiently high computation-communication ratio, such as those examined in this paper.


Our work shows how a model can be developed for parallel applications running on a DVSM system, which we use to study their performance as changes occur in technology. As has been noted before, the main bottleneck for DVSM systems with current technology is the software overhead in the communication protocol. Latency-hiding techniques to reduce fault-handling time and lock acquisition time can be effective if implemented. However, with the expected performance of future processors, the performance of these applications is likely to be constrained more by hardware-related delays such as network interface and wire time.

Our model shows that the anticipated improvements in DVSM and networking technology are likely to permit the same relative application performance to be maintained over the near to medium term. However, if system software cannot be written so as to take full advantage of faster processors, it will be almost impossible to achieve the same speedups (equivalent efficiency) as today. Also, it will not be possible to maintain this performance if latency-hiding techniques are not used for memory faults and lock acquisitions.

In this study, we have focussed on how changes in protocol and DVSM software and in processor and network hardware will affect speedup for a set of applications of specific size executing on eight processors. Problems of greatest importance that will execute on DVSM systems of the future will involve much larger computations (e.g., sequential execution times of hours or days rather than minutes or seconds) and will require many more processors. The modelling approach described in this paper can be immediately applied to any problem by running it on an existing system to gather the base statistics. With a deeper understanding of the various application parameters that are used by the model, it would also be possible to apply the approach to problems too large to be run today.

Acknowledgments

The network of workstations used for this study is part of the Parallelism on Workstations (POW) project, which is a cooperative project between the University of Toronto and the Centre for Advanced Studies at the IBM Toronto Development Laboratory. Three major themes within the POW project are (1) development of compilers that support automatic parallelization for network-of-workstations environments, (2) exploitation of prefetching to overcome remote data access latencies in distributed-memory systems, and (3) efficient multiprogrammed scheduling of workloads dominated by parallel jobs.

We would like to thank Giridhar Chukkapalli, who kindly provided the Sphere application. The research in this paper was supported in part by the Natural Sciences and Engineering Research Council of Canada, the Information Technology Research Centre of Ontario, Northern Telecom, and the Swedish National Board for Industrial and Technical Development (NUTEK) under project number P855.

Appendix

To illustrate the LRC protocol, consider three processes sharing a single distributed virtual shared-memory page that contains two variables, V1 and V2, each protected by a lock, L1 and L2, respectively. Figure 9 depicts a sequence of actions taken by the processors. Initially, the page is marked as valid but write-protected in all three processors; all processors can read the variables. Next, processor 0 acquires the lock L1, and roughly at the same time processor 2 acquires lock L2. When these processors modify the variables corresponding to the lock they acquired for the first time, a page fault will occur (since the page was initially write-protected), and a local copy of the page will be made in each processor; these copies, or twins, can later be used to determine which portions of the page have been modified. The page is then unprotected, allowing reads and writes to proceed uninterrupted. Later, when the processors release their locks, the fact that the page has been modified is recorded in a write notice.

After the locks have been released, processor 1 acquires both locks, presumably to either read or write variables V1 and V2. Acquiring a lock involves sending a message to a preassigned manager of the lock, which forwards the request to the processor that last held the lock, which in turn responds with any write notices that are associated with the lock. In our example, a message is sent to both processors 0 and 2, both of which respond with a write notice for the same page, causing processor 1's copy of the page to be invalidated. When processor 1 subsequently accesses the page, a request is made to both other processors for diffs, which record what changed in the page on a given processor. (A diff is computed by comparing the current copy of the page against its twin.) After the diffs have been computed, the twins can be safely discarded.

IBM SYSTEMS JOURNAL, VOL 36, NO 4, 1997

Figure 9 The lazy release consistency protocol in TreadMarks. (Processors are not notified of changes until the lock acquisition, and pages do not get updated until the first reference after the lock acquisition.)

[Figure 9 timeline, labels recovered from the original figure: P0 makes its first write reference to V1 after acquiring lock L1 and creates a twin; on releasing lock L1 it records a write notice, and later computes the diff for V1 and discards the twin. P2 likewise creates a twin, records a write notice on releasing lock L2, and computes the diff for V2. P1, on its first reference to the page after acquiring both locks, obtains and applies the diffs and then continues with the memory reference.]

Processor 1 uses the diffs it receives to update its own copy of the page with the modifications made by processors 0 and 2. Hence, once processor 1 has received and applied both diffs, there will be three different versions of the page: one each on processors 0 and 2 that reflect the changes done to the page locally, and one on processor 1 with an updated status containing changes made by both processors 0 and 2. In order for this multiple-writer scheme to work, it is assumed that the programmer does not associate overlapping memory regions with different locks, since that would cause the diffs to partly relate to the same addresses, and the final state of a shared page would depend on the order in which the diffs were applied.

TreadMarks also supports barrier synchronisations which, in addition to synchronising all processors, also cause the processors to exchange write notices for all shared-memory pages. Each barrier is associated with a managing processor that coordinates the actions of other processors. Basically, the manager collects the write notices from all other processors as they reach the barrier, and then redistributes them back to all processors involved in the computation. TreadMarks uses barriers to initiate garbage collection (GC) if the amount of memory consumed by write notices, diffs, and twins exceeds a predefined threshold on any processor. If garbage collection is initiated, then at the end of the barrier, all processors compute and exchange diffs for all pages, making all copies of each shared-memory page identical.

™Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Parallel Tools, LLC, Digital Equipment Corporation, 3Com, Inc., or X/Open Company, Ltd.

Cited references

1. J. P. Singh, A. Gupta, and M. Levoy, "Parallel Visualization Algorithms: Performance and Architectural Implications," Computer 27, No. 7, 45-55 (July 1994).


2. A. H. Karp, K. Miura, and H. Simon, "1992 Gordon Bell Prize Winners," Computer 26, No. 1, 77-82 (January 1993).

3. A. H. Karp, M. Heath, D. Heller, and H. Simon, "1994 Gordon Bell Prize Winners," Computer 28, No. 1, 68-74 (January 1995).

4. A. H. Karp, M. Heath, and A. Geist, "1995 Gordon Bell Prize Winners," Computer 29, No. 1, 79-85 (January 1996).

5. C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations," Computer 29, No. 2, 18-28 (February 1996).

6. P. Keleher, Lazy Release Consistency for Distributed Shared Memory, Ph.D. thesis, Department of Computer Science, Rice University, Houston, TX (January 1995).

7. P. Keleher, A. Cox, and W. Zwaenepoel, "Lazy Release Consistency for Software Distributed Shared Memory," Proceedings of the 19th International Symposium on Computer Architecture (May 1992), pp. 13-21.

8. R. Mraz, D. Freimuth, E. Nowicki, and G. Silberman, "Using Commodity Networks for Distributed Computing Research," Proceedings of Workshop on Computer Networking: Putting Theory to Practice, 1995 Asian Computing Science Conference, Pathumthani, Thailand (December 1995), pp. 6-13.

9. E. Felten, R. Alpert, A. Bilas, M. Blumrich, D. Clark, S. Damianakis, C. Dubnicki, L. Iftode, and K. Li, "Early Experience with Message-Passing on the SHRIMP Multicomputer," Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA (May 22-24, 1996), pp. 296-307.

10. T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (December 1995), pp. 40-53.

11. T. Mowry, A. Demke, and O. Krieger, "Automatic Compiler-Inserted I/O Prefetching for Out-of-Core Applications," Proceedings of the Second Symposium on Operating Systems Design and Implementation (1996), pp. 3-18.

12. M. Karlsson and P. Stenstrom, "Effectiveness of Dynamic Prefetching in Multiple-Writer Distributed Virtual Shared Memory Systems," to be published in Journal of Parallel and Distributed Computing (September 1997).

13. M. Karlsson and P. Stenstrom, "Lock Prefetching in Distributed Virtual Shared Memory Systems-Initial Results," IEEE CS Technical Committee on Computer Architecture Newsletter, 41-48 (March 1997).

14. G. Chukkapalli, personal communication, Department of Mechanical Engineering, University of Toronto, Toronto (e-mail: [email protected]).

15. J. P. Singh, J. L. Hennessy, and A. Gupta, "Implications of Hierarchical N-body Methods for Multiprocessor Architecture," ACM Transactions on Computer Systems 13, No. 2, 141-202 (May 1995).

16. C. A. Thekkath and H. M. Levy, "Limits to Low-Latency Communication on High-Speed Networks," ACM Transactions on Computer Systems 11, No. 2, 179-203 (May 1993).

17. L. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," Proceedings of the USENIX 1996 Annual Technical Conference (1996), pp. 279-294.

18. L. G. Cuthbert and J.-C. Sapanel, Chapter 2 in ATM-The Broadband Telecommunications Solution, The Institution of Electrical Engineers, London, UK (1993).

19. F. Dahlgren and P. Stenstrom, "Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems 7, No. 4, 385-398 (April 1996).


20. T. Mowry and A. Gupta, "Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors," Journal of Parallel and Distributed Computing 12, No. 2, 87-106 (June 1991).

21. P. Sundstrom, M. Karlsson, and P. Andersson, "An Interface Architecture for a Low-Latency Network of Workstations Using 10 Gbit/s Switched LAN Technology," Proceedings of the IASTED International Conference on Parallel and Distributed Systems, Euro-PDS'97, Barcelona (June 1997), pp. 166-176.

22. J. Chen and B. Bershad, "The Impact of Operating System Structure on Memory System Performance," Proceedings of the Fourteenth Symposium on Operating System Principles (1993), pp. 120-133.

23. A. Maynard, C. Donnelly, and B. Olszewski, "Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads," Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (October 1994), pp. 145-156.

24. A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel, "Software Versus Hardware Shared-Memory Implementation: A Case Study," Proceedings of the 21st International Symposium on Computer Architecture (1994), pp. 106-117.

25. M. Karlsson and P. Stenstrom, "Performance Evaluation of a Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based Multiprocessor Servers," Proceedings of the 2nd Conference on High Performance Computer Architecture (February 1996), pp. 4-13.

26. S. Dwarkadas, P. Keleher, A. Cox, and W. Zwaenepoel, "Evaluation of Release Consistent Software Distributed Shared Memory on Emerging Network Technology," Proceedings of the 20th Annual International Symposium on Computer Architecture (1993), pp. 144-155.

27. A. Erlichson, N. Nuckolls, G. Chesson, and J. Hennessy, "SoftFLASH: Analysing the Performance of Clustered Distributed Virtual Shared Memory," Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (October 1996), pp. 210-221.

28. P. Keleher, "The Relative Importance of Concurrent Writers and Weak Consistency Models," Proceedings of the 16th International Conference on Distributed Computing Systems (May 28, 1996), pp. 91-98.

29. I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood, "Fine-Grain Access Control for Distributed Shared Memory," Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI) (October 1994), pp. 297-307.

30. D. J. Scales, K. Gharachorloo, and C. A. Thekkath, "Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory," Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (October 1996).

31. P. Lu, "Aurora: Scoped Behaviour for Per-Context Optimized Distributed Data Sharing," Proceedings of the 11th International Parallel Processing Symposium (April 1997), pp. 467-473.

General references

J. K. Bennett, J. B. Carter, and W. Zwaenepoel, "Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence," Proceedings of the 2nd ACM Symposium on Principles and Practice of Parallel Programming (1990), pp. 168-176.

K. Gharachorloo, A. Gupta, and J. L. Hennessy, "Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors," Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (April 1991), pp. 245-257.

L. Iftode, J. P. Singh, and K. Li, "Understanding Application Performance on Shared Virtual Memory Systems," Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA (May 22-24, 1996), pp. 122-133.

K. Li, "IVY: A Shared Virtual Memory System for Parallel Computing," Proceedings of 1988 International Conference on Parallel Processing (1988), pp. 94-101.

J. P. Singh, W.-D. Weber, and A. Gupta, "SPLASH: Stanford Parallel Applications for Shared-Memory," Computer Architecture News 20, No. 1, 5-44 (March 1992).

Accepted for publication May 20, 1997.

Eric W. Parsons Department of Computer Science, University of Toronto, 10 King's College Road, Toronto, Ontario, Canada M5S 3G4 (electronic mail: [email protected]). Dr. Parsons currently works at the Computing Technology Laboratory at Northern Telecom. He recently completed his Ph.D. in the Department of Computer Science at the University of Toronto in the area of multiprocessor scheduling. His primary research interests are in performance analysis, particularly in relation to multiprocessor systems and mobile computing. He will be joining the Department of Electrical and Computer Engineering at the University of Toronto as an assistant professor in 1998.

Mats Brorsson Department of Information Technology, Lund University, P.O. Box 118, SE-221 00 Lund, Sweden (electronic mail: [email protected]). Dr. Brorsson is an associate professor in the Department of Information Technology at Lund University, Sweden. His main research interests are in parallel architectures and, in particular, performance analysis of shared-memory parallel applications. He received the M.Sc. and Ph.D. degrees in 1985 and 1994, respectively, both from Lund University.

Kenneth C. Sevcik Department of Computer Science, University of Toronto, 10 King's College Road, Toronto, Ontario, Canada M5S 3G4 (electronic mail: [email protected]). Dr. Sevcik is a professor of computer science with a cross-appointment in electrical and computer engineering at the University of Toronto. He was the past Director of the Computer Systems Research Institute and past Chairman of the Department of Computer Science. He received a B.S. in mathematics from Stanford University in 1966 and a Ph.D. in information science from the University of Chicago in 1971. His primary area of research interest is in developing techniques and tools for performance evaluation, and applying them in such contexts as distributed systems, database systems, local area networks, and parallel computer architectures.

Reprint Order No. G321-5657.
