A Simple DSM System Design and Implementation

T. Ramesh∗, Chapram Sudhakar∗∗

Dept. of CSE, NIT-Warangal, INDIA

Abstract

In this paper, a Simple Distributed Shared Memory (SDSM) system, a hybrid of the shared memory and message passing paradigms, is proposed. This hybrid effectively combines the benefits of shared memory, in terms of ease of programming, and of message passing, in terms of efficiency. Further, it is designed to effectively utilize state-of-the-art multicore-based networks of workstations and supports the standard pthread interface and the OpenMP model for writing shared memory programs, as well as the MPI interface for writing message passing programs. The system has been studied using the standard SPLASH-2 (Stanford ParalleL Applications for SHared memory - 2), NPB (NAS Parallel Benchmarks) and IMB (Intel MPI Benchmarks) suites and some well known parallel algorithms. Its performance has been compared with the JIAJIA DSM system, which uses an efficient scope consistency model, for shared memory programs, and with the MPI library on a network of Linux systems for MPI programs. In many cases it was found to perform much better than those systems.

∗ Principal corresponding author
∗∗ Corresponding author

Email addresses: [email protected] (T. Ramesh), [email protected] (Chapram Sudhakar)

Preprint submitted to Journal of Parallel and Distributed Computing September 10, 2009

Key words: Distributed Shared Memory System, Memory Consistency Model, OpenMP, Pthreads, Network of Workstations

1. Introduction

Networks of workstations provide a cost-effective alternative to expensive supercomputers for building a parallel programming environment. Such a network can be used as a message passing system or as a shared memory system, for which a software-based DSM view has to be provided.

There are mainly two categories of DSM implementations [9] - hardware level and software level. In hardware level DSM implementations, communication latency is considerably reduced because of fine grain sharing, which minimizes the effects of false sharing, and searching and directory functions [11] are much faster. Examples of hardware DSM implementations are Memnet (Memory as Network Abstraction) [5], SCI (Scalable Coherent Interface) [7] and the KSR family of systems [10]. In software level DSM implementations, the mechanism is quite similar to that of virtual memory management, except that on a page fault the data is supplied from the local memory of a remote machine. Software based DSMs can be implemented at several levels, such as the compiler, runtime library and operating system levels. In compiler level implementations, accesses to shared data are automatically converted into synchronization and coherence primitives, as in Linda [1]. In the case of user level runtime packages, the DSM mechanism is provided in the form of run time library routines that can be linked to the application; examples of this category are IVY [12] and Mermaid [21]. At the operating system level, DSM can be implemented inside or outside the kernel. In the first case the DSM mechanisms are incorporated into the operating system kernel, whereas in the second case the mechanism is incorporated into specialized software controller processes. Mirage [6] is an example of an inside-the-kernel implementation and Clouds [17] is an example of an outside-the-kernel implementation. Operating system level (inside the kernel) implementations are generally considered to be the most efficient among software DSM implementations. The semantics of the operating system can be standardized, and applications need not be modified for porting to a new DSM system [9]. This standardization simplifies the programming effort in distributed applications.

The proposed SDSM system is an inside-the-kernel implementation. The system supports various memory consistency models and gives the user the flexibility of selecting the memory consistency model suitable for a given parallel processing application. It provides a simple pthread-standard interface and implicit communication primitives for writing shared memory programs, and explicit communication primitives for message passing parallel programs. Both shared memory and message passing can also be combined in a single program, using the concept of thread private memory. Multiprocessor or multicore systems can be programmed efficiently using multithreading features (the shared memory approach) and networks of workstations using message passing features (the message passing approach). So the combined features of both shared memory systems and message passing systems can be exploited in a single program, for efficient use of multicore-based networks of workstations.


2. Design

The proposed system architecture is shown in Figure 1. It is designed as a multithreaded kernel. The lowest level is the hardware abstraction layer, which consists of low level memory management, low level task management, basic device drivers and the scheduler. The low level memory management module is responsible for physical (core) memory allocation and freeing. The task management module provides processor level task-related functionality such as hardware context creation, context switching etc. The low level device drivers provide very basic functionality for using devices such as the clock, storage, NIC etc. The scheduler is responsible for scheduling the low level tasks using a priority based round robin scheduling algorithm. At the second level, local process, virtual memory, file system and communication management services are provided. A local process contains an address space and a set of kernel level threads. The local process management module provides functionality for creation, termination and waiting/joining of processes and threads. The virtual memory management module provides services for paging based memory operations such as memory segment creation and expansion, page table initialization and page attribute manipulation. The communication system is specially designed for efficiency, using a simple two-layer reliable packet delivery protocol. Using these second level services, sequential process and parallel process interfaces are provided to application programs at the third level. The parallel process interface provides services for remote thread creation and termination, distributed synchronization (locks, barriers and conditions) and a set of memory consistency models. Some of the important modules of this system are described in the following subsections.


Figure 1: Layered Architecture of SDSM System.

2.1. Communication Management

The communication module is designed to suit the needs of common networking applications as well as parallel applications. At the lowest level of the communication software, an Ethernet frame delivery service is provided at the data link layer. On top of this, a reliable packet delivery protocol is built. Using this reliable packet delivery service, different types of high level communication services are developed, such as stream services, datagram services, multicasting within a process group and broadcasting among processes of the same type on different nodes. The interfaces supported by the communication subsystem are shown in Figure 2.

Figure 2: Communication Protocols and Interfaces.

The reliable packet delivery service is capable of delivering various types of packets between two end points. The end points used at this level are low level internal sockets, identified by their type, node number and port number. The type can be stream, datagram or default. One default socket is created for each thread of a parallel process. The reliable packet delivery protocol is based on a sliding window protocol. It has functionality for eliminating duplicate packets, retransmitting lost packets and flow control, and it uses checksums for detecting errors in received packets. On top of this reliable packet delivery service two types of interfaces are developed, namely the socket interface and the parallel process communication interface (Figure 2).
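As a concrete illustration, the following is a minimal sketch of the kind of per-packet header such a sliding window protocol could carry. The field names and sizes are assumptions made for this sketch and do not describe the actual SDSM wire format.

    #include <stdint.h>

    /* Illustrative only: field names and sizes are assumptions,
     * not the actual SDSM packet layout. */
    struct pkt_hdr {
        uint8_t  type;       /* stream, datagram or default traffic      */
        uint8_t  src_node;   /* sending node number                      */
        uint8_t  dst_node;   /* receiving node number                    */
        uint16_t src_port;   /* low level internal socket (port) numbers */
        uint16_t dst_port;
        uint32_t seq;        /* sliding window sequence number           */
        uint32_t ack;        /* cumulative acknowledgement               */
        uint16_t window;     /* advertised window, used for flow control */
        uint16_t checksum;   /* error detection on the received packet   */
        uint16_t length;     /* payload length in bytes                  */
    };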

2.1.1. Socket Interface

The BSD socket interface is very popular for network programming. The same interface is provided in the new system for supporting message based parallel applications and, in general, any networking application. There are two types of sockets: stream and datagram sockets.


2.1.2. Parallel Process Communication Interface

All threads of a global process, whether remote or local, automatically form a group. In addition to the shared memory abstraction, if the application is aware of the physical distribution of threads among different systems, then those threads can use the parallel process communication interface. This interface has only two primitives, namely send and receive, and both support blocking and non-blocking operation.

The send primitive provides functionality for sending a message to a particular thread (unicast) or to all threads of the process (multicast). The daemon (service providing) threads of the same type on all nodes also form a group, and the send primitive can also be used for broadcasting within that daemon thread group.

The receive primitive provides functionality for receiving messages selectively or in order of arrival. The selection can be based on the source of the message, the type of the message or both; each message can carry a specific type value. This function also supports peeking at messages. When the supplied input buffer is not large enough to hold the message, the message can optionally be truncated, or the call returns the size of the buffer needed to hold the message as a negative value.
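To make the description concrete, the declarations below sketch what such an interface could look like in C. The function names, flag names and exact signatures are assumptions made for this sketch; the paper does not give the actual primitive prototypes.

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical signatures for the send / receive primitives
     * described above; the real SDSM names may differ. */

    #define COMM_ALL_THREADS  (-1)  /* multicast to all threads of the process */
    #define COMM_ANY_SOURCE   (-1)  /* receive from any source thread          */
    #define COMM_ANY_TYPE     (-1)  /* receive a message of any type           */
    #define COMM_NONBLOCK     0x01  /* non-blocking send or receive            */
    #define COMM_PEEK         0x02  /* look at a message without removing it   */
    #define COMM_TRUNCATE     0x04  /* truncate a message that does not fit    */

    /* Unicast to one thread, or multicast when dest_thread is COMM_ALL_THREADS. */
    ssize_t comm_send(int dest_thread, int msg_type,
                      const void *buf, size_t len, int flags);

    /* Selective receive by source and/or type; returns the message length,
     * or the required buffer size as a negative value when the buffer is
     * too small and COMM_TRUNCATE is not given. */
    ssize_t comm_recv(int src_thread, int msg_type,
                      void *buf, size_t len, int flags);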

2.2. Local Processes and Global Processes

The proposed DSM system supports two types of processes, namely local processes and global processes. A local process is bound to a single system and has only local threads. A global process spans many systems and has both local threads and remote threads. The local process structure includes information about the process such as its unique identifier, memory management information, scheduling parameters, the open file table and a list of local thread descriptors. Each thread descriptor points to a local thread list entry, which contains thread specific information such as the thread identifier, stack, context and various masks regarding signals, interrupts etc. Global processes additionally include the home machine (where the process was created), a list of remote thread descriptors and a memory consistency model descriptor.

To increase the level of concurrency / parallelism in an application, the system supports kernel level threads. The system can accommodate a large number of threads in each process, but in practice there is no benefit in having more threads than the number of available processors, as that only increases overhead. Once a global process is created, a new thread can be created in that process on any one of the available nodes; the selection of a node can be based on the current load conditions.

2.3. Name Server and Remote Thread Execution Server

The name server is responsible for allocating unique global process and thread identifiers. When a global process is being created, the home node registers its name with the name server and obtains a unique global process identifier. Subsequently, when that global process creates other threads, it obtains unique thread identifiers from the name server. In addition, the name server is responsible for allocating stack regions from the global stack segment. In order to provide these services, the name server maintains information about all global processes and threads. The name service is provided by the coordinator node; any node can assume the role of coordinator.

Requests for remote thread creation, termination, monitoring etc. are handled by the remote execution server, which uses the services of the local process module. In order to create a remote thread, the requestor sends the process identifier, the new thread identifier, the thread starting routine address, the thread parameters, the list of memory segments etc. to the required remote execution server. When the receiver is creating the first thread of the process, it loads the executable image, sets up the environment for the new thread, creates the context and adds the thread to the ready queue. Subsequent thread creation on the same node does not require loading the image again. The data segment is not allocated; only the memory management data structures, i.e., page tables, are prepared, with the pages marked as not present. On demand, those pages are migrated or replicated according to the memory consistency model. The threads created can be joinable or detached. When a remote thread terminates, it releases its allocated thread region and, if it is not a detached thread, it also sends its exit status to the home (originating) node; a detached thread need not send its exit status. During its execution, a thread may also receive cancellation requests from other threads of the process. These cancellation requests are delivered to the remote thread through the remote execution server, and the remote thread accepts the cancellation request at a safe point of its execution.
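The structure below sketches the information a remote thread creation request carries, as listed above. The structure name, field names and the fixed segment bound are assumptions introduced only for illustration.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative layout of a remote thread creation request; the names
     * and the segment limit are assumptions, not the actual SDSM format. */
    struct remote_thread_request {
        uint32_t  process_id;     /* global process identifier                */
        uint32_t  thread_id;      /* identifier obtained from the name server */
        uintptr_t start_routine;  /* address of the thread starting routine   */
        uintptr_t arg;            /* thread parameter                         */
        uint8_t   detached;       /* joinable (0) or detached (1) thread      */
        uint32_t  nsegments;      /* number of memory segment descriptors     */
        struct {
            uintptr_t base;       /* segment start address                    */
            size_t    length;     /* segment length                           */
            uint32_t  prot;       /* read-only or read/write                  */
        } segments[16];           /* illustrative fixed upper bound           */
    };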

2.4. Synchronization Module

As this platform supports parallel and distributed applications, there is a need for synchronization among local processes as well as global processes. The synchronization module provides synchronization operations inside the kernel (for system services) using binary semaphores, spin locks and general event objects. For applications, mutex objects, condition objects and barrier objects are provided through the pthread interface. Another common interface, POSIX semaphores, which is used in some standard parallel applications, is also provided. In addition, global synchronization mechanisms are implemented to support parallel and distributed applications: general locks that can serve as binary and counting semaphores and as mutexes, condition objects and barriers are provided using distributed algorithms. This global synchronization functionality is transparently available to parallel and distributed applications through the standard pthread interface or the POSIX semaphore interface.
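Because the global synchronization is exposed through the standard pthread interface, an ordinary pthread program needs no source changes to use it. The small program below is a minimal sketch written against the standard pthread API only (it is not taken from the paper); under SDSM the same mutex and barrier calls would be served by the distributed synchronization module.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_barrier_t barrier;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long sum = 0;

    static void *worker(void *arg)
    {
        long id = (long)arg;

        pthread_mutex_lock(&lock);       /* global mutex under SDSM        */
        sum += id;
        pthread_mutex_unlock(&lock);

        pthread_barrier_wait(&barrier);  /* distributed barrier under SDSM */

        if (id == 0)
            printf("sum = %ld\n", sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }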

2.5. Memory Consistency Models

For better availability of data, DSM systems generally replicate data among different nodes. Replication causes inconsistency when the same data is modified simultaneously by threads on different nodes. To avoid such inconsistencies, several memory consistency models have been defined for shared memory systems, and the programmer must consider which memory consistency model a particular shared memory system supports. The new DSM system provides several consistency models, such as the sequential, PRAM, causal, weak, entry and lazy release memory consistency models. A default consistency model, the Improved Lazy Release Consistency (ILRC) model, is provided in the SDSM system. The ILRC implementation takes an annotated input source program. All the variables of the source program are divided into two categories, small data items and large data items. Sets of small data items that are related with respect to synchronization are kept together in one page, and different such sets are placed into separate pages. The compiler includes a size element for each related data item set, which indicates the amount of memory used by the variables in that page. Even though most of such a page is wasted, a parallel application usually has very few small data items of this kind, so the overall wastage is small. Write notice records are maintained for small data item pages. For large data items, which may span multiple pages, no additional information is required; those pages are handled normally with a write-invalidation protocol. An important part of this module is the provision of a framework for supporting any memory consistency model; this framework enables the addition of new or modified consistency models through its generic interface.

The choice of consistency model is generally decided by the application requirements. For example, if an application heavily uses locking and unlocking operations, then the lazy release or scope consistency models [2, 13, 16, 19] are best suited. If the application protects small portions of data with locks, and data and the corresponding locks are kept together, then the entry consistency model [2, 19] is more suitable. If the application is a general one with only a few lock and unlock operations and no clear relationship between data and locks, then the sequential [2, 19] or lazy release consistency models are applicable. A memory consistency model exhibiting higher performance in one application may not be suitable for other applications; providing multiple consistency models suitable for various parallel applications is therefore one of the important design goals of the SDSM system. The use of any consistency model is fully transparent to the application: once the consistency model is selected, the system automatically calls the model specific functions in the necessary situations, without the involvement of the application.


Multiple consistency models are supported in this system through a common interface that consists of a set of operations, as shown in Figure 3. This common interface links the rest of the system to the memory consistency module. The interface contains a set of function pointers for handling memory consistency model related events; these function pointers refer to the model specific event handling functions. Whenever a memory consistency related event such as a page fault, lock or unlock occurs, a call is made through the common interface object to the specific event handler function of the memory consistency model. The sources of these events may be locally running threads, remote threads (in which case they arrive as remote requests), or replies to previously generated local requests. Some of the event handling functions of the interface may not be relevant for a particular memory consistency model; in such cases the unrelated function pointers refer to dummy (empty) functions.
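The fragment below sketches one way such a function pointer based interface could be declared in C. The type and member names are assumptions made for this sketch; the event set (page fault, lock, unlock, barrier, remote request and reply) follows the description above.

    struct consistency_event;   /* event details: faulting address, lock id, ... */
    struct global_process;      /* the process the event belongs to              */

    /* A single handler type keeps the interface uniform, so events that are
     * irrelevant for a model can point to one dummy (empty) handler. */
    typedef int (*cm_handler_t)(struct global_process *gp,
                                struct consistency_event *ev);

    struct consistency_model_ops {
        cm_handler_t page_fault;      /* read or write fault on a shared page  */
        cm_handler_t lock_acquire;    /* lock / mutex acquire                  */
        cm_handler_t lock_release;    /* lock / mutex release                  */
        cm_handler_t barrier_wait;    /* barrier arrival                       */
        cm_handler_t remote_request;  /* request received from a remote node   */
        cm_handler_t remote_reply;    /* reply for a locally generated request */
    };

    static int cm_dummy(struct global_process *gp, struct consistency_event *ev)
    {
        (void)gp; (void)ev;
        return 0;                     /* event not relevant for this model     */
    }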

The functionality of the memory consistency manager is divided into logically separate parts, to reduce complexity and to increase concurrency, and each part is implemented by a separate handler thread: the local request handler, the remote request/reply receiver, the remote request handler, the reply handler and the timeout handler.

Figure 3: Generic Interface of Memory Consistency Models.

For all locally generated memory consistency related events, corresponding requests are created and queued into the local request queue. This queue is monitored by the local request handler, which sends the requests to the corresponding remote systems and also eliminates duplicate requests for the same data made by two or more threads of the same process. The system can receive memory consistency requests and replies from remote systems; the remote request / reply receiver is responsible for receiving them and queuing them into either the remote request queue or the reply queue. Any system's memory consistency manager communicates with this master memory consistency manager. The remote request handler monitors the remote request queue; remote requests are processed by locating the corresponding process and calling the appropriate memory consistency model specific routines of that process, and, if necessary, replies are sent back to the requestor. The reply handler monitors the reply queue; for an incoming reply, it locates the corresponding request, applies the reply specific modifications in the required process by calling the memory consistency model specific function, and then wakes up all threads waiting on that request. Sometimes, due to network errors, replies or requests may be lost or delayed, in which case the requesting thread could block forever. To avoid this situation, the queue of requests awaiting replies is monitored by the timeout handler thread: whenever a request does not receive a reply within a fixed time period, that request is retransmitted.
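The loop below is a minimal sketch of such a timeout handler. The helper routines and the timeout constant are assumptions declared here only so the sketch is self-contained; they are not part of any published SDSM interface.

    /* Assumed kernel-internal helpers, declared for the sketch only. */
    struct pending_request {
        struct pending_request *next;
        unsigned long sent_at;              /* tick count when last transmitted */
    };
    extern unsigned long current_ticks(void);
    extern void sleep_ticks(unsigned long n);
    extern struct pending_request *pending_queue_head(void);
    extern void retransmit_request(struct pending_request *r);

    #define REPLY_TIMEOUT_TICKS 50UL        /* illustrative fixed time period */

    /* Periodically rescan requests awaiting replies and retransmit any that
     * have waited longer than the fixed timeout, as described above. */
    static void timeout_handler_loop(void)
    {
        for (;;) {
            sleep_ticks(REPLY_TIMEOUT_TICKS / 2);
            for (struct pending_request *r = pending_queue_head(); r != NULL; r = r->next) {
                if (current_ticks() - r->sent_at >= REPLY_TIMEOUT_TICKS) {
                    retransmit_request(r);
                    r->sent_at = current_ticks();
                }
            }
        }
    }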

3. Parallel Process Model

A given parallel program starts its execution on the system where it is initiated, known as the home machine for that process. The main thread of the process generally allocates and initializes the global data structures required for the computation and then spawns the necessary number of threads to carry out the computation in parallel. These child threads can be created on other systems for load balancing. This is the approach generally followed by many multithreaded parallel applications. An application might even choose to have all of the threads initialize the data structures individually, independently of each other, and threads need not be alive for the entire duration of the process: such threads can be created as and when required, and a thread may be terminated when its subtask is completed.

The overall parallel process memory layout is depicted in Figure 4. The diagram shows one parallel process being executed by a number of nodes. The address space is divided into code, data, heap, stack and thread private memory regions. A shaded portion of the memory space indicates the presence of that portion on that particular node; lightly shaded portions indicate read only regions, whereas darkly shaded portions indicate read / write regions. For example, the entire code segment and a small portion (the bottom) of the data segment are lightly shaded, indicating read only replicated regions. The selected memory consistency model decides how portions of memory are replicated, either as read only or as read/write portions. The stack segment specific to each thread can be accessed for reading as well as writing by the owner thread, whereas other threads can access it only for reading. Even though in the diagram the individual stack segments are shown as if they are allocated in a fixed manner, in practice, since threads are created dynamically, any stack segment can be allocated to any thread. Another part of memory, below the stack region, is used by the threads as thread private memory. This region is not shared with other threads. Usually thread private memory is allocated dynamically from this region using special library calls. A common use of such private memory is in threads that exchange information using communication operations: if threads want to communicate explicitly using send / receive calls without interfering with other threads' send / receive buffers, each thread can allocate its own private memory for these buffers, as the sketch below illustrates.
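A minimal sketch of this usage follows. The allocator names sdsm_private_alloc / sdsm_private_free are hypothetical, introduced only for this illustration (the paper mentions special library calls for private memory but does not name them), and comm_send refers to the hypothetical primitive sketched in Section 2.1.2.

    #include <stddef.h>
    #include <string.h>
    #include <sys/types.h>

    /* Hypothetical interfaces, for illustration only. */
    extern void   *sdsm_private_alloc(size_t size);
    extern void    sdsm_private_free(void *p);
    extern ssize_t comm_send(int dest_thread, int msg_type,
                             const void *buf, size_t len, int flags);

    #define MSG_SIZE 4096

    /* Each thread keeps its send buffer in thread private memory, so the
     * buffer is never shared with, or kept consistent for, other threads. */
    void send_partial_result(int dest_thread, const void *result, size_t len)
    {
        char *buf = sdsm_private_alloc(MSG_SIZE);

        if (buf == NULL || len > MSG_SIZE)
            return;
        memcpy(buf, result, len);
        comm_send(dest_thread, /* msg_type = */ 1, buf, len, 0);
        sdsm_private_free(buf);
    }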

4. Implementation

The proposed system has been implemented on a network of 30 Intel P-IV based systems, all connected by a 100 Mbps Ethernet LAN. The system has been implemented in C and assembly language and cross-compiled on a Linux machine using the GNU C compiler. It is developed as a standalone system. In the initial stages of development, the Bochs tool was used for testing the system. Bochs is an x86 PC emulator; it is a small, convenient tool that runs on many platforms such as Linux, Windows, Solaris etc., and is particularly useful for testing new operating systems.


Figure 4: Parallel Process Structure.


5. Results

Three sets of input programs were taken for the comparative study of the new system. The first set contains SPLASH-2 benchmark programs for shared memory systems, the second set contains Intel MPI Benchmark (IMB) programs and the third set contains NAS Parallel Benchmark programs for MPI. In addition to these programs, common parallel processing algorithms such as reduction, sieve (for finding prime numbers), matrix multiplication, merge sort, quick sort and the travelling salesperson problem are also included for comparison. Descriptions of the benchmark programs can be found in [20, 22, 23]. All of the programs have been studied using 8 nodes, except for BT and SP of NPB, for which 9 nodes are used. The comparative performance of the shared memory, NPB and IMB programs is shown in Tables 1, 2 and 3 respectively. Except for two cases, the new system performed better than the scope consistency based JIAJIA library on Linux systems for the shared memory programs; for these programs the percentage of improvement ranges from 3.7% (FMM of SPLASH-2) to 28.8% (merge sort). For the MPI programs, since the lightweight two-layer communication protocol is used, the performance is observed to improve considerably compared to the Linux based MPI implementation; the actual improvement ranges from 11% (SP of NPB) to 37.1% (Bcast of IMB). Hence these results show that the new DSM system model, along with the improved lazy release consistency model and the lightweight communication protocol, is really useful in developing parallel applications for networks of workstations.
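For reference, the improvement percentages reported in Tables 1-3 appear to follow the usual relative improvement formula (this is a reading of the tables, not a formula stated explicitly in the text):

    Improvement (%) = 100 x (T_baseline - T_SDSM) / T_baseline

For example, for Reduction in Table 1: 100 x (222.89 - 183.08) / 222.89 = 17.9%.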


Table 1: Performance of Shared Memory Programs (Common Parallel Algorithms and SPLASH-2 Benchmark Programs)

Application   Data set size   JIAJIA (ms)   SDSM (ms)   % Improvement
Reduction     8388608            222.89       183.08       17.9
Sieve         4194304            286.57       238.81       16.7
Matmult       512x512            835.82       652.74       21.9
Merge         5242880            491.78       350.25       28.8
Quicksort     5242880            791.24       813.93       -2.9
TSP           20                 224.56       195.02       13.2
Bucketsort    4194304          10053.73      8278.61       17.7
LU            512x512           1860.94      1633.67       12.2
Radixsort     2097152           1713.03      1545.95        9.8
Barnes        16384             1552.24      1600.00       -3.1
FMM           16384             2364.18      2276.62        3.7
Ocean-cont    258x258            286.57       246.77       13.9
Ocean-non     258x258            326.37       270.65       17.1
Water-nsq     512                398.01       294.53       26.0
Water-spat    512                350.25       262.69       25.0
FFT           65536               39.80        35.82       10.0

6. Conclusions and Extension

A new software based DSM system, the SDSM system, has been developed and its performance has been studied. In most cases the new system performed much better than the other systems considered. A system of this kind, which transparently supports multithreading over a network of workstations along with special message passing features, is essential for the effective use of multicore-based networks of workstations. At present the two-layer communication protocol supports only a single LAN; it should be extended to support much larger networks connected by bridges or routers. The main goal of the new system is to support existing multiprocessor based pthread programs or network based MPI programs without any modification to the source code, but for using ILRC, SDSM requires additional annotations. It is possible to develop a learning system that generates the required relations dynamically, without the need for annotations.

Table 2: Performance of NPB Programs

NPB Benchmark   Data set size   Linux/MPI (ms)   SDSM (ms)   % Improvement
BT              24x24x24            15665.51      12203.43       22.1
EP              67108864              493.37        433.18       12.2
FT              128x128x32            817.67        702.38       14.1
IS              1048576               412.58        316.04       23.4
LU              33x33x33            34099.34      25131.21       26.3
MG              128x128x128           891.38        720.24       19.2
SP              36x36x36            37486.97      33288.43       11.2

Table 3: Performance of IMB Programs (Data Set Size: 64KB, with 2 processes involved wherever pairs are required)

IMB Benchmark    Linux/MPI (ms)   SDSM (ms)   % Improvement
Pingpong              3.08           2.01         34.7
Pingping              3.05           1.93         36.7
Sendrecv              1.85           1.24         33.0
Exchange              4.81           3.51         27.0
Allreduce             1.76           1.39         21.0
Reduce                2.59           1.79         30.9
ReduceScatter         0.93           0.67         28.0
Allgather             1.93           1.53         20.7
Allgatherv            3.60           2.88         20.0
Gather                2.88           2.39         17.0
Gatherv               3.01           2.50         16.9
Scatter               2.90           2.18         24.8
Scatterv              2.94           2.30         21.8
Alltoall              2.23           1.92         13.9
Alltoallv             2.80           2.43         13.2
Bcast                 2.64           1.66         37.1
Barrier               0.11           0.08         27.3

References

[1] Ahuja, S., Carriero, N., Gelernter, D., Linda and Friends, IEEE Computer, Vol. 19, No. 8, May 1986, pp. 110-129.

[2] Andrew S. Tanenbaum, Distributed Operating Systems, Pearson Education, 2003.

[3] Bennett J.K., Carter J.K. and Zwaenepoel W., Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence, Proc. Second ACM Symp. on Principles and Practice of Parallel Programming, ACM, pp. 168-176, 1990.

[4] Bershad B.N., Zekauskas M.J. and Sawdon W.A., The Midway Distributed Shared Memory System, Proc. IEEE COMPCON Conf., IEEE, pp. 1-20, 1993.

[5] Delp G., Farber D., Minnich R., Memory as a Network Abstraction, IEEE Network, July 1991.

[6] Fleisch B., Popek G., Mirage: A Coherent Distributed Shared Memory Design, Proceedings of the 14th ACM Symposium on Operating System Principles, ACM, New York, 1989, pp. 211-223.

[7] Gustavson D., The Scalable Coherent Interface and Related Standards Projects, IEEE Micro, February 1992, pp. 10-22.

[8] Honghui Lu, OpenMP on Networks of Workstations, Ph.D. Thesis, Rice University, 2001.

[9] Jelica Protic, Milo Tomasevic, Veljko Milutinovic, A Survey of Distributed Shared Memory Systems, Proceedings of the 28th Annual Hawaii International Conference on System Sciences, 1995.

[10] Kendall Square Research Technical Summary, KSR Corporation, Waltham, Massachusetts, U.S.A., 1992.

[11] Lenoski, D., Laudon, J., et al., The Stanford DASH Multiprocessor, IEEE Computer, Vol. 25, No. 3, March 1992, pp. 63-79.

[12] Li, K. and Hudak, P., Memory Coherence in Shared Virtual Memory Systems, ACM Transactions on Computer Systems, Vol. 7, No. 4, November 1989, pp. 321-359.

[13] John B. Carter, Efficient Distributed Shared Memory Based on Multi-Protocol Release Consistency, Ph.D. Thesis, Rice University, 1994.

[14] Michael J. Quinn, Parallel Computing: Theory and Practice, Tata McGraw-Hill Publishers, 2002.

[15] OpenMP Architecture Review Board, OpenMP C and C++ Application Program Interface, http://www.openmp.org.

[16] Peter Keleher, Lazy Release Consistency for Distributed Shared Memory, Ph.D. Thesis, Rice University, 1995.

[17] Ramachandran U., Khalidi M.Y.A., An Implementation of Distributed Shared Memory, First Workshop on Experiences with Building Distributed and Multiprocessor Systems, USENIX Association, 1989, pp. 21-38.

[18] Robert C. Steinke, Gary J. Nutt, A Unified Theory of Shared Memory Consistency, Journal of the ACM, Vol. 51, No. 5, September 2004.

[19] Sarita V. Adve, Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial, IEEE Computer, December 1996.

[20] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, Anoop Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 24-36, June 1995.

[21] Zhou S., Stumm M., McInerney T., Extending Distributed Shared Memory to Heterogeneous Environments, Proceedings of the 10th International Conference on Distributed Computing Systems, May-June 1990, pp. 30-37.

[22] Intel MPI Benchmarks: User Guide and Methodology Description, Intel Corporation, Document Number: 320714-001.

[23] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan and S. Weeratunga, The NAS Parallel Benchmarks, RNR Technical Report RNR-94-007, March 1994.
