Fast and Flexible Persistence: The Magic Potion for Fault-Tolerance, Scalability and Performance in Online Data Stores

Pankaj Mehra and Sam Fineberg
Hewlett-Packard
{pankaj.mehra,fineberg}@hp.com

Abstract

This paper examines the architecture of computer systems designed to update, integrate and serve enterprise information. Our analysis of their scale, performance, availability and data integrity draws attention to the so-called 'storage gap.' We propose 'persistent memory' as the technology to bridge that gap. Its impact is demonstrated using practical business information processing scenarios. The paper concludes with a discussion of our prototype, the results achieved, and the challenges that lie ahead.

1 ODS Architecture

Servers are computer systems that can concurrently process a large number of requests from a large number of requestors. Organizations that own and operate servers – ISPs (Internet service providers) and business enterprises – are rarely able to provide all of the content and services to their customers from a single server. The complexity of serving depends upon the richness of content and services offered, as well as upon the number of clients supported. Hierarchy helps combat the curse of complexity: servers, storage systems, and routers are arranged in levels – or tiers – grouped by function, as well as by security attributes. Three tiers of servers are commonly deployed: the presentation tier, the application tier, and the data tier.

Servers run the applications that transform stored data into information and business objects, which are then either directly exposed to client applications as content, or exposed to other servers through programmatic interfaces as Web Services. The precise characterization of a server's workload depends upon its tier, the particular application mix it is running, the data behind it, as well as upon the nature of requests.

This paper focuses on servers in the data tier. At an abstract level, data tier servers function as platforms for database and fileserver software, and bridge an enterprise's storage subsystems to its application infrastructure. Requests originating either in the network or above application programming interfaces specify operations against stored data. A key characteristic of these operations is that they are encapsulated inside transactions, an abstraction that makes possible the vital ACID properties of atomicity, consistency, isolation and durability.

Online data stores (ODS) are transaction processing systems that, besides exhibiting the ACID properties, also possess the performance, scale and availability to integrate all of the transactionally manipulated information needed by large-scale, adaptive enterprises. For instance, ODS for telecommunication companies: support the insertion of tens of thousands of call-data records per second; simultaneously provide data to billing, marketing and fraud detection applications; exhibit near-zero outage seconds per year; neither lose transactions nor corrupt their data; and can readily scale to meet the needs of a rapidly growing customer base or newly acquired customers. In addition, they can seamlessly integrate data, transactionally or otherwise, from all relevant data sources, be it call center applications, accounting software or data sources external to the enterprise.

Similar examples of ODS can be found in retail, finance, supply-chain management, and other key segments of the information technology market.

This paper presents persistent memory, a technology that combines the durability of disks with the speed of memory through the use of memory-semantic system area networks. It is argued that persistent memory will enable the construction of high-performance, scalable and fault-tolerant ODS.

The rest of Section 1 defines the terminology, architecture and objectives of ODS design. Section 2 defines ODS workload characteristics and their implications. Section 3 introduces our persistent memory architecture. Section 4 summarizes our experiences from recent work in using persistent memory with ODS workloads. Section 5 concludes the paper.

0-7695-2132-0/04/$17.00 (C) 2004 IEEE

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)

1.1 Transaction processing

A transaction is a collection of operations on the physical and abstract application state, usually represented in a database. A transaction represents the execution of a transaction program. Operations include reading and writing of shared state. A transaction program specifies the operations that need to be applied against application state, including the order in which they must be applied and any concurrency controls that must be exercised in order for the transaction to execute correctly. The most common concurrency control operation is locking, whereby the process corresponding to the transaction program acquires either a shared or exclusive lock on the data it reads or writes.

A transaction processing system is a computer hardware and software system that supports concurrent execution of multiple transaction programs while ensuring that ACID properties are preserved.

Atomicity: Transactions exhibit an all-or-none behavior, in that a transaction either executes completely or not at all. A transaction that completes is said to have been committed; one that is abandoned during execution is said to have been aborted; one that has begun execution but has neither committed nor aborted is said to be in-flight.

Consistency: Successful completion of a transaction leaves the application state consistent vis-à-vis any constraints on database state.

Isolation: Also known as serializability, isolation guarantees that every correct concurrent execution of a stream of transactions corresponds to some total ordering of the transactions that constitute the stream. In that sense, with respect to an executed transaction, the effects of every other transaction in the stream are the same as if it had executed strictly before or strictly after it.

Strong serializability: The degree to which the execution of concurrent transactions is constrained creates different levels of isolation in transaction processing systems. ODS seek to provide the strongest forms of isolation, in which the updates made by a transaction are never lost and repeated read operations within a transaction produce the same result. Discussions of lesser levels of isolation, namely cursor stability and browse access, can be found in standard database texts.

Durability: Once a transaction has committed, its changes to application state survive failures affecting the transaction-processing system.
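To make the all-or-none and locking behavior above concrete, here is a minimal sketch of a toy transactional store (our own illustration, not the paper's system): writes are buffered under exclusive locks and become visible only at commit.

```python
import threading

class MiniStore:
    """Toy key-value store: buffered writes, exclusive locks, all-or-none commit."""

    def __init__(self):
        self.data = {}    # committed application state
        self.locks = {}   # key -> threading.Lock (exclusive locks only)

    def begin(self):
        # An in-flight transaction: buffered writes plus the locks it holds.
        return {"writes": {}, "held": {}}

    def write(self, txn, key, value):
        if key not in txn["held"]:
            lock = self.locks.setdefault(key, threading.Lock())
            lock.acquire()             # concurrency control: exclusive lock
            txn["held"][key] = lock
        txn["writes"][key] = value     # buffered; not yet visible to others

    def commit(self, txn):
        self.data.update(txn["writes"])  # all changes applied together
        self._release(txn)

    def abort(self, txn):
        self._release(txn)               # buffered writes discarded: none applied

    def _release(self, txn):
        for lock in txn["held"].values():
            lock.release()
        txn["held"].clear()
```

An aborted transaction leaves the committed state untouched; a committed one installs every buffered write, which is exactly the all-or-none behavior of the Atomicity definition.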

1.2 System architecture for transaction processing

A transaction processing system contains the following key components. The database writer mutates the data stored on data volumes on behalf of transactions. To ensure durability of those changes, it sends them off to a log writer, which records them on durable media in such a fashion that they can be undone or redone. This record of changes is called the database audit trail. It explicitly records the changes made to the database by each transaction, and implicitly records the serial order in which the transactions committed. Before a transaction can commit, the relevant portion of the audit trail must be flushed to durable media. The log writer coordinates its I/O operations with the transaction monitor, which keeps track of transactions as they enter and leave the system. It keeps track of the database writers mutating the database on behalf of each transaction, and ensures that the changes related to that transaction sent to the log writer by the database writers are flushed to permanent media before the transaction is committed. It also notates transaction states (e.g., commit or abort) in the audit trail.
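The flush-before-commit rule in this section can be sketched as a toy write-ahead audit trail (the file layout and record format are our own illustration):

```python
import json, os, tempfile

class AuditTrail:
    """Toy audit trail: every change is recorded, and the trail is forced to
    durable media before the commit is acknowledged."""

    def __init__(self, path):
        self.f = open(path, "a")

    def record(self, txn_id, change):
        # Explicit record of a change made on behalf of a transaction.
        self.f.write(json.dumps({"txn": txn_id, "change": change}) + "\n")

    def commit(self, txn_id):
        # Notate the transaction state, then flush the relevant portion of
        # the trail before the commit is acknowledged.
        self.f.write(json.dumps({"txn": txn_id, "state": "commit"}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # the long-latency step that PM shortens

# Demo: one change record followed by a commit record.
path = os.path.join(tempfile.mkdtemp(), "audit.log")
trail = AuditTrail(path)
trail.record(1, {"table": "calls", "row": 42})
trail.commit(1)
print(open(path).read().count("\n"))  # 2 records
```

The fsync() call is the disk round-trip that lands in the response time of every committed transaction; replacing it with a synchronous persistent-memory write is the optimization this paper targets.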

1.3 ODS capabilities of interest

Throughput: Data-tier server benchmarks measure request throughput, generally in units of database transactions per minute. Throughput in turn depends upon other, narrower capabilities of component subsystems, as well as upon the way a server has been architected from its component subsystems. Key transaction-processing throughput-related subsystem parameters are:
- processor-to-memory bandwidth, specified in gigabytes per second (GB/s);
- link bandwidth of inter-processor communication and I/O expansion networks, specified in gigabits per second (Gbps);
- bisection width, which is the number of links in the weakest cross-sections of those networks;
- input/output (I/O) operations per second (IOPS) for randomly accessed durable media; and
- access bandwidth (often measured in megabytes per second, MB/s) of sequentially accessed durable media.

Scalability: The ability of a server to handle ever greater request-processing loads can be assessed in the context of either a single server or a cluster of servers. The former case represents scaling up and the latter, scaling out. Servers that scale out support, out of their I/O subsystems, application-visible network interfaces into a high-bandwidth, low-latency, message-passing interconnection network, such as InfiniBand. That network supports a variety of clustering functions, including distributed-memory parallel processing, cluster file systems, and failover. This allows the construction of scalable and highly available request-processing solutions. Servers that scale up support, out of their processor-memory subsystems, application- and OS-invisible interfaces into a shared memory network. That network supports a cache coherence protocol (usually some variety of directory-based cache coherence) among the memory controllers associated with individual processors or processor groups. An additional IOEN (I/O expansion network), such as PCI Express or StarGen, is then used in scaled-up servers to provide a large number of I/O attachment points for the large number of processors.

On-line transaction processing throughput can then be scaled by partitioning the randomly accessed data across multiple data volumes (disk drives), with each disk bringing additional IOPS capability. Further throughput gains also accrue from the concurrent use of multiple disks for sequentially accessed data, which scales the total available disk bandwidth. Of course, making use of either of these techniques requires an IOEN or IN with sufficient bisection width, and scalable algorithms and protocols for concurrency control and partitioning.

Availability: Organizations that automate business-critical functions using servers seek to avoid or minimize service outage, periods during which requests cannot be serviced, whether due to faults, planned maintenance, or any other reason. Availability is often measured by the number of leading 9s in the ratio MTBF/(MTBF+MTTR). MTBF (mean time between failures) is the inverse of the rate at which components or subsystems fail; it is a reliability measure. MTTR (mean time to recovery) reflects the degree of component redundancy in the system, as well as the efficacy with which failed components can be detected and then either repaired in place or replaced by configuring in redundant components. Highly available servers supporting 5 or more 9s of availability provide fewer than 10 outage minutes per year. Designs for achieving 6 or 7 9s of availability are already in progress.

Critical transaction processing functions (database writer, log writer, and transaction monitor) are therefore often performed by process pairs, comprising a primary and a backup process [1]. Periodically, and always before externalizing state changes, the primary process checkpoints those changes to its backup. If a software failure hits the primary process, the backup process takes over instantly, without having to reconstruct its state.
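The nines arithmetic is easy to check; a short calculation (the availability figures come from the text, the minutes-per-year from the calendar):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def outage_minutes_per_year(availability):
    """Expected outage minutes per year, where availability is the ratio
    MTBF / (MTBF + MTTR)."""
    return (1.0 - availability) * MINUTES_PER_YEAR

# Five 9s: fewer than 10 outage minutes per year, as stated above.
print(round(outage_minutes_per_year(0.99999), 2))         # 5.26 minutes
# Seven 9s would allow only a few outage seconds per year.
print(round(outage_minutes_per_year(0.9999999) * 60, 2))  # 3.15 seconds
```

Since MTBF is hard to improve past component reliability limits, the ratio says that driving MTTR down (fast takeover by a backup process, fast recovery of state) is the practical lever for adding nines.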

Data Integrity: Business-critical servers, such as those that handle credit-card transactions, need to guard against accidental corruption of data due to software or hardware malfunction. Data integrity is characterized using metrics such as the probability of SDC (silent data corruption) in the stateful components of the server, namely memory and storage, and the probability of undetected bit errors in the various data paths, namely the busses and networks inside the server. The lower these probabilities, the greater the data integrity of the server.

The most common method of ensuring data integrity is the duplicate-and-compare (D&C) approach, in which the results of redundant computations, performed with identical data and in identical state, are compared. Failed comparisons indicate data corruption.
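Duplicate-and-compare is simple to state in code (an illustration in software; production servers duplicate in lockstep hardware):

```python
def duplicate_and_compare(compute, data):
    """Run the same computation twice on identical data in identical state
    and compare the results; a mismatch signals data corruption somewhere,
    without identifying which copy went wrong."""
    first = compute(data)
    second = compute(data)
    if first != second:
        raise RuntimeError("D&C mismatch: data corruption detected")
    return first

print(duplicate_and_compare(sum, [1, 2, 3]))  # 6: both copies agree
```

Note that D&C detects corruption but cannot by itself correct it; that takes a third copy (triple modular redundancy) or a retry from known-good state.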

2 ODS Workload

ODS workloads tend to be insert-heavy. The most important consequences of an insert-heavy workload are (a) a high volume of checkpoint traffic between process pairs; and (b) a high volume of audit trail information.

Often, key portions of an ODS workload stream will exhibit response-time-critical (RTC) behavior. This occurs whenever application dependencies force transaction response time into the denominator of the transaction throughput equation. For instance, certain applications wait to issue further transactions until the previously issued ones have committed, while demanding high throughput per application thread, or from a small number of threads.

Consider, for instance, the case of an ODS for stock exchanges. Streams of buy and sell orders arrive from brokerage systems and must be queued and matched to generate trades. As a throughput optimization technique, multiple trades are often boxcarred into a single database transaction. Regulatory dependencies push the response time of prior trade transactions into the issue path of subsequent transactions involving the same security, which leads to an RTC behavior known as the Hot Stock problem. On any given day, a few 'headline stocks' may account for a sizeable portion of the day's trading volume; however, the throughput on such stocks is inversely proportional to the response time of trade transactions. To complicate things further, even though the average throughput across securities in the same data partition improves with a greater degree of boxcarring, transaction response times lengthen, thereby limiting throughput on hotly traded securities.
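The Hot Stock trade-off can be made concrete with back-of-the-envelope arithmetic. This is a sketch under stated assumptions: the latency figures are invented, each boxcarred transaction must commit before the next is issued, and the regulatory ordering constraint lets a given hot security place at most one trade per boxcar.

```python
def throughputs(boxcar_size, fixed_commit_ms, per_trade_ms):
    """Trades/second for the whole partition vs. for a single hot security,
    when each boxcarred transaction must commit before the next is issued."""
    response_ms = fixed_commit_ms + boxcar_size * per_trade_ms
    partition_tps = boxcar_size * 1000.0 / response_ms  # amortization wins
    hot_stock_tps = 1000.0 / response_ms                # response time rules
    return partition_tps, hot_stock_tps

# Invented costs: 5 ms fixed commit (e.g., a disk flush), 0.5 ms per trade.
for n in (1, 10, 100):
    partition, hot = throughputs(n, 5.0, 0.5)
    print(n, round(partition, 1), round(hot, 1))
```

Average partition throughput rises with the boxcar size while hot-stock throughput falls, which is exactly the tension the text describes; shrinking the fixed commit cost (the persistent-memory proposal) relaxes it.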

Modern development frameworks for stateful applications, such as the Entity Beans of Enterprise JavaBeans, offer a variety of persistence mechanisms under the umbrella of object-relational mapping. In practice, with most of these architectures, the issue rate (and thereby the throughput) of a single application server thread is inversely related to the response time of database operations. Legacy batch-oriented transaction-processing applications likewise exhibit RTC-like behavior.

The long pole in completing and committing transactions is the action of making the effects of their data-manipulation operations durable. One key consequence of response-time criticality in insert-heavy ODS workloads is therefore the need to write audit trails with high throughput and low latency.

Inevitably, information is made durable either by writing to disks or by copying it around in a clustered system through checkpoints and multicasts. This is why the completion time of at least one (and typically more than one) disk I/O, and the round-trip times of several request-reply message operations, are included in the response time of every transaction that obeys the benchmark ACID properties of transaction processing systems. A fast and flexible alternative to disks, and to excessive data copying in lieu of going to disks, is highly desirable for achieving high performance on the aforementioned applications.


3 Persistent Memory

3.1 Nonvolatility

Persistent memory (PM) is memory that is durable without refresh; its contents survive the loss of system power. It additionally provides durable, self-consistent metadata in order to ensure continued access to data after power loss or soft failures. It is designed to exploit the high-speed durable media emerging in response to demands from the server and storage industry on the one hand, and the consumer device industry on the other.

There is an emerging market full of potential high-speed successors to flash memory. These principally include Magnetic RAM (MRAM) from Motorola [2], Cypress, NVE, IBM-Infineon and Toshiba, as well as OUM, TFE, FeRAM, McRAM and a host of other slower contenders. While the current 4 Mbit/chip MRAM capacities are woefully inadequate for enterprise needs, a number of battery-backed DRAM (BBDRAM) products fill the 'storage gap' that exists between RAM and disk, albeit at the cost of system complexity due to batteries and backup media.

3.2 First-level I/O attachment

Disk-based storage subsystems routinely incorporate BBDRAM as write caches, and various research projects are designing architectures to provide storage-semantic access to MRAM. However, the performance of such systems is hampered by (a) connectivity via second-level interconnects (such as ATA, SCSI and Fibre Channel); and (b) block-oriented access via storage-oriented protocols and device drivers. The handling of SCSI commands, DMA, interrupts and context switching results in hundreds of microseconds (usually milliseconds) of I/O latency.

Both BBDRAM and MRAM allow for direct connectivity to the CPU-memory subsystem [6]. Directly connected PM may be accessed from user programs like ordinary virtual memory, albeit at specially designated process virtual addresses, using the CPU's memory instructions (Load and Store). Our first-generation PM specifications ruled out this method of attachment, primarily because the memory falls in the same fault domain as the CPU. Moreover, the semantics of store instructions in microprocessors, and the associated compiler optimizations, can also play havoc with durability guarantees. Directly connected PM will be an attractive long-term option, especially as the protection and isolation features of shared memory interconnects and microprocessors evolve further, and as compiler support for persistence matures.

Our architecture instead specifies attaching PM to a first-level I/O interconnect supporting both memory mapping and memory-semantic access, as well as providing adequate fault isolation between the PM device and the CPU on the one hand, and between mirrored PM devices on the other. Examples of suitable interconnects include PCI Express, RDMA over IP, InfiniBand, and HP ServerNet.

3.3 Access architecture

We call this type of PM device an NPMU (Network Persistent Memory Unit). NPMUs are implemented as network resources accessed using RDMA read and write operations, or using equivalent semantics through an API.

While originally proposed to facilitate device-initiated RDMA, host-initiated memory-semantic access to network-attached resources is a uniquely attractive way to address high-speed non-volatile memory. It incurs only tens of microseconds of latency. Its byte-grained access and alignment eliminate the costly read-modify-write operations required by block devices. Coupled with the address translation capabilities of RNICs (RDMA-enabled Network Interface Cards), memory semantics also eliminate the costly marshalling and unmarshalling of pointer-rich data required by conventional storage.

Due to its long latency, disk storage is best accessed using asynchronous I/O techniques; persistent memory, on the other hand, is fast enough to support synchronous interfaces, especially for short access sizes.

In the near term, persistent memory will be accessed using synchronous variants of RDMA read and write interfaces similar to those described in the VI Architecture specification [3], the InfiniBand Verbs layer [4], or the ICS Consortium's IT API [5]. In the future, persistent memory has the potential to bring storage truly under the memory hierarchy. We will briefly speculate about those possibilities in the final section of the paper.
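A client-side library along these lines might expose synchronous, byte-grained reads and writes against an opened region. The sketch below is hypothetical (the class and method names are ours, not the paper's API); a local bytearray stands in for NPMU memory reached by host-initiated RDMA:

```python
class PMRegion:
    """Hypothetical synchronous PM access interface. A real implementation
    would issue host-initiated RDMA reads and writes to an NPMU over an
    RDMA-capable interconnect; a bytearray stands in for the remote
    durable memory here."""

    def __init__(self, size):
        self._mem = bytearray(size)  # stand-in for NPMU physical memory

    def write(self, offset, data):
        # Byte-grained and alignment-free: no read-modify-write of a
        # containing block, unlike a block storage device. A synchronous
        # call would return only once the data is durable at the NPMU.
        self._mem[offset:offset + len(data)] = data

    def read(self, offset, length):
        return bytes(self._mem[offset:offset + length])

region = PMRegion(4096)
region.write(100, b"txn-state")  # e.g., a transaction control block field
print(region.read(100, 9))       # b'txn-state'
```

The synchronous style is what the latency numbers justify: at tens of microseconds per access there is no need for the callback or completion-queue machinery that disk-era asynchronous I/O requires.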

3.4 Benefits of Persistent Memory

We now examine how ODS stand to benefit from the characteristics of PM, such as low latency, byte grain, synchronous interfaces, and the low overhead of making pointer-rich data durable.

Low latency of making information durable: Clearly, the throughput of a transaction processing system on RTC workloads will benefit from the use of a low-latency alternative to disk drives. The ability to flush changes to persistent memory instead of disk drives will allow these systems to commit transactions rapidly without sacrificing strong serializability.

Efficient data movement between address spaces: By extending the standard memory-address translation logic of network interfaces with its unique metadata handling capabilities, persistent memory greatly increases the efficiency with which richly connected data structures can be copied between address spaces. Like standard remote direct memory access [5], and like the one-sided memory operations of the Message Passing Interface specification 2.0 [6], persistent memory semantics decouple data movement from synchronization. Second, persistent memory supports a variety of hardware-assisted pointer-fixing schemes, including bulk write-selective read and incremental update-bulk read. Marshalling and unmarshalling of data structures, whether for checkpointing between process pairs or for the purpose of saving on durable media, can be drastically reduced or eliminated. This allows ODS data structures, such as database indices, lock tables and transaction control blocks, to be efficiently stored to durable media.

Enablement of fine-grained persistence: Since PM is fast and flexible, it enables applications to persist data that would have been too cumbersome and too expensive to persist with the traditional I/O programming model. Just as easily, PM also supports transactional updating of persistent stores, with an access architecture not dissimilar to the mmap() and msync() primitives of memory-mapped files. The difference is in the performance, because PM accesses are synchronous and can go directly from user-mode software to hardware RNICs. A more important difference is that PM semantics offer a road to a durable information store that is completely integrated into the memory hierarchy. In any case, being able to update indices, lock tables and transaction control blocks at a fine grain reduces uncertainty regarding the state of the database, and eliminates costly heuristic searching of audit trail information, leading to shorter MTTR, which is the mantra for both better availability and data integrity.
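The mmap()/msync() analogy can be seen with ordinary memory-mapped files (standard POSIX-style primitives, not the PM hardware path; with PM the flush step would take tens of microseconds and go from user mode straight to the RNIC):

```python
import mmap, os, tempfile

# A small file stands in for a PM region.
path = os.path.join(tempfile.mkdtemp(), "region")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 4096)
    m[64:75] = b"lock-table!"  # fine-grained, in-place update
    m.flush()                  # msync(): force the update to durable media
    m.close()

with open(path, "rb") as f:
    f.seek(64)
    print(f.read(11))          # b'lock-table!'
```

The update touches only the bytes it changes; no block-sized read-modify-write and no serialization of the surrounding structure is needed, which is the fine-grained persistence argument above.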

Toward a persistence architecture: Through its unique use of RDMA-enabled interconnects to address devices using host-initiated RDMA, and through its enablement of byte-grain access, PM closes the storage gap and provides a durable address space with the same speed and flexibility as remote directly accessed memory. For ODS that support enterprise information integration using EJB entity beans or other stateful web services mechanisms, this starts to take away some of the pain of using the most advanced persistence mechanisms, such as container-managed persistence. Moreover, by unifying remote-memory-based and disk-based performance mechanisms, PM allows ODS programmers to focus on the specification, rather than the implementation, of persistence. For instance, on insert-heavy ODS workloads, newly inserted rows or row-sets would be made persistent once, when they enter the database writer, by synchronously writing to the NPMU. This would then eliminate the repeated, wasteful and uncoordinated persistence actions taken by ODS components as the inserted data pass first from the database writer primary to its backup, then as an audit 'delta' from the database writer to the log writer, then again from the log writer to its backup, from the database writer to data volumes, and from the log writer to log volumes.

4 Experiences with a persistent memory prototype

HP is currently developing NPMUs for its NonStop Servers. These servers run the NonStop Kernel (NSK) operating system. NonStop servers consist of up to 256 "nodes," each a cluster of up to 16 MIPS processors and associated I/O devices. The processors and I/O devices are interconnected by a redundant ServerNet network. While each node presents a single system image, the system does not support shared memory. Instead, processes communicate using messages over the ServerNet network. I/O adapters also attach into the ServerNet network, and are managed across the network by 'driver' processes running on NSK processors. This separation between device manager and devices guarantees that devices can continue to function even if the controlling processor fails.

ServerNet is a true RDMA network, supporting both RDMA read and write operations in hardware. ServerNet's software latency is between 10 and 20 microseconds, depending on the generation of ServerNet technology utilized. Its address space structure is that of a single-address-space RDMA network, in that each ServerNet end point presents a 32-bit network virtual memory address space to initiators on the network.

Critical NonStop system functions are implemented using process pairs. In the event of failure, the fault detection and message re-routing capabilities of NSK and the NonStop hardware allow a backup process to take over from its primary process in a second or less. This allows NSK to survive failures without any loss of committed data and with virtually no loss of availability.

4.1 Architecture of the NonStop PM

Persistent memory technology is deployed on NonStop servers in three pieces. First are the NPMUs, which are additional devices attached to ServerNet. Second comes the client-side PM access library and its supporting NSK components. Finally, there is the PM Manager, which manages the NPMUs.

The NPMUs look a lot like traditional NSK I/O

adapters, but with the addition of non-volatile RAM.

However, the usage model of NPMUs is quite different

from that of storage I/O adapters. I/O adapters traditionally

act as RDMA initiators; i.e., when NSK wishes to read a

block from disk, it sends a request to a SCSI adapter, the

CPU in the SCSI adapter initiates a read from disk into

adapter RAM, and when the disk operation is complete,

the CPU in the adapter initiates an RDMA write to send

the data to the NSK CPU. This makes sense because the

NSK CPU can not directly interact with the disk since it is

on the other side of a SCSI interface. However, this model

inherently adds latency since a (potentially slow) CPU in

the adapter must be involved in data transfer. For PM,

however, there is no second-level device interaction

needed. Thus, it is possible for PM clients to directly read

and write memory in an NPMU, without any involvement

by a CPU in the NPMU. This inherently provides lower

latency than is possible with the typical I/O adapter usage

model.
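The latency advantage of removing the adapter CPU from the data path can be seen with a back-of-the-envelope model. The sketch below is purely illustrative: only the roughly 10-microsecond ServerNet software latency comes from the text, while the adapter-CPU and device figures are invented placeholders.

```python
# Back-of-the-envelope latency model for the two access paths.
# Only RDMA_US is grounded in the text (ServerNet software latency of
# 10-20 microseconds); ADAPTER_CPU_US and DEVICE_US are invented
# placeholders for illustration.
RDMA_US = 10          # one ServerNet RDMA software latency
ADAPTER_CPU_US = 50   # hypothetical adapter-CPU processing time
DEVICE_US = 5000      # hypothetical disk/device operation time

def adapter_path():
    """Traditional I/O adapter path: request to the adapter, device
    operation, adapter-CPU processing, then an RDMA write back to
    the host CPU."""
    return RDMA_US + DEVICE_US + ADAPTER_CPU_US + RDMA_US

def direct_pm_path():
    """PM path: the client RDMAs directly into NPMU memory; no adapter
    CPU or second-level device is involved."""
    return RDMA_US
```

Whatever the exact numbers, the direct path avoids both the device operation and the adapter-CPU involvement entirely.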

To allow memory-like client access to PM, while still

providing data persistence, the NPMU must be managed

like a storage device. Therefore, our architecture uses a

Persistent Memory Manager (PMM) process pair for all

management functions. The PMM is a normal NSK

0-7695-2132-0/04/$17.00 (C) 2004 IEEE

Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)

system process pair, and like other critical system services,

designed to transparently fail over from its primary to its

backup in the event of failure. Each PMM pair controls a

mirrored pair of NPMUs. (We use mirroring to survive

NPMU failures without data loss.) The PMM is charged

with managing access to the NPMUs, as well as with

managing their metadata. The metadata must be kept

consistent at all times in order to facilitate recovery should

the system fail. The metadata essentially consist of

information describing allocated portions of persistent

memory (e.g., owner, access rights, physical location in

PM, etc).
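As a rough sketch, the PMM's metadata can be pictured as a table of region descriptors per volume. The structure and field names below are illustrative assumptions, not the actual NSK data layout:

```python
from dataclasses import dataclass, field

@dataclass
class RegionDescriptor:
    """Hypothetical metadata record for one allocated PM region."""
    name: str
    owner: str            # process that created the region
    allowed_cpus: set     # limited access control: CPUs allowed access
    phys_offset: int      # physical location within the NPMU
    length: int

@dataclass
class VolumeMetadata:
    """Per-volume metadata; the PMM must keep this consistent at all
    times so that recovery is possible after a failure."""
    regions: dict = field(default_factory=dict)
    next_free: int = 0    # next unallocated NPMU physical offset

    def create_region(self, name, owner, length, allowed_cpus):
        # Allocate the next contiguous range of NPMU physical memory.
        desc = RegionDescriptor(name, owner, set(allowed_cpus),
                                self.next_free, length)
        self.next_free += length
        self.regions[name] = desc
        return desc
```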

The final component of our persistent memory

system is the client access library. Clients access PM

volumes, which are roughly analogous to disk volumes. Each volume is a single logical entity consisting of a

mirrored pair of NPMUs controlled by a single PMM

process pair. Each volume contains some number of PM

regions, which are the PM analog to files. Regions are

created by the PMM in response to “create” messages sent

from the client API to the PMM process. Once regions

have been created, they may be opened by one or more

clients.

When a region is “open,” the PMM maps a

contiguous range of the NPMU's network virtual addresses to

its physical memory. This mapping exists in the address

translation hardware of the NPMU’s ServerNet interface.

It not only specifies address translation but also enforces a

limited form of access control, allowing the PMM to

specify which CPUs have access to a specific range of

the NPMU's network virtual addresses.
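The mapping established at open time can be sketched as a small translation table with per-CPU access checks. The class and method names below are hypothetical stand-ins for the ServerNet interface's address-translation hardware:

```python
class ServerNetATU:
    """Sketch of the NPMU NIC's address-translation table.
    Names and structure are illustrative, not the real hardware."""
    def __init__(self):
        # Each entry: (net_va_base, phys_base, length, allowed_cpus)
        self.mappings = []

    def map_region(self, net_va_base, phys_base, length, allowed_cpus):
        # Map a contiguous range of network virtual addresses to the
        # region's physical memory, recording which CPUs may use it.
        self.mappings.append((net_va_base, phys_base, length,
                              set(allowed_cpus)))

    def translate(self, net_va, cpu):
        """Translate a network virtual address for an incoming RDMA
        operation, enforcing the limited per-CPU access control."""
        for base, phys, length, cpus in self.mappings:
            if base <= net_va < base + length:
                if cpu not in cpus:
                    raise PermissionError("CPU not permitted")
                return phys + (net_va - base)
        raise ValueError("unmapped network virtual address")
```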

Once a PM region has been opened by the PMM,

clients can perform RDMA read and write operations

directly to the NPMU memory comprising that region.

Therefore, the client API performs ServerNet RDMA read

or write operations directly to the NPMU device at one of

the NPMU virtual addresses within the allocated range. To

preserve data integrity the API writes data to both the

primary and mirror NPMUs; reads need not be replicated.

API operations are typically synchronous. This is acceptable because they are expected to be small and fast, and it simplifies the persistence semantics: when a call returns successfully, the data is persistent; otherwise the call returns an error. Note that this can be very efficient because ServerNet packets are automatically acknowledged in hardware, and when a ServerNet transfer

completes without error, the packet is guaranteed to have

arrived in the remote NIC with a correct CRC.
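The resulting write discipline, synchronous and mirrored, can be sketched as follows. The NPMU class and its rdma_write/rdma_read methods are stand-ins for the real ServerNet RDMA operations:

```python
class NPMU:
    """Stand-in for an NPMU: a flat byte array addressed by network
    virtual address. Real writes are ServerNet RDMA operations,
    acknowledged in hardware."""
    def __init__(self, size):
        self.mem = bytearray(size)

    def rdma_write(self, addr, data):
        self.mem[addr:addr + len(data)] = data

    def rdma_read(self, addr, length):
        return bytes(self.mem[addr:addr + length])

def pm_write(primary, mirror, addr, data):
    """Synchronous API write: data goes to BOTH NPMUs of the mirrored
    pair before the call returns, so a successful return means the
    data is persistent on both copies."""
    primary.rdma_write(addr, data)
    mirror.rdma_write(addr, data)

def pm_read(primary, mirror, addr, length):
    """Reads need not be replicated; one copy suffices."""
    return primary.rdma_read(addr, length)
```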

4.2 NonStop PM Prototype

To obtain performance results before we had

hardware NPMUs, we created a prototype NPMU using an

NSK process to mimic the device. We call this prototype

NPMU a Persistent Memory Process (PMP). A PMP

allocates a large region of memory and exposes that

memory to ServerNet reads and writes in a manner similar

to a hardware NPMU. This gives the PMP all of the

performance characteristics of a hardware NPMU except

for the non-volatility. (We have since verified this claim,

and have found that a true hardware NPMU is actually

slightly faster than the PMPs used in the experiments

reported below.)

All other aspects of our prototype were designed to

be the same as for a hardware NPMU. So we implemented

a PMM and client library using ServerNet RDMA

read/write operations to access the PMP’s memory

directly.

To test the utility of persistent memory, we modified

NSK’s audit data process (ADP). ADP is the process pair

in NSK that functions as a log writer. The ADP

coordinates its actions with the NSK disk process (DP2)

and with the NonStop Transaction Management Facility (TMF), in order to provide transactions with ACID properties. Our

modified ADP synchronously writes database log data to

persistent memory. Therefore, the database log is

persistent immediately, and transactions can commit faster

than if the log data had to be flushed to disk at commit

time. For scaling audit throughput, multiple ADPs can be

configured per node.
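The change to the commit path can be caricatured as follows. The function names commit_with_disk and commit_with_pm are our own illustrative labels, and the real ADP is of course far more involved:

```python
def commit_with_disk(log_records, disk_flush):
    """Standard ADP (caricature): commit must wait for the log to be
    flushed to disk, so small commits pay the full flush latency
    unless they are boxcarred together."""
    disk_flush(log_records)
    return "committed"

def commit_with_pm(log_records, pm_write):
    """Modified ADP (caricature): each synchronous PM write returns
    only once the record is persistent, so the commit needs no disk
    flush at all."""
    for rec in log_records:
        pm_write(rec)
    return "committed"
```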

4.3 Description of the benchmark

To test the effect of PM on transaction processing

performance, we used a “hot-stock” benchmark developed

by Paul Denzinger [7]. This test consists of up to 4 driver

processes. Each driver represents a single hotly-traded

stock. The drivers each insert 32000 4K records. The

database consists of 4 files, each distributed across 4 disk

volumes (a total of 16 disk volumes were used). During

each transaction each driver performs a number of

asynchronous inserts into each file. The transactions are

committed between subsequent iterations to simulate the

regulatory ordering constraints.
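A driver's inner loop, as described above, can be sketched roughly as follows; names and structure are illustrative simplifications of Denzinger's benchmark [7]:

```python
def hot_stock_driver(num_txns, inserts_per_txn, files, commit):
    """Sketch of one hot-stock driver process. Each transaction
    performs asynchronous inserts spread across the files, then
    commits between iterations (modeling the regulatory ordering
    constraint)."""
    record = b"x" * 4096                      # one 4K record
    per_file = inserts_per_txn // len(files)  # e.g., 8 inserts / 4 files
    for _ in range(num_txns):
        for f in files:
            for _ in range(per_file):
                f.append(record)              # stand-in for an async insert
        commit()                              # durable point per iteration
```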

We ran the hot-stock benchmark on a 4-processor

NSK S86000 system with varying amounts of boxcarring

and measured the total amount of time elapsed both with

and without PM-enabled ADPs. Because this benchmark

generated a large amount of audit, we used 4 auxiliary

audit volumes, one for each CPU. For the PM-enabled

experiments we ran a PMP on a 5th CPU, and each ADP used a separate region of the PMP's memory.

4.4 Results obtained

We measured performance for the following

transaction sizes:

128K – 32 4-Kbyte inserts per transaction
64K – 16 4-Kbyte inserts per transaction
32K – 8 4-Kbyte inserts per transaction

We also ran the experiments with varying numbers of driver processes. Although we measured performance with 3 or 4 drivers, more than two hot stocks trading at the same time is rare in practice.

Figure 1 shows speedup in response time vs.

transaction size (i.e., degree of boxcarring). Figure 2

shows the total elapsed time vs. the transaction size.


4.5 Analysis of Results

Figure 1 demonstrates the improvement in response

time with persistent memory. Response time was up to 3.5

times better with a PM-enabled ADP. The benefit of PM

was greatest with the more common 1-2 hot-stock case,

though there was improvement even with 3 or 4 hot

stocks.

The effect of PM is even more dramatically shown in

Figure 2. Here we plot total time of the benchmark vs. the

degree of boxcarring. Since the number of transactions

was fixed, the throughput is inversely proportional to the

total execution time. This graph shows that the throughput

with large boxcar sizes is acceptable for the standard ADP, but as

the amount of boxcarring decreases, throughput drops off

sharply. For a PM-enabled ADP, the throughput is

virtually unaffected by the amount of boxcarring. So,

applications do not need to artificially combine operations

in order to maintain throughput.
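Because the number of transactions is fixed, throughput falls directly out of elapsed time. For instance, at 8 inserts per transaction each driver issues 32000/8 = 4000 transactions, so halving the elapsed time doubles the throughput:

```python
INSERTS_PER_DRIVER = 32000   # from the benchmark description

def num_txns(inserts_per_txn):
    """Transactions per driver at a given boxcar size."""
    return INSERTS_PER_DRIVER // inserts_per_txn

def throughput(inserts_per_txn, elapsed_secs):
    # The transaction count is fixed, so throughput is inversely
    # proportional to total execution time.
    return num_txns(inserts_per_txn) / elapsed_secs
```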

5 Conclusions and future work

In the near future, fast and flexible persistence will

enable high-performance, scalable ODS with very low

MTTR. In the long term, PM has the potential to alter how

durability is specified and optimized. We have highlighted

two interesting possibilities below, but the technology we

have described here has the potential to vastly change how

servers work, for the better!

5.1 Memory-mapped persistent memory

In Section 3.2, we mentioned that direct-connected

PM is a long-term option. The access path for such

memory is entirely hardware-based. Correct

implementation requires the compilers to optimize load

and store instructions differently, and the microprocessors

to not complete stores against certain addresses in store

buffers or on-chip caches. It is an open question whether

overloading existing language mechanisms, such as the C language's volatile qualifier, will suffice. There

is also the issue of durable metadata, and possible

automation of metadata generation, for retrieving

persistent, richly connected data structures into an address space different from the one from which they were stored.
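Ordinary memory-mapped files exhibit a milder form of the same problem and make the ordering issue concrete: a store instruction completes long before the data is durable, so an explicit flush marks the durability point. The Python sketch below uses a temporary file as a stand-in for direct-mapped PM; in real direct-connected PM the analogous control would sit in the store buffers and caches themselves.

```python
import mmap
import os
import tempfile

# A 4 KB memory-mapped file stands in for a direct-mapped PM region.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
region = mmap.mmap(fd, 4096)

region[0:5] = b"hello"   # an ordinary store: fast, but NOT yet durable
region.flush()           # explicit durability point, analogous to forcing
                         # stores out of store buffers and on-chip caches

region.close()
os.close(fd)
```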

5.2 Compiler-assisted I/O

A real opportunity exists to bring the power of

compilers to bear on optimization of I/O. The

programming languages and compilers in commercial use

today do not have good abstractions for persistence.

Storage I/O is largely optimized by hand, and is

considered one of the core skills of server-side application

programmers. (Networking is the other; but that is the

topic for another paper!) Any hardware innovation that

attempts to bridge the storage gap, inevitably runs into the

challenge of modifying the major body of data tier

applications to take advantage of the new technology.

Higher-level persistence models, such as J2EE’s, allow

persistence to be specified at a high level and to be

optimized automatically.

Acknowledgements

The authors would like to thank Gary Hall and Jim Mills

for helping run the benchmark workloads; Bob Taylor and

Paul Denzinger, for introducing us to the Hot Stock

problem; and Gary Smith, for modifying NonStop ADP so

we could study the impact of PM on a real ODS. Karel

Youssefi and Barbara Tabaj helped us in procuring and

configuring the benchmark system.

[Figure 1: PM improves response time drastically. Plots response-time speedup with PM (0 to 4x) vs. transaction size (32K, 64K, 128K; larger size = more boxcarring), for 1 to 4 drivers.]

[Figure 2: PM eliminates the need to boxcar. Plots elapsed time (0 to 160 secs) vs. transaction size (32K, 64K, 128K), for 1 and 2 drivers, with and without PM.]


REFERENCES

1. Gray, Jim, "Why Do Computers Stop and What Can Be Done About It?", Tandem Technical Report TR-85.7, Tandem Computers Inc., June 1985. (http://www.hpl.hp.com/techreports/tandem/TR-85.7.html)
2. Tehrani, S., et al., "Progress and Outlook for MRAM Technology," IEEE Transactions on Magnetics, Vol. 35, No. 5, September 1999.
3. Compaq/Intel/Microsoft, Virtual Interface Architecture Specification, Version 1.0, December 1997.
4. InfiniBand Trade Association, InfiniBand Specification, Version 1.0a, June 2001. (http://www.infinibandta.org/home)
5. Hilland, Jeff, et al., RDMA Protocol Verbs Specification, Version 1.0, RDMA Consortium, April 2003. (http://www.rdmaconsortium.org)
6. MPI-2: Extensions to the Message-Passing Interface, University of Tennessee, Knoxville, TN, 1997. (http://www.mpi-forum.org)
7. Denzinger, P. and Taylor, B., "Technical Challenges in SuperMontage," Proc. HP TechCon 2003, Boulder, CO, Hewlett-Packard, April 2003.
