Fast and Flexible Persistence: The Magic Potion
for Fault-Tolerance, Scalability and Performance
in Online Data Stores
Pankaj Mehra and Sam Fineberg
Hewlett-Packard {pankaj.mehra,fineberg}@hp.com
Abstract This paper examines the architecture of computer
systems designed to update, integrate and serve enterprise
information. Our analysis of their scale, performance,
availability and data integrity draws attention to the so-
called 'storage gap.' We propose 'persistent memory' as
the technology to bridge that gap. Its impact is demonstrated using practical business information
processing scenarios. The paper concludes with a
discussion of our prototype, the results achieved, and the
challenges that lie ahead.
1 ODS Architecture
Servers are computer systems that can concurrently
process a large number of requests from a large number of
requestors. Organizations that own and operate servers –
ISPs (Internet service providers) and business enterprises –
are rarely able to provide all of the content and services to
their customers from a single server. The complexity of
serving depends upon the richness of content and services
offered, as well as upon the number of clients supported.
Hierarchy helps combat the curse of complexity: Servers,
storage systems, and routers are arranged in levels – or
tiers – grouped by function, as well as by security
attributes. Three tiers of servers are commonly deployed:
the presentation tier, the application tier, and the data tier.
Servers run the applications that transform stored
data into information and business objects, which are then
either directly exposed to client applications as content, or
exposed to other servers through programmatic interfaces
as Web Services. The precise characterization of a server’s
workload depends upon its tier, the particular application
mix it is running, the data behind it, as well as upon the
nature of requests.
This paper focuses on servers in the data tier. At an
abstract level, data tier servers function as platforms for
database and fileserver software, and bridge an
enterprise’s storage subsystems to its application
infrastructure. Requests originating either in the network
or above application programming interfaces specify
operations against stored data. A key characteristic of
these operations is that they are encapsulated inside
transactions, an abstraction that makes possible the vital
ACID properties of atomicity, consistency, isolation and
durability.
Online data stores (ODS) are transaction processing
systems that, besides exhibiting the ACID properties, also
possess the performance, scale and availability to integrate
all of the transactionally manipulated information needed
by large-scale, adaptive enterprises. For instance, ODS for
telecommunication companies: support the insertion of
tens of thousands of call-data records per second;
simultaneously provide data to billing, marketing and
fraud detection applications; exhibit near-zero outage
seconds per year; neither lose transactions nor corrupt
their data; and can readily scale to meet the needs of a
rapidly growing customer base or newly acquired
customers. In addition, they can seamlessly integrate data,
transactionally or otherwise, from all relevant data
sources, be it call center applications, accounting software
or data sources external to the enterprise.
Similar examples of ODS can be found in retail,
finance, supply-chain management, and other key
segments of the information technology market.
This paper presents persistent memory, a
technology that combines the durability of disks with the
speed of memory through the use of memory-semantic
system area networks. It is argued that persistent memory
will enable the construction of high-performance, scalable
and fault-tolerant ODS.
The rest of Section 1 defines the terminology,
architecture and objectives of ODS design. Section 2
defines ODS workload characteristics and their
implications. Section 3 introduces our persistent memory
architecture. Section 4 summarizes our experiences from
recent work in using persistent memory with ODS
workloads. Section 5 concludes the paper.
0-7695-2132-0/04/$17.00 (C) 2004 IEEE
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)
1.1 Transaction processing
A transaction is a collection of operations on the
physical and abstract application state, usually represented
in a database. A transaction represents the execution of a
transaction program. Operations include reading and
writing of shared state. A transaction program specifies the
operations that need to be applied against application state,
including the order in which they must be applied and any
concurrency controls that must be exercised in order for
the transaction to execute correctly. The most common
concurrency control operation is locking, whereby the
process corresponding to the transaction program acquires
either a shared or exclusive lock on the data it reads or
writes.
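The shared/exclusive locking discipline just described can be sketched as a toy lock manager. The class and method names below are illustrative, not drawn from any real database system; a production lock manager would also queue waiters and detect deadlocks.

```python
class LockManager:
    """Toy lock manager: shared locks coexist; exclusive locks do not."""
    def __init__(self):
        self.locks = {}  # item -> (mode, set of holding transactions)

    def acquire(self, txn, item, mode):
        """Grant the lock if compatible, else return False (caller would wait)."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        # Shared locks are compatible only with other shared locks.
        if mode == "shared" and held_mode == "shared":
            holders.add(txn)
            return True
        # Re-acquisition (or upgrade) by the sole holder is allowed.
        if holders == {txn}:
            self.locks[item] = (mode if mode == "exclusive" else held_mode, holders)
            return True
        return False

    def release(self, txn, item):
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

lm = LockManager()
assert lm.acquire("T1", "row42", "shared")
assert lm.acquire("T2", "row42", "shared")         # readers coexist
assert not lm.acquire("T3", "row42", "exclusive")  # writer must wait
lm.release("T1", "row42")
lm.release("T2", "row42")
assert lm.acquire("T3", "row42", "exclusive")
```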
A transaction processing system is a computer
hardware and software system that supports concurrent
execution of multiple transaction programs while ensuring
that ACID properties are preserved.
Atomicity: Transactions exhibit an all-or-none behavior, in
that a transaction either executes completely or not at
all. A transaction that completes is said to have been
committed; one that is abandoned during execution is
said to have been aborted; one that has begun
execution but has neither committed nor aborted is
said to be in-flight.
Consistency: Successful completion of a transaction leaves
the application state consistent vis-à-vis any
constraints on database state.
Isolation: Also known as serializability, isolation
guarantees that every correct concurrent execution of
a stream of transactions corresponds to some total
ordering on the transactions that constitute the stream.
In that sense, with respect to an executed transaction,
the effects of every other transaction in the stream are
the same as either having executed strictly before or
strictly after it.
Strong serializability: The degree to which the
execution of concurrent transactions is
constrained creates different levels of isolation in
transaction processing systems. ODS seek to
provide the strongest forms of isolation, in which
the updates made by a transaction are never lost
and repeated read operations within a transaction
produce the same result. The discussion of lesser
levels of isolation, namely cursor stability and
browse access, can be found in standard database
texts.
Durability: Once a transaction has committed, its changes
to application state survive failures affecting the
transaction-processing system.
1.2 System architecture for transaction
processing
A transaction processing system contains the
following key components. The database writer mutates
the data stored on data volumes on behalf of transactions.
To ensure durability of those changes, it sends them off to
a log writer, which records them on durable media in a
fashion that they can be undone or redone. This record of
changes is called the database audit trail. It explicitly
records the changes made to the database by each
transaction, and implicitly records the serial order in which
the transactions committed. Before a transaction can
commit, the relevant portion of the audit trail must be
flushed to durable media. The log writer coordinates its
I/O operations with the transaction monitor, which keeps
track of transactions as they enter and leave the system. It
keeps track of the database writers mutating the database
on behalf of each transaction, and ensures that the changes
related to that transaction sent to the log writer by the
database writers are flushed to permanent media before the
transaction is committed. It also notates transaction states
(e.g., commit or abort) in the audit trail.
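The interplay of database writer, log writer and transaction monitor can be illustrated with a minimal write-ahead-logging sketch. The record format and APIs below are invented for illustration; a real audit trail is far more elaborate.

```python
import json, os, tempfile

class LogWriter:
    """Minimal audit-trail writer: append change records, then force them
    to durable media before a transaction may commit."""
    def __init__(self, path):
        self.f = open(path, "a")

    def append(self, txn_id, change):
        # Each record carries enough information to redo (or undo) the change.
        self.f.write(json.dumps({"txn": txn_id, "change": change}) + "\n")

    def flush(self):
        self.f.flush()
        os.fsync(self.f.fileno())  # the durability point

def commit(log, txn_id, changes):
    for c in changes:
        log.append(txn_id, c)
    log.append(txn_id, {"state": "commit"})  # transaction state notation
    log.flush()  # only after this may the commit be acknowledged

path = tempfile.mkstemp()[1]
log = LogWriter(path)
commit(log, "T1", [{"row": 42, "old": None, "new": "order#7"}])
with open(path) as f:
    records = [json.loads(line) for line in f]
assert records[-1]["change"] == {"state": "commit"}
```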
1.3 ODS capabilities of interest
Throughput: Data-tier server benchmarks measure request
throughput, generally in units of database transactions
per minute. Throughput in turn depends upon other,
narrower capabilities of component subsystems, as
well as upon the way a server has been architected
from its component subsystems. Key transaction-
processing throughput-related subsystem parameters
are:
processor-to-memory bandwidth, specified in
gigabytes per second, or GB/s;
link bandwidth of inter-processor communication and
I/O expansion networks, specified in gigabits per
second, or Gbps;
bisection width, which is the number of links in the
weakest cross-sections of those networks;
input/output (IO) operations per second (IOPS) for
randomly accessed durable media; and
access bandwidth (often measured as megabytes per
second, or MB/s) of sequentially accessed
durable media.
Scalability: The ability of a server to handle ever greater
request-processing loads can be assessed in the
context of either a single server or a cluster of servers.
The former case represents scaling up and the latter,
scaling out. Servers that scale out support, out of their
I/O subsystems, application-visible network interfaces
into a high-bandwidth, low-latency, message-passing
interconnection network, such as InfiniBand. That
network supports a variety of clustering functions,
including distributed memory parallel processing,
cluster file system, and failover. This allows the
construction of scalable and highly available request-
processing solutions. Servers that scale up support,
out of their processor-memory subsystems,
application- and OS-invisible interfaces into a shared
memory network. That network supports a cache
coherence protocol — usually some variety of
directory-based cache coherence — among the
memory controllers associated with individual
processors or processor groups. An additional IOEN
(I/O expansion network), such as PCI Express or
StarGen, is then used in scaled-up servers to provide a
large number of I/O attachment points for the large
number of processors.
On-line transaction processing throughput can then be
scaled by partitioning the randomly-accessed data
across multiple data volumes (disk drives), with each
disk bringing additional IOPS capability. Further
throughput gains also accrue from the concurrent use
of multiple disks for sequentially accessed data, which
scales the total available disk bandwidth. Of course,
making use of either of these techniques requires the
use of an IOEN or IN with sufficient bisection width,
and scalable algorithms and protocols for concurrency
control and partitioning.
Availability: Organizations that automate business-critical
functions using servers seek to avoid or minimize
service outage, periods during which requests cannot
be serviced whether due to faults, planned
maintenance, or any other reason. Availability is often
measured by the number of leading 9s in the ratio
MTBF/(MTBF+MTTR). MTBF (mean time between
failures) is the inverse of the rate at which
components or subsystems fail; it is a reliability
measure. MTTR (mean time to recovery) measures
the degree of component redundancy in the system, as
well as measures the efficacy with which failed
components can be detected and then either repaired
in place or replaced by configuring in redundant
components. Highly available servers supporting 5 or
more 9s of availability provide fewer than 10 outage
minutes per year! Designs for achieving 6 or 7 9s of
availability are already in progress.
Critical transaction processing functions – database
writer, log writer, and transaction monitor – are
therefore often performed by process pairs,
comprising a primary and a backup process [1].
Periodically, and always before externalizing state
changes, the primary process check-points those
changes to its backup. If a software failure hits the
primary process, the backup process takes over
instantly, without having to reconstruct its state.
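As a quick sanity check on the nines arithmetic above, the following sketch computes availability and the corresponding outage minutes per year for an assumed MTBF of one year and MTTR of five minutes (the parameter values are illustrative only):

```python
def availability(mtbf_hours, mttr_hours):
    # Availability as defined above: MTBF / (MTBF + MTTR).
    return mtbf_hours / (mtbf_hours + mttr_hours)

def outage_minutes_per_year(avail):
    return (1.0 - avail) * 365.25 * 24 * 60

# One failure per year on average, five minutes to detect and fail over.
a = availability(mtbf_hours=365.25 * 24, mttr_hours=5 / 60.0)
print(a, outage_minutes_per_year(a))  # roughly five nines, ~5 outage minutes/year
```

Shortening MTTR, for example by avoiding state reconstruction through process-pair checkpointing, directly adds nines without any change in failure rate.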
Data Integrity: Business-critical servers, such as those that
handle credit-card transactions, need to guard against
accidental corruption of data due to software or
hardware malfunction. Data integrity is characterized
using metrics such as probability of SDC (silent data
corruption) in the stateful components of the server,
namely memory and storage, and the probability of
undetected bit errors in the various data paths, namely
the busses and the networks inside the server. The
lower these probabilities, the greater the data
integrity of the server.
The most common method of ensuring data integrity
is the duplicate-and-compare (D&C) approach, in
which the results of redundant computations, with
identical data and in identical state, are compared.
Failed comparisons indicate data corruption.
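The D&C approach can be sketched in a few lines. The injected fault below is simulated, standing in for a hardware malfunction corrupting one of the redundant computations:

```python
def duplicate_and_compare(fn, data):
    """Run the computation twice on identical input; a mismatch signals
    possible silent data corruption."""
    r1, r2 = fn(data), fn(data)
    if r1 != r2:
        raise RuntimeError("D&C mismatch: possible silent data corruption")
    return r1

# Normal case: both runs agree.
assert duplicate_and_compare(sum, [1, 2, 3]) == 6

# Simulated fault: corrupt the result of the second run only.
calls = {"n": 0}
def faulty_sum(data):
    calls["n"] += 1
    s = sum(data)
    return s ^ 1 if calls["n"] == 2 else s  # flip a bit on the second run

try:
    duplicate_and_compare(faulty_sum, [1, 2, 3])
    detected = False
except RuntimeError:
    detected = True
assert detected
```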
2 ODS Workload
ODS workloads tend to be insert heavy. The most
important consequences of an insert-heavy workload are (a)
high volume of check-point traffic between process pairs;
and (b) high volume of audit trail information.
Often, key portions of an ODS workload stream will
exhibit response-time critical (RTC) behavior. This occurs
whenever application dependencies force transaction
response time into the denominator of the transaction
throughput equation. For instance, certain applications
wait to issue further transactions until the previously
issued ones have committed, while demanding high
throughput per application thread, or from a small number
of threads.
Consider for instance the case of ODS for stock
exchanges. Streams of buy and sell orders arrive from
brokerage systems and must be queued and matched to
generate trades. As a throughput optimization technique,
multiple trades are often boxcarred into a single database
transaction. Regulatory dependencies push the response
time of prior trade transactions into the issue path of
subsequent transactions involving the same security,
which leads to an RTC behavior known as the Hot Stock
problem. On any given day, a few ‘headline stocks’ may
account for a sizeable portion of the day’s trading volume;
however, the throughput on such stocks is inversely
proportional to the response time of trade transactions. To
complicate things further, even though the average
throughput across securities in the same data partition
improves with greater degree of boxcarring, the
transaction response times lengthen, thereby limiting
throughput on hotly traded securities.
Modern development frameworks for stateful
applications, such as Entity Beans of Enterprise Java
Beans, offer a variety of persistence mechanisms under the
umbrella of object-relational mapping. In practice, with
most of these architectures, the issue rate (thereby the
throughput) of a single application server thread is
inversely related to the response time of database
operations. Legacy batch-oriented transaction-processing
applications likewise exhibit RTC-like behavior.
The long pole in completing and committing
transactions is the action of making the effects of their
data-manipulation operations durable. One key
consequence of response-time-criticality in insert-heavy
ODS workloads is therefore the need to write audit trails
with high throughput and low latency.
Inevitably, information is made durable by either
writing to disks or copying around in a clustered system
through checkpoints and multicasts. This is why the
completion time of at least one – and typically more than
one – disk I/O, and the round-trip times of several request-
reply message operations, are included in the response
time of every transaction that obeys the ACID
properties of transaction processing systems. A fast and
flexible alternative to disks – and to excessive data
copying in lieu of going to disks – is highly desirable for
achieving high performance on the aforementioned
applications.
3 Persistent Memory
3.1 Nonvolatility
Persistent memory (PM) is memory that is durable,
without refresh. Its contents survive the loss of system
power. It additionally provides durable, self-consistent
metadata in order to ensure continued access to data after
power loss or soft failures. It is designed to exploit high-
speed durable media emerging in response to demands
from server and storage industry, on the one hand, and
consumer device industry on the other.
There is an emerging market full of potential high-
speed successors to FLASH memory. These include
principally Magnetic RAM (MRAM) from Motorola [2],
Cypress, NVE, IBM-Infineon and Toshiba, as well as
OUM, TFE, FeRAM, McRAM and a host of other slower
contenders. While the current 4Mbit/chip MRAM
capacities are woefully inadequate for enterprise needs, a
number of battery-backed DRAM (BBDRAM) products
fill the ‘storage gap’ that exists between RAM and disk,
albeit at the cost of system complexity due to batteries and
backup media.
3.2 First-level I/O attachment
Disk-based storage sub-systems routinely incorporate
BBDRAM as write caches, and various research projects
are designing architectures to provide storage-semantic
access to MRAM. However, the performance of such
systems is hampered by (a) connectivity via second-level
interconnects (such as ATA, SCSI and Fibre Channel);
and (b) block-oriented access via storage-oriented
protocols and device drivers. The handling of SCSI
commands, DMA, interrupts and context switching results
in 100s of microseconds – usually milliseconds – of I/O
latency.
Both BBDRAM and MRAM allow for direct
connectivity to the CPU-memory sub-system [6]. Directly
connected PM may be accessed from user programs like
ordinary virtual memory, albeit at specially designated
process virtual addresses, using the CPU’s memory
instructions (Load and Store). Our first-generation PM
specifications ruled out this method of attachment
primarily because the memory falls in the same fault
domain as the CPU. Moreover, the semantics of store
instructions in microprocessors, and the associated
compiler optimizations, can also play havoc with
durability guarantees. Directly-connected PM will be an
attractive long term option, especially as the protection
and isolation features of shared memory interconnects and
microprocessors evolve further, and as compiler support
for persistence matures.
Our architecture instead specifies attaching PM to a
first-level I/O interconnect supporting both memory
mapping and memory-semantic access, as well as
providing adequate fault isolation between PM device and
CPU on the one hand, and between mirrored PM devices
on the other. Examples of suitable interconnects include
PCI Express, RDMA over IP, InfiniBand, and HP
ServerNet.
3.3 Access architecture
We call this type of PM device an NPMU (Network
Persistent Memory Unit). NPMUs are implemented as
network resources accessed using RDMA read and write
operations, or using equivalent semantics through an API.
While RDMA was originally proposed to facilitate device-
initiated transfers, host-initiated memory-semantic access to
network-attached resources is a uniquely attractive way to
address high-speed non-volatile memory. It incurs only
10s of microseconds of latency. Its byte-grained access
and alignment eliminate the costly read-modify-write
operations required by block devices. Coupled with
address translation capabilities of RNICs (RDMA-enabled
Network Interface Cards), memory semantics also
eliminate the costly marshalling-and-unmarshalling of
pointer-rich data required by conventional storage.
Due to its long latency, disk storage is best accessed
using asynchronous I/O techniques; persistent memory, on
the other hand, is fast enough to support synchronous
interfaces, especially for short access sizes.
In the near term, persistent memory will be accessed
using synchronous variants of RDMA read and write
interfaces similar to those described in VI Architecture
specification [3], InfiniBand Verbs layer [4], or ICS
Consortium’s IT API [5]. In the future, persistent memory
has the potential to bring storage truly under the memory
hierarchy. We will briefly speculate about those
possibilities in the final section of the paper.
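The shape of such a synchronous, byte-grained interface can be sketched as follows. The names `pm_read` and `pm_write` are invented for illustration, and an in-process bytearray stands in for the RDMA-addressable NPMU:

```python
class PersistentMemoryRegion:
    """Stand-in for an RDMA-addressable NPMU region; a bytearray plays
    the role of the remote non-volatile memory."""
    def __init__(self, size):
        self.mem = bytearray(size)

    def pm_write(self, offset, data):
        # Synchronous semantics: when this returns, the write is durable
        # (in hardware, the RDMA write has been acknowledged by the NIC).
        self.mem[offset:offset + len(data)] = data

    def pm_read(self, offset, length):
        return bytes(self.mem[offset:offset + length])

region = PersistentMemoryRegion(4096)
region.pm_write(128, b"audit record 17")  # byte-grained: no read-modify-write
assert region.pm_read(128, 15) == b"audit record 17"
```

Note what is absent relative to block storage: no sector alignment, no read-modify-write of partially written blocks, and no completion event to poll, since the call does not return until durability is assured.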
3.4 Benefits of Persistent Memory
We now examine how ODS stand to benefit from the
characteristics of PM, such as low latency, byte grain,
synchronous interfaces, and low overheads of making
pointer-rich data durable.
Low latency of making information durable: Clearly, the
throughput of a transaction processing system on RTC
workloads will benefit from the use of a low-latency
alternative to disk drives. The ability to flush changes
to persistent memory instead of disk drives will allow
these systems to commit transactions rapidly without
sacrificing strong serializability.
Efficient data movement between address spaces: By
extending the standard memory-address translation
logic of network interfaces with its unique meta-data
handling capabilities, persistent memory greatly
increases the efficiency with which richly-connected
data structures can be copied between address spaces.
Like standard remote direct memory access [5], and
like one-sided memory operations in message-passing
interface specification 2.0 [6], persistent memory
semantics decouple data movement from
synchronization. Second, persistent memory supports
a variety of hardware-assisted pointer-fixing schemes,
including bulk write-selective read and incremental
update-bulk read. Marshalling-unmarshalling of data
structures, whether for check-pointing between
process pairs or for the purpose of saving on durable
media, can be drastically reduced or eliminated. This
allows ODS data structures, such as database indices,
lock tables and transaction control blocks, to be
efficiently stored to durable media.
Enablement of fine-grained persistence: Since PM is fast
and flexible, it enables applications to persist data that
would have been too cumbersome and too expensive
to persist with the traditional I/O programming model.
Just as easily, PM also supports transactional updating
of persistent stores, with an access architecture not
dissimilar to the mmap() and msync() primitives of
memory-mapped files. The difference is in the
performance because PM accesses are synchronous
and can go directly from user-mode software to
hardware RNICs. A more important difference is that
PM semantics offer a road to a durable information
store that is completely integrated into the memory
hierarchy. In any case, being able to update indices,
lock tables and transaction control blocks at a fine
grain reduces uncertainty regarding the state of the
database, and eliminates costly heuristic searching of
audit trail information, leading to shorter MTTR,
which is the mantra for both better availability and
data integrity.
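The mmap()/msync() analogy above can be made concrete with an ordinary memory-mapped file; PM access would be similar in shape, but synchronous and without a kernel crossing per update:

```python
import mmap, os, tempfile

# Create a 4 KB file to back the mapping.
fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 4096)
os.close(fd)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)
    mm[0:10] = b"lock-entry"  # byte-grained, in-place update
    mm.flush()                # msync(): force the change to durable media
    mm.close()

with open(path, "rb") as f:
    assert f.read(10) == b"lock-entry"
```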
Toward a persistence architecture: Through its unique use
of RDMA-enabled interconnects to address devices
using host-initiated RDMA, and through its
enablement of byte-grain access, PM closes the
storage gap and provides a durable address space at
the same speed and flexibility as remote directly-
accessed memory. For ODS that support enterprise
information integration using EJB entity beans or
other stateful web services mechanisms, this starts to
take away some of the pain of using the most
advanced persistence mechanisms, such as container-
managed persistence. Moreover, by unifying remote-
memory-based and disk-based performance
mechanisms, PM allows ODS programmers to focus
on the specification — rather than the implementation
— of persistence. For instance, on insert-heavy ODS
workloads, newly inserted rows or row-sets would be
made persistent once when they enter the database
writer, by synchronously writing to the NPMU. This
would then eliminate the repeated, wasteful and
uncoordinated persistence actions taken by ODS
components as the inserted data pass first from the
database writer primary to backup, then as audit
‘delta’ from the database writer to the log writer, then
again from the log writer to its backup, from the
database writer to data volumes and from the log
writer to log volumes.
4 Experiences with a persistent memory
prototype
HP is currently developing NPMUs for its NonStop
Servers. These servers run the NonStop Kernel (NSK)
operating system. NonStop servers consist of up to 256
“nodes,” each a cluster of up to 16 MIPS processors and
associated I/O devices. The processors and I/O devices are
interconnected by a redundant ServerNet network. While
each node presents a single system image, the system does
not support shared memory. Instead, processes
communicate using messages over the ServerNet network.
I/O adapters also attach into the ServerNet network, and
are managed across the network by ‘driver’ processes
running on NSK processors. This separation between
device manager and devices guarantees that devices can
continue to function even if the controlling processor fails.
ServerNet is a true RDMA network, supporting both
RDMA read and write operations in hardware.
ServerNet’s software latency is between 10 and 20
microseconds, depending on the generation of ServerNet
technology utilized. Its address space structure is that of a
single-address-space RDMA network, in that each
ServerNet end point presents a 32-bit network virtual
memory address space to initiators on the network.
Critical NonStop system functions are implemented
using process pairs. In the event of failure, the fault
detection and message re-routing capabilities of NSK and
the NonStop hardware allow a backup process to take over
from its primary process in a second or less. This allows
NSK to survive failures without any loss of committed
data and with virtually no loss of availability.
4.1 Architecture of the NonStop PM
Persistent memory technology is deployed on
NonStop servers in three pieces. First is the NPMUs,
which are additional devices attached to ServerNet.
Second comes the client-side PM access library and its
supporting NSK components. Finally, there is the PM
Manager, which manages NPMUs.
The NPMUs look a lot like traditional NSK I/O
adapters, but with the addition of non-volatile RAM.
However, the usage model of NPMUs is quite different
from that of storage I/O adapters. I/O adapters traditionally
act as RDMA initiators; i.e., when NSK wishes to read a
block from disk, it sends a request to a SCSI adapter, the
CPU in the SCSI adapter initiates a read from disk into
adapter RAM, and when the disk operation is complete,
the CPU in the adapter initiates an RDMA write to send
the data to the NSK CPU. This makes sense because the
NSK CPU cannot directly interact with the disk since it is
on the other side of a SCSI interface. However, this model
inherently adds latency since a (potentially slow) CPU in
the adapter must be involved in data transfer. For PM,
however, there is no second-level device interaction
needed. Thus, it is possible for PM clients to directly read
and write memory in an NPMU, without any involvement
by a CPU in the NPMU. This inherently provides lower
latency than is possible with the typical I/O adapter usage
model.
To allow memory-like client access to PM, while still
providing data persistence, the NPMU must be managed
like a storage device. Therefore, our architecture uses a
Persistent Memory Manager (PMM) process pair for all
management functions. The PMM is a normal NSK
system process pair, and like other critical system services,
designed to transparently fail over from its primary to its
backup in the event of failure. Each PMM pair controls a
mirrored pair of NPMUs. (We use mirroring to survive
NPMU failures without data loss.) The PMM is charged
with managing access to the NPMUs, as well as with
managing their metadata. The metadata must be kept
consistent at all times in order to facilitate recovery should
the system fail. The metadata essentially consist of
information describing allocated portions of persistent
memory (e.g., owner, access rights, physical location in
PM, etc).
The final component of our persistent memory
system is the client access library. Clients access PM
volumes, which are roughly analogous to disk volumes.
Each volume is a single logical entity consisting of a
mirrored pair of NPMUs controlled by a single PMM
process pair. Each volume contains some number of PM
regions, which are the PM analog to files. Regions are
created by the PMM in response to “create” messages sent
from the client API to the PMM process. Once regions
have been created, they may be opened by one or more
clients.
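The volume/region model can be sketched as follows. All class and method names are invented for illustration; the actual NonStop client library is not described at the API level here.

```python
class PMVolume:
    """A PM volume: analogous to a disk volume; holds named regions
    (the PM analog of files) plus the metadata describing them."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.meta = {}      # region name -> (offset, length): PMM metadata
        self.next_free = 0

    def create_region(self, name, length):
        # In the real system the PMM records this metadata durably and
        # consistently before acknowledging the client's "create" message.
        self.meta[name] = (self.next_free, length)
        self.next_free += length

    def open_region(self, name):
        # Returns a window the client can read and write directly,
        # mimicking the NPMU address range mapped for RDMA access.
        off, length = self.meta[name]
        return memoryview(self.mem)[off:off + length]

vol = PMVolume(1 << 16)
vol.create_region("auditlog", 4096)
r = vol.open_region("auditlog")
r[0:5] = b"BEGIN"                     # direct, byte-grained client write
assert bytes(vol.open_region("auditlog")[0:5]) == b"BEGIN"
```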
When a region is “open,” the PMM maps a
contiguous range of the NPMU’s network virtual addresses to
its physical memory. This mapping exists in the address
translation hardware of the NPMU’s ServerNet interface.
It not only specifies address translation but also enforces a
limited form of access control, allowing the PMM to
specify which CPUs have access to a specific range of
the NPMU’s network virtual addresses.
Once a PM region has been opened by the PMM,
clients can perform RDMA read and write operations
directly to the NPMU memory comprising that region.
Therefore, the client API performs ServerNet RDMA read
or write operations directly to the NPMU device at one of
the NPMU virtual addresses within the allocated range. To
preserve data integrity the API writes data to both the
primary and mirror NPMUs; reads need not be replicated.
API operations are typically synchronous. This is
acceptable because they are expected to be small and fast,
and this simplifies the persistence semantics: when a call
returns successfully, the data is persistent; otherwise the call
returns an error. Note that this can be very efficient
because ServerNet packets are automatically
acknowledged in hardware, and when ServerNet transfer
completes without error, the packet is guaranteed to have
arrived in the remote NIC with a correct CRC.
4.2 NonStop PM Prototype
To obtain performance results before we had
hardware NPMUs, we created a prototype NPMU using an
NSK process to mimic the device. We call this prototype
NPMU a Persistent Memory Process (PMP). A PMP
allocates a large region of memory and exposes that
memory to ServerNet reads and writes in a manner similar
to a hardware NPMU. This gives the PMP all of the
performance characteristics of a hardware NPMU except
for the non-volatility. (We have since verified this claim,
and have found that a true hardware NPMU is actually
slightly faster than the PMPs used in the experiments
reported below.)
All other aspects of our prototype were designed to
be the same as for a hardware NPMU. So we implemented
a PMM and client library using ServerNet RDMA
read/write operations to access the PMP’s memory
directly.
To test the utility of persistent memory, we modified
NSK’s audit data process (ADP). ADP is the process pair
in NSK that functions as a log writer. The ADP
coordinates its actions with the NSK disk process (DP2)
and with NonStop transaction monitor facility (TMF), in
order to provide ACID properties and transactions. Our
modified ADP synchronously writes database log data to
persistent memory. Therefore, the database log is
persistent immediately, and transactions can commit faster
than if the log data had to be flushed to disk at commit
time. For scaling audit throughput, multiple ADPs can be
configured per node.
4.3 Description of the benchmark
To test the effect of PM on transaction processing
performance, we used a “hot-stock” benchmark developed
by Paul Denzinger [7]. This test consists of up to 4 driver
processes. Each driver represents a single hotly-traded
stock. The drivers each insert 32000 4K records. The
database consists of 4 files, each distributed across 4 disk
volumes (a total of 16 disk volumes were used). During each transaction, each driver performs a number of asynchronous inserts into each file. Transactions are committed between successive iterations to simulate regulatory ordering constraints.
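A scaled-down sketch of one driver's loop, assuming the boxcarring scheme described above; the function and parameter names are ours, and the record counts are reduced for illustration:

```python
def run_driver(db, driver_id, records_per_driver, inserts_per_txn, n_files=4):
    """One 'hot stock' driver: batches of asynchronous inserts
    (boxcarring), with a commit between successive batches.
    `db` maps file index -> list of committed records."""
    committed = 0
    pending = []
    for rec in range(records_per_driver):
        f = rec % n_files
        pending.append((f, (driver_id, rec)))     # asynchronous insert
        if len(pending) == inserts_per_txn:       # commit the boxcar
            for f, r in pending:
                db[f].append(r)
            committed += len(pending)
            pending = []
    return committed

db = {f: [] for f in range(4)}
# Scaled-down run: 128 records per driver, 8 inserts per transaction
assert run_driver(db, driver_id=0, records_per_driver=128,
                  inserts_per_txn=8) == 128
assert sum(len(v) for v in db.values()) == 128
```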
We ran the hot-stock benchmark on a 4-processor NSK S86000 system with varying amounts of boxcarring, and measured the total elapsed time both with and without PM-enabled ADPs. Because this benchmark generated a large amount of audit, we used 4 auxiliary audit volumes, one for each CPU. For the PM-enabled experiments we ran a PMP on a fifth CPU, and each ADP used a separate region of the PMP's memory.
4.4 Results obtained
We measured performance for the following
transaction sizes:
128K: 32 4-Kbyte inserts per transaction
64K: 16 4-Kbyte inserts per transaction
32K: 8 4-Kbyte inserts per transaction
We also ran the experiments with varying numbers of driver processes. Although we measured performance with 3 and 4 drivers, more than two hot stocks at a time is very rare in practice.
Figure 1 shows speedup in response time vs. transaction size (i.e., degree of boxcarring). Figure 2 shows the total elapsed time vs. transaction size.
0-7695-2132-0/04/$17.00 (C) 2004 IEEE
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)
4.5 Analysis of Results
Figure 1 demonstrates the improvement in response time with persistent memory. Response time was up to 3.5 times better with a PM-enabled ADP. The benefit of PM was greatest in the more common 1-2 hot-stock case, though there was improvement even with 3 or 4 hot stocks.
The effect of PM is shown even more dramatically in Figure 2. Here we plot the total time of the benchmark vs. the degree of boxcarring. Since the number of transactions was fixed, throughput is inversely proportional to total execution time. The graph shows that throughput with large boxcar sizes is adequate for the standard ADP, but as the amount of boxcarring decreases, throughput drops off sharply. For a PM-enabled ADP, throughput is virtually unaffected by the amount of boxcarring, so applications do not need to artificially combine operations in order to maintain throughput.
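A back-of-the-envelope cost model makes this shape plausible. The constants below are assumptions for illustration, not measurements: every record pays an insert cost, and every transaction pays one commit, so a disk-backed log is dominated by commit cost at small boxcar sizes while a PM-backed log is not.

```python
def elapsed(n_records, boxcar, t_insert, t_commit):
    """Total time: every record pays t_insert; every transaction
    (n_records / boxcar of them) pays one commit latency."""
    return n_records * t_insert + (n_records / boxcar) * t_commit

N = 32000
T_INSERT = 0.0001          # assumed per-insert cost (seconds)
T_DISK_COMMIT = 0.005      # commit waits on a disk flush
T_PM_COMMIT = 0.00001      # commit waits only on a PM write

disk_small = elapsed(N, 8, T_INSERT, T_DISK_COMMIT)
disk_large = elapsed(N, 32, T_INSERT, T_DISK_COMMIT)
pm_small = elapsed(N, 8, T_INSERT, T_PM_COMMIT)
pm_large = elapsed(N, 32, T_INSERT, T_PM_COMMIT)

# Shrinking the boxcar hurts the disk-based log far more than the PM log.
assert disk_small / disk_large > 2
assert pm_small / pm_large < 1.01
```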
5 Conclusions and future work
In the near future, fast and flexible persistence will enable high-performance, scalable ODSs with very low MTTR. In the longer term, PM has the potential to alter how durability is specified and optimized. We highlight two interesting possibilities below, but the technology described here has the potential to improve how servers work far more broadly.
5.1 Memory-mapped persistent memory
In Section 3.2, we mentioned that direct-connected PM is a long-term option. The access path for such memory is entirely hardware-based. Correct implementation requires compilers to optimize load and store instructions differently, and microprocessors not to complete stores to certain addresses in store buffers or on-chip caches. It is an open question whether overloading existing language mechanisms, such as the volatile qualifier of the C language, will suffice. There is also the issue of durable metadata, and of possibly automating metadata generation, for retrieving persistent, richly connected data structures into an address space different from the one from which they were stored.
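The store-buffer hazard can be illustrated with a toy model; this is ours, not a description of any real microarchitecture. Stores that have not been drained to the persistence domain are simply lost on a crash, which is why stores to PM addresses must not be allowed to linger in volatile buffers before durability is declared.

```python
class StoreBuffer:
    """Sketch of why direct-connected PM needs compiler/CPU cooperation:
    stores sit in a volatile store buffer until explicitly drained, so a
    crash loses anything not yet in the persistence domain."""
    def __init__(self):
        self.persistent = {}   # what survives a crash
        self.buffered = {}     # store buffer / on-chip cache contents

    def store(self, addr, value):
        self.buffered[addr] = value

    def drain(self):
        # e.g. a cache flush plus fence before declaring durability
        self.persistent.update(self.buffered)
        self.buffered.clear()

    def crash_and_recover(self):
        self.buffered.clear()  # volatile state is lost
        return dict(self.persistent)

mem = StoreBuffer()
mem.store("log[0]", "begin")
mem.drain()                    # "begin" reaches the persistence domain
mem.store("log[1]", "commit")  # never drained: still volatile
assert mem.crash_and_recover() == {"log[0]": "begin"}
```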
5.2 Compiler-assisted I/O
A real opportunity exists to bring the power of
compilers to bear on optimization of I/O. The
programming languages and compilers in commercial use
today do not have good abstractions for persistence.
Storage I/O is largely optimized by hand, and is
considered one of the core skills of server-side application
programmers. (Networking is the other; but that is the
topic for another paper!) Any hardware innovation that attempts to bridge the storage gap inevitably runs into the challenge of modifying the large body of data-tier applications to take advantage of the new technology.
Higher-level persistence models, such as J2EE’s, allow
persistence to be specified at a high level and to be
optimized automatically.
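As a toy illustration of such a higher-level model; this is not J2EE, just a sketch in the same spirit, with names of our own invention. Attribute assignments are transparently written through to a durable store, so application code never issues explicit I/O and the runtime is free to optimize it:

```python
def persistent(cls):
    """Toy 'container-managed' persistence: every attribute assignment on
    instances of the decorated class is written through to a durable
    store, without explicit I/O in application code."""
    store = {}
    original_setattr = cls.__setattr__

    def setattr_and_persist(self, name, value):
        original_setattr(self, name, value)  # normal in-memory update
        store[name] = value                  # write-through to the store

    cls.__setattr__ = setattr_and_persist
    cls._store = store
    return cls

@persistent
class Account:
    pass

acct = Account()
acct.balance = 100
acct.owner = "alice"
# The runtime captured every update; no explicit save() was needed.
assert Account._store == {"balance": 100, "owner": "alice"}
```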
Acknowledgements
The authors would like to thank Gary Hall and Jim Mills
for helping run the benchmark workloads; Bob Taylor and
Paul Denzinger, for introducing us to the Hot Stock
problem; and Gary Smith, for modifying NonStop ADP so
we could study the impact of PM on a real ODS. Karel
Youssefi and Barbara Tabaj helped us in procuring and
configuring the benchmark system.

Figure 1: PM improves response time drastically. [Chart: response-time speedup with PM, from 0x to 4x, vs. transaction size (32K, 64K, 128K; larger size = more boxcarring), with separate curves for 1, 2, 3 and 4 drivers.]

Figure 2: PM eliminates the need to boxcar. [Chart: elapsed time, 0 to 160 secs, vs. transaction size, with curves for 1 and 2 drivers, each with and without PM.]
REFERENCES
1. Gray, Jim, Why do computers stop and what can be done about it?, Tandem TR-85.7, Tandem Computers Inc., June 1985. (http://www.hpl.hp.com/techreports/tandem/TR-85.7.html)
2. Tehrani, S., et al., Progress and Outlook for MRAM Technology, IEEE Transactions on Magnetics, Vol. 35, No. 5, September 1999.
3. Compaq/Intel/Microsoft, Virtual Interface Architecture Specification, Version 1.0, December 1997.
4. InfiniBand Specification, Version 1.0a, InfiniBand Trade Association, June 2001. (http://www.infinibandta.org/home)
5. Hilland, Jeff, et al., RDMA Protocol Verbs Specification, Version 1.0, RDMA Consortium, April 2003. (http://www.rdmaconsortium.org)
6. MPI-2: Extensions to the Message-Passing Interface, University of Tennessee, Knoxville, TN, 1997. (http://www.mpi-forum.org)
7. Denzinger, P. and Taylor, B., Technical Challenges in SuperMontage, Proc. HP TechCon 2003, Boulder, CO, Hewlett-Packard, April 2003.