2B-SSD: The Case for Dual, Byte- and Block-Addressable ...
-
Upload
khangminh22 -
Category
Documents
-
view
0 -
download
0
Transcript of 2B-SSD: The Case for Dual, Byte- and Block-Addressable ...
2B-SSD: The Case for Dual, Byte- and Block-Addressable
Solid-State Drives
Duck-Ho Bae, Insoon Jo, Youra Adel Choi, Joo-Young Hwang,
Sangyeun Cho, Dong-Gi Lee, Jaeheon Jeong
Memory Business, Samsung Electronics
NVRAMOS 2018
Memory Hierarchies
HDD / SSD
DRAM
Block I/F
Conventional Two-level Hierarchy
PM
DRAM
Memory I/F
Persistent Memory
Store
SSD
DRAM PM
Heterogeneous Memory
(DRAM + PM)
SSD
DRAM
PM
Hybrid Store (PM in SSD)
2/21
Heterogeneous Memory System
Its NVRAM (a battery-backed DRAM) requires expensive and complex logic
e.g., External power source, FPGA for power failure detection & recovery
In this memory system, NVRAM is used as write buffers
DRAM-like fast, yet persistent
Perfect fit for database logging and file system journaling due to lazy flushing
e.g., 2X speed up with Microsoft SQL Server 2016
SSD-based hybrid store can be a better solution for this scenario!
3/21
SSD-based Hybrid Store
Why? NVMe SSDs already have right ingredients to realize persistent memory
Memory interface: PCIe interconnect
Persistent memory (PM): a portion of internal DRAM plus additional capacitors
(No external power source!)
Power failure handling: Added logic to SSD controller (No newly added FPGA!)
Moreover, an internal datapath between PM and NAND flash can be built
Typically, logs and journals are written as bytes, but read by large chunk later
4/21
Overall Architecture
SRAM
Channel #0
NAND
…
Channel #(nch-1)
NAND
… NAND
… NAND
Byte access
Block I/O access
PC
Ie in
terf
ace
DRAM PM
NVMe controller
ATU
FMC
FMC
ARM cores LBA
checker
Read DMA
PLP
BAR manager
Read DMA Engine
Recovery manager
BA-buffer manager Mapping table
start_LBA EID length 1 2
start_offset
Hardware component
Software component
5/21
BAR Manager
NAND
DRAM Buffer
NAND NAND
Application
Host Virtual Address
Host System
BAR1
NAND
NVMe I/F
LOAD/STORE (BYTE)
Read/Write (PAGE)
1. Opens up a memory space
visible to CPU
BAR manager enables an additional
BAR (BAR1) for byte granule file access
BAR: “How the device advertise the
amount of address range it needs”
2. Redirects memory accesses from CPU into internal DRAM • The device is responsible for mapping
internal resources to the host-visible memory ranges
• BAR manager employs an address translation unit (ATU)
6/21
BAR Manager (contd.)
NAND
DRAM Buffer
NAND NAND
Application
Host Virtual Address
Host System
BAR1
NAND
NVMe I/F
LOAD/STORE (BYTE)
Read/Write (PAGE)
3. Exploits write combining (WC) mode of the underlying CPU • Individual writes are combined into a larger
burst in CPU’s WC buffer • 64 bytes in size in current x86 CPUs
• It leads to a significant reduction of memory accesses
Register store
store
store WC buffer
7/21
BA-buffer Manager
NAND
DRAM Buffer
NAND NAND
Application
Host Virtual Address
Host System
BAR1
NAND
NVMe I/F
LOAD/STORE (BYTE)
Read/Write (PAGE)
Maintains a memory hierarchy of DRAM and NAND • BA-buffer logic runs on an ARM core
within 2B-SSD
• The BA-buffer management APIs are designed to enable applications to allocate memory on the BA-buffer, and read and write files using them
• The mapping table stores information between DRAM addresses and NAND data
• (1) entry_id, (2) start_offset in the BA-buffer, (3) start_LBA of a given file, and (4) length
8/21
Recovery Manager
Recovery manager to turn the BA-buffer into a persistent
memory consists of
• Additional capacitance large enough to save BA-buffer contents and the BA-buffer mapping table in a reserved area of the NAND flash memory before 2B-SSD turns completely off
• Recovery logic that runs data protection procedures launched by power loss detection circuitry
9/21
Usage and Durability Guarantee
Two steps for ensuring ordering and durability of writes
WC buffer → Root Complex
Root Complex → BA-buffer
CPU
BA-Buffer Root
Complex
Caches
WC Buffer PCIe
② clflush() & mfence() ③ BA-SYNC()
NAND Flash
① BA-PIN()
④ BA-FLUSH()
10/21
2B-SSD API
BA_PIN(EID, offset, LBA, length)
BA_FLUSH(EID)
BA_SYNC(EID)
BA_GET_ENTRY_INFO(EID)
BA_READ_DMA(EID, dst, length)
11/21
Experimental Setup
• H/W setup
• Basic performance results • Write/read latencies, write/Read bandwidth
• Application level results • Database logging (PostgreSQL, RocksDB, Redis)
System Dell PowerEdge R730 server
CPU 2 Intel Xeon(R) CPU E5-2699
(18 threads per socket) @2.30GHz
Memory 256 GiB DRAM
OS 64-bit Ubuntu 14.04
SSD DC-SSD (PM963), ULL-SSD (SZ985), 2B-SSD
12/21
2B-SSD
Write Latency (QD1, 4KB random write)
6.6X 13.2 μs
Z-SSD (SZ985)
4KB
2B-SSD prototype implemented on Samsung Z-SSD
4KB
2 μs
8B
630 ns
1KB
930 ns
2B-SSD
Z-SSD (SZ985) performance
• 4KB Random Read Latency ~13us
• 4KB Random Write Latency ~10us
Electrolytic capacitor
Z-NAND flash memory
DRAM
PCIe interface
Z-NAND flash
memory 2B-SSD Ctrl
Item Description
Host interface PCIe Gen.3 x4 (3.2GB/s), NVMe 1.2
Device density 800 GB
Storage medium Samsung Z-NAND flash memory
Capacitance of capacitors 270 uF x3
BA-buffer size 8 MB
Max entries of BA-buffer 8
13/21
Two Separate Datapaths on 2B-SSD
NAND
DRAM Buffer
NAND NAND
Application
Host Virtual Address
Host System
BAR1
NAND
NVMe I/F
LOAD/STORE (BYTE)
Read/Write (PAGE)
• Write Latency
4KB
2 μs
10 μs
Block 8B
630 ns
1KB
930 ns
Memory 4KB
• Read Latency
4KB
151 μs
13 μs
Block I/O 8B
2 μs
256B
10 μs
Memory 4KB 4KB
w/ DMA
58 μs
14/21
Write Bandwidth
ULL-SSD > 2B-SSD
• Latest NVMe SSDs exploit hardware-automated datapath for optimized block I/O
• Internal datapath between BA-buffer and NAND flash are excluded from this automation
2B-SSD > DC-SSD
16/21
Read Latency
MMIO < ULL-SSD
150us
13.2us
PCIe Read is non-posted (sync) transaction
MMIO < DC-SSD
17/21
Read Latency
ULL-SSD < Read DMA (13us vs. 58us)
The read DMA engine helps
accelerate slow memory read.
Reading by DMA is faster than
MMIO, but still slower than
block I/O.
Non-DMA > DMA
18/21
Case Study: Database Logging
- Dell R730, Xeon [email protected] * 36, 256GB DDR4, Ubuntu 14.04 (kernel v4.6.3), PgSQL v9.6.0, RocksDB v5.1.4, Redis 3.2.4, SZ985 storage with ext4 mounted, 64 clients (Linkbench), 64B payload size (YCSB-A)
: Asynchronous logging
Linkbench on PostgreSQL
SYNC Commit
BA Commit
X1.7
43K
74K
YCSB-A on RocksDB
X2.8
27K
76K
YCSB-A on Redis
X1.8
15K
27K
SYNC Commit
BA Commit SYNC Commit
BA Commit
20/21
Conclusions
• This paper described the motivation, design, and implementation of a byte- and block-addressable solid-state drive.
• Through 2B-SSD APIs, applications can write and read any number of bytes on it without forcing the data being buffered in the host memory.
• We demonstrate the results where major database engines can see a throughput gain of up to 2.8X without the risk of data loss.
21/21