When Ceph Meets SPDK

Hequan Zhang

2016.11.19

R&D Core

• Team A: from leading Internet companies; major contributors to the Ceph community in China

• Team B: storage product R&D from leading IT vendors

About XSKY | 星辰天合

• Founded in May 2015; headquartered in Beijing, with an R&D center in Shenzhen

• Backed by Northern Light Venture Capital and Redpoint Ventures; ¥72 million raised ahead of the Series A round

• 70+ employees, ~50 in R&D and services

• Company vision: deliver enterprise-ready distributed software-defined storage products and help customers transform their data center architecture.

• Products: X-EBS distributed block storage, X-CBS cloud computing backend storage, X-EOS distributed object storage, and more

About Us

Future Ready SDS

Outline

• Introduction

• Background

• SPDK introduction

• BlueStore

• Conclusion


OpenStack Use Cases

About Ceph


What is Ceph?

• Object, block, and file storage in a single cluster

• All components scale horizontally

• No single point of failure

• Hardware agnostic, commodity hardware

• Self-managing whenever possible

• Open source (LGPL)

• “Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.”

Ceph Components

Outline

• Introduction

• Background

• SPDK introduction

• BlueStore

• Conclusion

Background

• Ceph’s storage service performs poorly on fast hardware

• Ceph’s architecture was designed around slow storage devices (millisecond-level latency)

• Fast devices are appearing in both the network and the storage layers

• Network: 10G/25G/40G/100G (low performance -> high performance)

• Storage: HDD -> SATA SSD -> PCIe SSD -> NVDIMM (high latency -> low latency)

• Challenge: Ceph’s software design and implementation becomes the bottleneck

• Once equipped with these fast devices, the software needs to be reworked to approach the limits of the hardware.

Potential solutions

Invent a new ObjectStore/FileStore design and implementation along the following lines:

• API change: synchronous APIs -> asynchronous APIs (POSIX -> non-POSIX)

• Benefit: higher throughput by keeping several requests in flight instead of completing one at a time (see the sketch after this list).

• I/O stack optimization: replace kernel I/O stacks with user-space stacks (e.g., network I/O, storage I/O)

• Benefit: no context switches, no data copies between kernel and user space, and a shift from a lock-based to a lock-free architecture

• SPDK (Storage Performance Development Kit, https://www.spdk.io/) provides a set of libraries that address these issues.
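To make the synchronous-vs-asynchronous point concrete, here is a minimal, self-contained C++ sketch. It is not Ceph code; the FakeDevice type and function names are invented for illustration. A blocking, POSIX-style write keeps the device queue depth at 1, while an asynchronous submit/poll interface keeps a whole batch in flight and reaps completions without blocking system calls.

```cpp
// Illustrative only (not Ceph/SPDK code): contrasts a blocking, POSIX-style
// write pattern with an asynchronous submit/poll interface.  The "device"
// here is a simple in-memory queue so the example is self-contained.
#include <cstdint>
#include <deque>
#include <functional>
#include <iostream>
#include <vector>

using Completion = std::function<void(int result)>;

struct FakeDevice {
    std::deque<Completion> inflight;

    // Asynchronous submit: queue the request and return immediately,
    // so the caller can keep many requests outstanding.
    void submit_write(uint64_t /*offset*/, const std::vector<uint8_t>& /*data*/,
                      Completion on_done) {
        inflight.push_back(std::move(on_done));
    }

    // Poll-mode completion: reap up to max finished requests, no blocking.
    int poll_completions(int max) {
        int reaped = 0;
        while (reaped < max && !inflight.empty()) {
            inflight.front()(0);          // invoke completion callback, result 0
            inflight.pop_front();
            ++reaped;
        }
        return reaped;
    }
};

int main() {
    FakeDevice dev;
    const int batch = 64;
    int done = 0;

    // Queue a whole batch before waiting for anything (queue depth = 64),
    // instead of the sync pattern: write(); wait; write(); wait; ...
    for (int i = 0; i < batch; ++i)
        dev.submit_write(i * 4096, std::vector<uint8_t>(4096, 0),
                         [&done](int) { ++done; });

    while (done < batch)
        dev.poll_completions(16);

    std::cout << "completed " << done << " writes\n";
}
```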

Outline

• Introduction

• Background

• SPDK introduction

• BlueStore

• Conclusion

SPDK Introduction

Built on the Intel® Data Plane Development Kit (DPDK)

Software infrastructure to accelerate packet input/output to the Intel CPU

User-space Network Services (UNS)

A TCP/IP stack implemented as a polling, lock-light library, bypassing kernel bottlenecks and enabling scalability

User-space NVMe, Intel® Xeon®/Intel® Atom™ Processor DMA, and Linux* AIO drivers

Optimize back-end driver performance and prevent kernel bottlenecks from forming at the back end of the I/O chain

*Other names and brands may be claimed as the property of others.

SPDK Architecture Overview

Extends Data Plane Development Kit concepts into an end-to-end storage context
• Optimized, user-space lockless polling in the NIC driver, TCP/IP stack, iSCSI target, and NVMe driver (see the NVMe sketch after this slide)

• iSCSI and NVMe over Fabrics targets integrated

Exposes the performance potential of current and next-generation storage media
• As media latencies move from low microseconds toward nanoseconds, storage software architectures must keep up

• Permissive open source license for the user-space media drivers: the NVMe & CBDMA drivers are on github.com

• Media drivers support both Linux* and FreeBSD*

NVMf Application and Protocol Library:
• Provisioning, fabric interface processing, memory allocation, fabric connection handling, RDMA data transfer

• Discovery, subsystems, logical controller, capsule processing, management interface with the NVMe driver library

*Other names and brands may be claimed as the property of others.

https://software.intel.com/en-us/articles/introduction-to-the-storage-performance-development-kit-spdk
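The core of the user-space NVMe driver is a poll-mode flow: probe and attach a controller, allocate an I/O queue pair, submit commands, and busy-poll for completions entirely in user space. The sketch below follows SPDK's public NVMe API (spdk_nvme_probe, spdk_nvme_ctrlr_alloc_io_qpair, spdk_nvme_ns_cmd_read, spdk_nvme_qpair_process_completions), but the exact signatures have changed across releases since this 2016 talk, so treat it as an approximation of the pattern rather than copy-paste code.

```cpp
// Sketch of the SPDK user-space, poll-mode NVMe pattern: probe/attach a
// controller, allocate an I/O queue pair, submit a read, and poll for the
// completion from user space (no interrupts, no syscalls on the I/O path).
// Signatures follow recent SPDK documentation and may differ from 2016-era
// releases; treat this as an approximation of the flow, not exact code.
#include <cstdio>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr = nullptr;
static bool g_done = false;

static bool probe_cb(void *, const struct spdk_nvme_transport_id *,
                     struct spdk_nvme_ctrlr_opts *) {
    return true;                       // attach to every NVMe controller found
}

static void attach_cb(void *, const struct spdk_nvme_transport_id *,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *) {
    g_ctrlr = ctrlr;
}

static void read_done(void *, const struct spdk_nvme_cpl *) {
    g_done = true;                     // completion callback runs inside the poll call
}

int main() {
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);
    if (spdk_env_init(&opts) != 0) return 1;        // hugepages + PCI access setup

    // Enumerate NVMe devices; the kernel driver must be unbound (e.g. vfio/uio).
    if (spdk_nvme_probe(nullptr, nullptr, probe_cb, attach_cb, nullptr) != 0 || !g_ctrlr)
        return 1;

    struct spdk_nvme_ns *ns = spdk_nvme_ctrlr_get_ns(g_ctrlr, 1);
    struct spdk_nvme_qpair *qp =
        spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, nullptr, 0);

    // DMA-able buffer for one sector.
    void *buf = spdk_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, nullptr,
                             SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

    spdk_nvme_ns_cmd_read(ns, qp, buf, /*lba=*/0, /*lba_count=*/1,
                          read_done, nullptr, 0);

    while (!g_done)                                   // poll-mode: busy-poll the queue
        spdk_nvme_qpair_process_completions(qp, 0);   // 0 = reap as many as available

    std::printf("read completed\n");
    spdk_free(buf);
    return 0;
}
```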

Performance comparison: user-space NVMe driver vs. kernel NVMe driver

4 KB random read performance: 4 x NVMe drives, single-core Intel® Xeon® processor

The SPDK NVMe driver delivers up to 6x the performance of the kernel NVMe driver with a single-core Intel® Xeon® processor

Disclaimer: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

From the 10/28 SPDK meetup

4 KB random read performance: 1-4 NVMe drives, single-core Intel® Xeon® processor

The SPDK NVMe driver scales linearly in performance from 1 to 4 NVMe drives with a single-core Intel® Xeon® processor


From the 10/28 SPDK meetup

What can SPDK do to improve Ceph?

• Accelerate the back-end I/O in the Ceph OSD (object storage daemon)

• Key solution: replace the kernel driver with the user-space NVMe driver provided by SPDK to accelerate I/O on NVMe SSDs.

• Accelerate network (TCP/IP) performance on Ceph’s internal network

• Key solution: replace the existing kernel network stack in each OSD node with DPDK plus a user-space TCP/IP stack (e.g., LIBUNS, Seastar, mTCP, etc.); a hedged configuration sketch follows this list.
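Both integration points ultimately surface as Ceph configuration. The fragment below is a hedged sketch only: it assumes an OSD built with the SPDK-backed NVMe device and the DPDK messenger, uses option names from Ceph's BlueStore and AsyncMessenger documentation, and leaves the device identifier and addresses as placeholders, since the exact spdk: path format and the available ms_dpdk_* options vary by Ceph release.

```ini
# Hedged ceph.conf sketch for an OSD node; exact option names and the
# spdk: device-path format depend on the Ceph release in use.
[osd]
# BlueStore on the user-space SPDK NVMe driver instead of the kernel driver.
osd_objectstore = bluestore
bluestore_block_path = spdk:<nvme-device-identifier>   # placeholder identifier

# AsyncMessenger on the DPDK user-space TCP/IP stack instead of the kernel stack.
ms_type = async+dpdk
ms_dpdk_coremask = 0x3                                  # cores dedicated to DPDK workers
ms_dpdk_host_ipv4_addr = 192.168.0.10                   # placeholder addresses
ms_dpdk_gateway_ipv4_addr = 192.168.0.1
ms_dpdk_netmask_ipv4_addr = 255.255.255.0
```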

Outline

• Introduction

• Background

• SPDK introduction

• BlueStore

• Conclusion

BlueStore

• BlueStore = Block + NewStore

Consumes raw block device(s)

Key/value database (RocksDB) for metadata

Data written directly to the block device

Pluggable block allocator (a rough sketch of this write path follows the diagram below)

[Diagram: the BlueStore stack. ObjectStore -> BlueStore -> BlockDevice -> Device (kernel driver or user-space NVMe driver); RocksDB -> BlueRocksEnv -> BlueFS on the same BlueStore; a pluggable Allocator; metadata and data take separate paths]
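The essence of the design is that object data bypasses any file system and goes straight to raw block space, while the metadata that locates it is committed through a key/value database. The sketch below is not Ceph code: BlockDev, KVStore and Allocator are toy in-memory stand-ins for BlueStore's BlockDevice, RocksDB and block allocator, kept minimal so the example runs on its own.

```cpp
// Rough sketch of the BlueStore idea described above (not actual Ceph code):
// data extents are written directly to the raw block device, while the object
// metadata that locates them is committed to a key/value store.
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Extent { uint64_t offset; uint64_t length; };       // location on raw device

struct BlockDev {                                           // stand-in for BlockDevice
    std::vector<uint8_t> space = std::vector<uint8_t>(1 << 20);
    void write(uint64_t off, const std::vector<uint8_t>& d) {
        std::copy(d.begin(), d.end(), space.begin() + off);
    }
};

struct KVStore {                                            // stand-in for RocksDB
    std::map<std::string, std::string> kv;
    void commit(const std::string& key, const std::string& val) { kv[key] = val; }
};

struct Allocator {                                          // trivial bump allocator
    uint64_t next = 0;
    Extent allocate(uint64_t len) { Extent e{next, len}; next += len; return e; }
};

// Write path: allocate space, write the data, then persist the extent map as
// metadata in the key/value store.  No file system sits underneath.
void write_object(BlockDev& dev, KVStore& db, Allocator& alloc,
                  const std::string& oid, const std::vector<uint8_t>& data) {
    Extent e = alloc.allocate(data.size());
    dev.write(e.offset, data);                              // data -> raw block device
    db.commit("onode." + oid,                               // metadata -> key/value DB
              std::to_string(e.offset) + ":" + std::to_string(e.length));
}

int main() {
    BlockDev dev; KVStore db; Allocator alloc;
    write_object(dev, db, alloc, "obj1", std::vector<uint8_t>(4096, 0xab));
}
```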

Performance Status: Sequential Write (HDD)

From Sage's 06/21 talk

Performance Status: Random Write (HDD)

From Sage's 06/21 talk

Performance Status: Sequential Read (HDD)

From Sage's 06/21 talk

Performance Status: Random Read (HDD)

From Sage's 06/21 talk

BlueStore Architecture

DPDK-Messenger Plugin

[Diagram: AsyncMessenger / AsyncConnection on top of a pluggable NetworkStack. PosixStack runs Posix workers over the kernel; DPDKStack runs DPDK workers, each with a user-space TCP/IP/ARP stack over RTE and the DPDK PMD]

DPDK-Messenger Design

• TCP, IP, ARP, DPDKDevice:

• Hardware feature offloads

• Ported from the Seastar TCP/IP stack

• Integrated with Ceph's libraries

• Event-driven:

• UserspaceEventCenter (epoll-like); a minimal sketch follows this list

• NetworkStack API:

• Basic network interface with zero-copy or non-zero-copy paths

• Ensures PosixStack <-> DPDKStack compatibility

• AsyncMessenger:
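The event-driven piece works like epoll-based programming, except that readiness is discovered by polling user-space drivers rather than by sleeping in the kernel. The sketch below is a toy illustration, not Ceph's EventCenter: PollableSource stands in for a DPDK RX queue or an SPDK queue pair, and in a real deployment each worker thread would run the polling loop pinned to its own core.

```cpp
// Minimal sketch of the "event center" idea: the same callback-driven model
// as epoll, but completions/packets are found by polling user-space drivers
// instead of blocking in the kernel.  Names are illustrative, not Ceph's.
#include <functional>
#include <vector>

struct Event { int source_id; };

struct PollableSource {
    int id;
    // Returns events that are ready; a real backend would call the DPDK PMD
    // (rte_eth_rx_burst) or spdk_nvme_qpair_process_completions here.
    std::function<std::vector<Event>()> poll;
};

class UserspaceEventCenter {
    std::vector<PollableSource> sources_;
    std::vector<std::function<void(const Event&)>> handlers_;
public:
    int register_source(PollableSource src, std::function<void(const Event&)> h) {
        src.id = static_cast<int>(sources_.size());
        sources_.push_back(std::move(src));
        handlers_.push_back(std::move(h));
        return sources_.back().id;
    }
    // One worker thread runs this loop, pinned to a core: no epoll_wait, no
    // context switch; it simply keeps polling every registered source.
    void run_one_pass() {
        for (auto& src : sources_)
            for (const Event& ev : src.poll())
                handlers_[src.id](ev);
    }
};

int main() {
    UserspaceEventCenter center;
    bool got = false;
    center.register_source(
        PollableSource{0, [] { return std::vector<Event>{{0}}; }},
        [&got](const Event&) { got = true; });
    while (!got) center.run_one_pass();
}
```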

DPDK-Messenger Improvements

[Chart: messenger throughput, dpdk vs. posix stacks, at message counts from 5M to 160M]

DPDK-Messenger Open Source

https://github.com/ceph/ceph/pull/10748

NVMe Device

• Status:

• Userspace NVMe library (SPDK) already in the Ceph master branch

• DPDK integrated

• I/O data flows from the NIC (DPDK mbuf) to the device

Details

[Diagram of the full OSD data path: AsyncMessenger / AsyncConnection over NetworkStack (PosixStack with Posix workers on the kernel, or DPDKStack with DPDK workers) feeds the PGs and BlueStore, which sits on a Device backend (KernelDevice or NVMeDevice)]

Improvements

Userspace NVMe Driver Open Source

https://github.com/ceph/ceph/pull/7145

Outline

• Introduction

• Background

• SPDK introduction

• BlueStore

• Conclusion

Summary

• There are performance issues in Ceph with the emerging fast network and storage devices.

The storage system needs to be refactored to catch up with the hardware.

Ceph is expected to move toward a share-nothing implementation.

• We mainly introduced SPDK and BlueStore to address the current issues in Ceph.

SPDK: libraries (e.g., the user-space NVMe driver) that can be used for performance acceleration.

BlueStore: a new store implementing a lockless, asynchronous, high-performance storage service.

• Lots of details still need work (coming soon).

THANK YOU

2016.11.19