Decentralized Scalar Timestamp - USENIX



Unifying Timestamp with Transaction Ordering for MVCC with Decentralized Scalar Timestamp

Xingda Wei, Rong Chen, Haibo Chen, Zhaoguo Wang, Zhenhan Gong, Binyu Zang

Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University

Multi-versioning is important for transactions (TXs)

Data are stored in multiple copies (a.k.a. versions)

- e.g., updates do not overwrite old values (as they do in single-versioning)

Single-version database vs. multi-version (MV) database

[Figure: TX1 runs A += 1. The single-version database goes from A=1 to A=2; the MV database keeps Av0=1 and adds Av1=2.]

Higher concurrency between readers and writers

- The reader reads from the (consistent) past
- The writer concurrently updates the latest version

[Figure: MV database holding Av0=1 and Av1=2. Writer TX1 runs A += 1 while reader TX2 runs Print(A).]
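The reader/writer split above can be sketched in a few lines of Python. This is only an illustration: the `MVRecord` class, its method names, and the use of plain integers as versions are assumptions, not taken from the talk.

```python
import bisect


class MVRecord:
    """One data item kept as an append-only, sorted list of versions."""

    def __init__(self, value, version=0):
        self.versions = [version]  # sorted version numbers
        self.values = [value]      # values[i] belongs to versions[i]

    def write(self, version, value):
        # Writers add a new version instead of overwriting the old one.
        assert version > self.versions[-1], "versions must grow"
        self.versions.append(version)
        self.values.append(value)

    def read_at(self, version):
        # Readers pick the newest version that is <= the requested one.
        i = bisect.bisect_right(self.versions, version) - 1
        if i < 0:
            raise KeyError("no version old enough")
        return self.values[i]


a = MVRecord(1, version=0)         # Av0 = 1
a.write(1, a.read_at(0) + 1)       # TX1: A += 1 creates Av1 = 2
print(a.read_at(0), a.read_at(1))  # TX2 can still read the older snapshot
```

Because the old version stays in place, the reader never has to wait for the writer; the open question is how the version numbers themselves are chosen.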

How to assign version & choose version to read?


Deciding versions should be consistent & efficient

- Consistent = versions match the TXs' order, i.e., versions are sorted by TXs: v0 < v1 < v2 < ...
- Efficient = minimal impact on the TX

[Figure: timeline with TX1: Print(A), TX2: A += 1, TX3: A += 1, ...; A starts at Av0=0 and the writers create the later versions v1 and v2.]

A common approach: use timestamps

Yet, ordering timestamps is challenging in networked systems

- Recall the goals: consistency + efficiency

#1. Physical clock

✓ Nearly no cost

✘ Not consistent: unsynchronized clocks

[Figure: Server0 and Server1, connected by a network, each with its own clock.]
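A small worked example shows how unsynchronized clocks break consistency. The numbers, the 10-unit skew, and the helper `local_clock` are all made up for illustration.

```python
# Hypothetical clock offsets: Server1's clock runs 10 units behind Server0's.
SKEW = {"Server0": 0, "Server1": -10}


def local_clock(server, real_time):
    """Timestamp a server would read from its own (unsynchronized) clock."""
    return real_time + SKEW[server]


# TX1 commits on Server0 at real time 100; TX2 commits on Server1 at 105.
ts_tx1 = local_clock("Server0", 100)
ts_tx2 = local_clock("Server1", 105)

# TX2 happened after TX1, yet its timestamp is smaller.
print(ts_tx1, ts_tx2, ts_tx2 < ts_tx1)
```

If both TXs wrote the same item, the version order given by these timestamps would contradict the order in which the TXs actually ran.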

#2. Global logical clock (GTS)

✓ Easy to ensure consistency

✘ Non-trivial costs: extra network requests per TX

✘ Scalability bottleneck: the timestamp server handles all the requests

[Figure: TX1, TX2, ..., TXn each fetch the clock from a single timestamp server over the network.]

#3. Global vectorized clock (VTS)

- Per-worker clock

✓ Reduces requests to the global server

✘ Still needs the global server

✘ Non-scalable metadata: one clock per worker

[Figure: TX1, TX2, ..., TXn and a timestamp server connected over the network; the timestamp is a vector of clocks.]

Can we make the timestamps ordered, with (nearly) zero cost?

Observation: concurrency control (CC)

Recall: consistency = timestamps match the TXs' order

- Meanwhile, CC already orders TXs (e.g., 2PL, OCC)

Physical clocks are approximately good

- Hybrid clock: use the CC of read-write TXs to order physical clocks

This work: Decentralized Scalar Timestamp (DST)

A scalable scalar timestamp mechanism to enable MVCC

- Provides snapshot reads for read-only TXs (i.e., they bypass the original CC)

A consistent & efficient scalar hybrid physical-logical clock

- The reader uses the physical clock to read snapshots (bypassing CC)
- DST stays consistent by piggybacking on the read-write TX's CC

[Figure: without DST, the read-write TX1 and the read-only TX2 both go through concurrency control (CC; e.g., 2PL, OCC) at the data server. Between TX start and TX end, TX2 sends Read A over the network, and the server reads data (A) under CC and returns A.]

[Figure: with DST, the read-only TX2 chooses a version v and sends Read Av over the network between TX start and TX end; the data server serves the read without CC and returns Av.]

Use the physical clock as a tentative timestamp

- Minimal cost & scalable: scalar + no global timestamp

Using the (unsynchronized) physical clock alone is not sufficient for read-write TXs

- DST is consistent by piggybacking on the read-write TX's CC

[Figure: between TX start and TX end, the read-write TX1 writes Av at the data server under CC; piggybacking on CC updates v to be consistent.]

The physical clock as a tentative timestamp is also fresh: the snapshot has bounded staleness (e.g., ms-scale)

Decentralized (Scalar) Timestamp is not so new

Based on much prior work on decentralized timestamps

- TicToc@SIGMOD'16, Cicada@SIGMOD'17, etc.

Differently:

- DST is a general timestamp method for different CCs
- DST further targets MVCC in a distributed setting

Outline of the remaining content

- How to piggyback on CC for read-write TXs? DST rules
- How do snapshot reads bypass CC? DST reads
- How good is DST's snapshot read? Bounded staleness
- Evaluation results


DST rules for read-write TX

Inspired by the conflict relationships between TXs

Two TXs conflict if and only if:

TX access   Read         Write
Read        -            Read-Write
Write       Write-Read   Write-Write

Timestamps should be adjusted following these rules, too!

Write-Write

If TX2 has a write-write conflict with TX1

- Timestamp(TX2) should be > Timestamp(TX1)

Write-Write rule #1: A.ver = TX1

Write-Write rule #2: TX' = MAX(TX, A.ver) + 1, so that TX2' > TX1

Achieved by storing the (last-writer) timestamp with the data

[Figure: timeline where TX1 runs Write(A) on A (initially v0), then TX2 runs Write(A) and its timestamp is adjusted to TX2'.]
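The two Write-Write rules can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the `Record` class, the function names, and the clock values 100 and 90 are assumptions.

```python
class Record:
    """A data item that remembers the timestamp of its last writer."""

    def __init__(self, value, ver=0):
        self.value = value
        self.ver = ver


def adjust_for_write(tx_ts, record):
    # Rule #2: the writer's timestamp must exceed the last writer's.
    return max(tx_ts, record.ver) + 1


def install_write(tx_ts, record, value):
    # Rule #1: store the writer's timestamp together with the data.
    record.value = value
    record.ver = tx_ts


a = Record(value=0)

# TX1 starts from a fast local clock (100) and writes A.
tx1 = adjust_for_write(100, a)
install_write(tx1, a, 1)

# TX2 runs later but starts from a slow local clock (90) and also writes A.
tx2 = adjust_for_write(90, a)
install_write(tx2, a, 2)

print(tx1, tx2)
```

Even though TX2's local clock is behind TX1's, rule #2 lifts its timestamp above `A.ver`, so the timestamps agree with the order in which the two writes were applied.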

Example: Write-Write in 2PL

TX1: Write(A) under plain 2PL, over time:

- RPC Lock(A); the server (2PL) runs: A.lock();
- TX1 does other things …
- RPC Write & Unlock(A); the server (2PL) runs: A.write(…); A.unlock();

Example: Write-Write in 2PL + DST

TX1: Write(A) under 2PL + DST, over time:

- Start (DST): TS = clock();
- RPC Lock(A); the server (2PL + DST) runs: A.lock(); ret A.ver;
- DST at TX1: TS = MAX(TS, A.ver) + 1
- RPC Write & Unlock(A); the server (2PL + DST) runs: A.write(…); A.ver = TS; A.unlock();

DST can be simply piggybacked on 2PL in a lightweight way
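The 2PL + DST steps can be put together in a small single-process sketch. The two RPCs are modeled as plain method calls, and the `Record` class, `write_tx`, and the injected `clock` functions are hypothetical names used only for illustration.

```python
import threading
import time


class Record:
    def __init__(self, value):
        self.value = value
        self.ver = 0                  # timestamp of the last writer
        self._lock = threading.Lock()

    # Server side, first RPC: 2PL takes the lock, DST returns A.ver.
    def lock(self):
        self._lock.acquire()
        return self.ver

    # Server side, second RPC: 2PL writes and unlocks, DST stores TS.
    def write_and_unlock(self, value, ts):
        self.value = value
        self.ver = ts
        self._lock.release()


def write_tx(record, new_value, clock=time.time_ns):
    ts = clock()                            # Start: TS = clock()
    last_ver = record.lock()                # RPC: Lock(A), returns A.ver
    ts = max(ts, last_ver) + 1              # TS = MAX(TS, A.ver) + 1
    record.write_and_unlock(new_value, ts)  # RPC: Write & Unlock(A)
    return ts


a = Record(0)
first = write_tx(a, 1, clock=lambda: 100)   # writer with a fast clock
second = write_tx(a, 2, clock=lambda: 90)   # later writer with a slow clock
print(first, second, a.ver)
```

In this sketch the timestamp work rides on the two messages 2PL already sends: the lock reply carries `A.ver`, and the write request carries the final `TS`, so no extra round trip is added.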

DST rules for the other conflicts

- Read-Write: similar to Write-Write
- Write-Read: similar to Write-Write

Timestamps should be adjusted following these rules, too!


DST reads

Key benefit: bypass CC for the reader

- e.g., avoid read locks in 2PL

Can we use MVCC with DST the same way as with a global clock?

A plain MVCC read with DST, for TX1: Read(A), over time:

- Start (DST): TS = clock();
- RPC Read A@TS; the server (MVCC read) runs: read the latest version of A up to TS;

So, can MVCC with DST be used the same way as with a global clock?

- No, because the reader's timestamp does not go through CC!
- i.e., the DST rules are not applied to it, because the rules are piggybacked on CC!

DST reads (a lightweight rule for snapshots) fix this. For TX1: Read(A), over time:

- Start (DST): TS = clock();
- RPC Read A@TS; the server (MVCC read + DST) runs: A.ver = MAX(A.ver, TS); read the latest version of A up to TS;
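A minimal sketch of the DST read rule is given below, under stated assumptions: `A.ver` is treated as the highest timestamp seen on the record, versions are kept in a sorted list, and everything runs in one thread, so atomicity of the update is not modeled. The `MVRecord` class and its method names are hypothetical.

```python
import bisect


class MVRecord:
    def __init__(self, value):
        self.ver = 0            # A.ver: highest timestamp seen on this record
        self.versions = [0]     # sorted timestamps of committed versions
        self.values = [value]

    def snapshot_read(self, ts):
        # DST read rule: A.ver = MAX(A.ver, TS); the reader takes no CC lock.
        self.ver = max(self.ver, ts)
        # MVCC read: newest version whose timestamp is <= TS.
        i = bisect.bisect_right(self.versions, ts) - 1
        return self.values[i]

    def write(self, ts, value):
        # Write rule from the read-write path: TS = MAX(TS, A.ver) + 1.
        ts = max(ts, self.ver) + 1
        self.versions.append(ts)
        self.values.append(value)
        self.ver = ts
        return ts


a = MVRecord(value=1)
before = a.snapshot_read(ts=100)   # reader with local clock 100
w = a.write(ts=90, value=2)        # writer whose local clock is behind
after = a.snapshot_read(ts=100)    # same snapshot, read again
print(before, w, after)
```

Without the `max` in `snapshot_read`, the writer in this example would commit at timestamp 91 and the second read at 100 would return a different value than the first. With the rule, the write lands at 101 and the snapshot at 100 stays stable.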


Evaluation results of DST

DST works well with various CCs

- 2PL: MySQL cluster X DST
- OCC: DrTM+R [EuroSys'16] X DST
- ROCOCO: ROCOCO [OSDI'14] X DST

Also on various network settings

- Local clusters (TCP/IP), AWS cloud, RDMA


2PL X DST

Workloads: TPC-C, SmallBank

- DST ~= RC in performance
- Beneficial to write-mostly workloads: reduces locking between reads and writes

Cluster Name   CPU            Network   Num
Val            2 X 12 cores   10GbE     16

[Figures: SmallBank throughput (M TXs/sec) vs. number of clients; TPC-C throughput (K TXs/sec) vs. number of clients.]

OCC X DST

Workloads: TPC-E (read-mostly)

- DST also ~= RC in performance

Cluster Name   CPU                    Network             Num
AWS            r4.2xlarge (8 vCPU)    10GbE               32
VALR           2 X 12 cores           100Gbps IB (RDMA)   16

[Figures: latency (ms) vs. throughput (K reqs/sec), in two panels; the second panel is with RDMA.]

OCC X DST on TPC-E (read-mostly):

- Near zero-cost timestamp overhead

Conclusion

DST provides serializable & fresh MVCC in a networked setting

- Also generalizes to different CCs

Efficient & scalable: near zero-cost timestamp overhead

Thanks & Q&A!
