Archie: A Speculagve Replicated Transacgonal ... - Hyflow project

23
Archie: A Specula.ve Replicated Transac.onal System Sachin Hirve, Roberto Palmieri , Binoy Ravindran Systems So>ware Research Group Virginia Tech

Transcript of Archie: A Speculagve Replicated Transacgonal ... - Hyflow project

Archie:  A  Specula.ve  Replicated  Transac.onal  System  

Sachin  Hirve,  Roberto  Palmieri,  Binoy  Ravindran  Systems  So>ware  Research  Group  

Virginia  Tech  

2

What do we want for our transactional application?

Strong consistency

High availability

Great scalability in small/med size

High performance

Fault tolerance Low latency

Programmability

3

Commonly  used  building  blocks  

Full Replication Total Order (e.g., Paxos)

4

How  does  it  work  when  combined?    

Ordering  phase  Client    Request  (tx1)  

Client    Reply  

R2  R1  

L  

R3   R4  

msg  (tx1)   execu.on  of  (tx1)  

tx1  

tx1   tx2  

tx2  

5

…and  when  the  load  increases?  

Ordering  phase  Tx1  

Ordering  phase  Tx2  

Ordering  phase  Tx3  

Ordering  phase  Tx4  

Ordering  phase  Tx5  

Ordering  phase  Tx6  

Tx1  

Tx2  

Tx3  

Tx4  

Tx5  

Tx6  

6

Goal:  Boost  performance  in  the  common  case  

 Challenge:  

Transac.on  response  .me  =  total  order  latency  

7

An.cipate  the  work  

Ordering  phase   Tx1  

Tx1  

Ordering  phase   Tx2  

Tx2  

Ordering  phase   Tx3  

Tx3  

•  Exploita.on  of  an  early  no.fica.on  triggered  during  the  ordering  phase  (called  op#mis#c-­‐delivery)  

8

Reliable  Op.mis.c  Delivery  

•  Avoid  any  mismatch  between  the  op.mis.c  delivery  order  and  total  order  

•  Maximize  the  .me  between  the  op.mis.c  delivery  and  the  establishment  of  the  total  order  

Ordering  phase   Tx1  

Ordering  phase  

Tx2  

Ordering  phase  

Tx3  

Tx1  

Tx2  

Tx3  

Tx1  Tx2  Tx3  

Op.mis.c  Order  

Tx1  Tx2  Tx3  

Total  Order  

Stable  leader  

9

MIMOX  

•  Mul.  Paxos  +  Reliable  Op.mis.c  Delivery  – Exploit  the  leader’s  batching  .me  for  maximizing  the  overlapping  .me  

– Exploit  the  leader’s  knowledge  for  making  the  op.mis.c  delivery  reliable  

•  Tag  messages  with  leader’s  expected  order  

10

MIMOX:  Batch  crea.on  

R2  

R1  L  

Client  Requests  

Batching  window  

1  

11

MIMOX:  Batch  broadcast  

R2  

R1  L  

Client  Requests  

1   1  1  

1  

1  2   2  

2  

3   3  

3  

Batching  window  

Op$mis$c  Delivery  

Op$mis$c  Delivery  

1  2  3  

1  2  3  

12

MIMOX:  Performance  

90

95

100

105

110

115

120

125

130

3 5 7 9 11 13 15 17 19

1000x

msg

s per

sec

Nodes

10 bytes20 bytes50 bytes

0

2

4

6

8

10

12

14

3 5 7 9 11 13 15 17 19m

illis

eco

nds

(mse

c)

Nodes

10 bytes20 bytes50 bytes

13

Transac.on  Processing  

•  Exploit  parallelism  •  Commit  in  order  •  Minimal  overhead  for  conflict  detec.on  and  resolu.on  

Ordering  phase   Tx1  

Ordering  phase  

Tx2  

Ordering  phase  

Tx3  

Tx1  

Tx2  

Tx3  

Tx1  Tx2  Tx3  

Op.mis.c  Order  

Tx1  Tx2  Tx3  

Total  Order  

14

Key  points  of  ParSpec  

•  Reduce  problem’s  complexity  by  ac.va.ng  MaxSpec  transac.ons  at-­‐a-­‐.me  – meta-­‐data’s  size  is  fixed    – set  of  possible  conflic.ng  transac.ons  is  limited  

•  Parallel  execu.on  but  in-­‐order  specula.ve  commit  (x-­‐commit)  

15

ParSpec:  details  T1   T2   T3   T4  

T2  (X):  Begin   T3  (X):  Begin  

Read(X)  

0   0   0   0  

X:  Read  bit-­‐array  

0   0   0   0  

0   1   1   0  

0   0   0   0  

Abort  bit-­‐array  

0   0   0   0  

0   0   0   0  

Write(X)   0   0   0   0  

Wait  

Read(X)   0   0   1   0   0   0   0   0  

X-­‐commit  Abort;  Re-­‐execute  

0   1   1   0  

Op.mis.c  Order:  

0   0   1   0   0   0   0   0  

MaxSpec  =  4  T5   T6   T7   T8  

0   0   1   0  OR  bitwise  

16

Is  this  enough  for  achieving  the  goal?  

•  Unfortunately  not…  

Ordering  phase  Tx1  

Δ  

Valida.on  ReadSet  

Commit  WriteSet  

≈Δ  

17

Solu.on  for  Write  Set  

•  Specula.ve  versions  are  associated  with  an  tenta.ve  commit  .mestamp,  which  is  the  expected  commit  .mestamp  in  case  of  no  mismatch  between  op.mis.c  and  total  order  

Ordering  phase  Tx1  

Valida.on  Read  Set  

Commit  Write  Set  

TS++  

18

Solu.on  for  Read  Set  

•  If  the  leader  is  not  either  crashed  or  suspected  (it  is  stable),  then  there  is  no  chance  of  mismatch  between  op.mis.c  and  total  order  

•  ParSpec  commits  according  to  op.mis.c  order  and  make  them  visible  during  the  total  order  

Ordering  phase  Tx1  

Valida.on  ReadSet  

?  

19

We  made  it!  

TS++  Ordering  phase  Tx1  

20

Evalua.on  

•  Testbed  –  PRObE  cluster  (19  nodes)  –  AMD  Opteron  6272,  64-­‐core,  2.1  GHz  CPU  –  128  GB  RAM  and  40  Gbps  Ethernet  

•  Benchmarks  –  Bank,  TPC-­‐C  and  Vaca.on  

•  Compe.tors  –  PaxosSTM:  DUR-­‐based  approach;  it  suffers  from  remote  aborts  –  SM-­‐DER:  Post  final  delivery  single-­‐thread  execu.on  –  HiperTM:  SM-­‐DER  based  single  thread  execu.on  –  Archie-­‐FD:  Archie  but  post  final  delivery  parallel  processing  

21

Evalua.on:  TPC-­‐C  

Throughput  19  warehouses  

0

5000

10000

15000

20000

25000

3 5 7 9 11 13 15 17 19

Tran

sact

ions

per

sec

Replicas

SMRab

ARCHIE-FDARCHIE

0

5000

10000

15000

20000

25000

3 5 7 9 11 13 15 17 19

Tran

sact

ions

per

sec

Replicas

SMRab

ARCHIE-FDARCHIE

0

5000

10000

15000

20000

25000

3 5 7 9 11 13 15 17 19

Tran

sact

ions

per

sec

Replicas

SMRHiperTM

PaxosSTM

ARCHIE-FDARCHIE

0

5000

10000

15000

20000

25000

3 5 7 9 11 13 15 17 19

Tran

sact

ions

per

sec

Replicas

SMRHiperTM

PaxosSTM

ARCHIE-FDARCHIE

0

5

10

15

20

25

3 5 7 9 11 13 15 17 19

1000

x Tr

ansa

ctio

ns p

er s

ec

Replicas

SM-DERHiperTM

PaxosSTM

ARCHIE-FDARCHIE

20

30

40

50

60

70

80

90

100

1 19 100

% X

-co

mm

itte

d t

ran

sa

ctio

ns

Number of warehouses

3 nodes11 nodes19 nodesThroughput  

100  warehouses  

22

Evalua.on:  Distributed  Vaca.on  

Conten.on  level  

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

100 250 500 750 1000

Tra

nsa

ctio

ns

per

sec

Number of relations

SM-DER

HiperTM

PaxosSTM

ARCHIE-FD

ARCHIE

23

Thanks!

Questions?

Research project’s web-site: www.hyflow.org