Rachid Guerraoui Luis Rodrigues
-
Upload
khangminh22 -
Category
Documents
-
view
6 -
download
0
Transcript of Rachid Guerraoui Luis Rodrigues
2
Roadmap
(1) Overview and basic abstractionsâ Overviewâ Broadcastâ Consensus
(2) Advanced abstractionsâ Motivation: database replicationâ Atomic commitâ Total order broadcast
3
Programming abstractions
Sequential programmingâ Array, set, record ,..
Concurrent programmingâ Thread, semaphore, monitor
Distributed programmingâ ???
4
Programming abstractions
Sequential programmingâ Array, set, record ,..
Concurrent programmingâ Thread, semaphore, monitor
Distributed programmingâ Reliable broadcast, shared memory,
consensus, etc
6
Reliable broadcastCausal order broadcast
Shared memoryConsensus
Total order broadcastAtomic commitLeader election
Terminating reliable broadcast
7
Underlying abstractions
(1): processes (abstracting computers)
(2): channels (abstracting networks)
(3): failure detectors (abstracting time)
8
Processes
The distributed system is made of a finite set S of processes p1,..pNEach process models a sequentialprogramProcesses have unique identities andknow each otherIn this tutorial, processes can only failby crashing and do not recover after a crash: a correct process does not crash
9
Processes
The program of a process is made of a finite set of modules (or components) organized as a software stackModules within the same processinteract by exchanging eventsupon event < Event1, att1, att2,..> doâ // somethingâ trigger < Event2, att1, att2,..>
10
Modules of a process
(deliver)
indication
request
(deliver)
indication
(deliver)
indicationrequest
request
11
Channels
We assume reliable links (also calledperfect) throughout this tutorialRoughly speaking, reliable links ensurethat messages exchanged betweencorrect processes are not lostMessages are uniquely identified andthe message identifier includes the senderâs identifier
12
Channels
Two primitives: pp2pSend and pp2pdeliverProperties:â PL1. Validity: If pi and pj are correct, then every
message sent by pi to pj is eventually delivered by pj
â PL2. No duplication: No message is delivered (to a process) more than once
â PL3. No creation: No message is deliveredunless it was sent
13
Failure Detection
A failure detector is a distributedoracle that provides processes withsuspicions about crashed processesIt is implemented using (i.e., itencapsulates) timing assumptionsAccording to the timing assumptions, the suspicions can be accurate or not
14
Failure Detection
We assume in this tutorial perfectfailure detection:â Strong Completeness: Eventually, every
process that crashes is permanentlysuspected by every correct process
â Strong Accuracy: No process is suspectedbefore it crashes
15
Roadmap
(1) Overview and basic abstractionsâ Overviewâ Broadcastâ Consensus
(2) Advanced abstractionsâ Motivation: database replicationâ Atomic commitâ Total order broadcast
16
Broadcast
Broadcast is useful for instance in applications where some processessubscribe to events published by otherprocesses (e.g., stocks), and require somereliability guarantees from the broadcastservice (we say sometimes quality of serviceâ QoS) that the underlying network does notprovideBroadcast is also useful for (database) replication (as we will see later)
17
Broadcast
We shall consider three forms ofreliability for a broadcast primitiveâ (a) Best-effort broadcastâ (b) (Regular) reliable broadcastâ (c) Uniform (reliable) broadcast
We shall give first specifications andthen algorithms
18
Best-effort broadcast (beb)
Eventsâ Request: <bebBroadcast, m>â Indication: <bebDeliver, src, m>
Properties: BEB1, BEB2, BEB3
19
Best-effort broadcast (beb)
Propertiesâ BEB1. Validity: If pi and pj are correct,
then every message broadcast by pi iseventually delivered by pj
â BEB2. No duplication: No message isdelivered more than once
â BEB3. No creation: No message isdelivered unless it was broadcast
22
Reliable broadcast (rb)
Eventsâ Request: <rbBroadcast, m>â Indication: <rbDeliver, src, m>
Properties: RB1, RB2, RB3, RB4
23
Reliable broadcast (rb)
Propertiesâ RB1 = BEB1.â RB2 = BEB2.â RB3 = BEB3.â RB4. Agreement: For any message m, if a
correct process delivers m, then everycorrect process delivers m
27
Uniform broadcast (urb)
Eventsâ Request: <urbBroadcast, m>â Indication: <urbDeliver, src, m>
Properties: URB1, URB2, URB3, URB4
28
Uniform broadcast (urb)
Propertiesâ URB1 = BEB1.â URB2 = BEB2.â URB3 = BEB3.â URB4. Uniform Agreement: For any
message m, if a process delivers m, thenevery correct process delivers m
29
Uniform reliable broadcast
m1
m1
crashm2
delivery delivery
m2
crash
delivery delivery
deliverydeliveryp1
p2
p3
31
Algorithm (beb)
Implements: BestEffortBroadcast (beb).
Uses: PerfectLinks (pp2p).
upon event < bebBroadcast, m> doforall pi in S do
trigger < pp2pSend, pi, m>;
upon event < pp2pDeliver, pi, m> dotrigger < bebDeliver, pi, m>;
34
Algorithm (rb)
Implements: ReliableBroadcast (rb).
Uses:BestEffortBroadcast (beb).PerfectFailureDetector (P).
upon event < Init > dodelivered := â ; correct := S;forall pi â S do from[pi] := empty;
35
Algorithm (rb â contâd)
upon event < rbBroadcast, m> dodelivered := delivered U {m}; trigger < rbDeliver, self, m>;trigger < bebBroadcast, [Data,self,m]>;
upon event < crash, pi > docorrect := correct \ {pi};forall [pj,m] â from[pi] do
trigger < bebBroadcast,[Data,pj,m]>;
36
Algorithm (rb â contâd)
upon event < bebDeliver, pi, [Data,pj,m]> do if m â delivered then
delivered := delivered U {m}; trigger < rbDeliver, pj, m>; if pi â correct then
trigger < bebBroadcast,[Data,pj,m]>; else
from[pi] := from[pi] U {[pj,m]};
39
Algorithm (urb)
Implements: uniformBroadcast (urb).
Uses:BestEffortBroadcast (beb).
PerfectFailureDetector (P).
40
Algorithm (urb)
upon event < Init > docorrect := S;
delivered := forward := empty;
ack[Message] := â ;
upon event < crash, pi > docorrect := correct \ {pi};
41
Algorithm (urb â contâd)
upon event < urbBroadcast, m> doforward := forward U {[self,m]};
trigger < bebBroadcast, [Data,self,m]>;
upon event (for any [pj,m] in forward) <correct âack[m]> and <m â delivered> dodelivered := delivered U {m};
trigger < urbDeliver, pj, m>;
44
Roadmap
(1) Overview and basic abstractionsâ Overviewâ Broadcastâ Consensus
(2) Advanced abstractionsâ Motivation: database replicationâ Atomic commitâ Total order broadcast
45
Consensus
In the consensus problem, the processes propose values and have to agree on one among these values Solving consensus is key to solving many problems in distributed computing (e.g., total order broadcast, atomic commit, terminating reliable broadcast)
46
Consensus
C1. Validity: Any value decided is a value proposed C2. Agreement: No two correct processes decide differently C3. Termination: Every correct process eventually decidesC4. Integrity: No process decides twice
48
Uniform consensus
C1. Validity: Any value decided is a value proposed C2â. Uniform Agreement: No two processes decide differently C3. Termination: Every correct process eventually decidesC4. Integrity: No process decides twice
51
Algorithm (consensus)
The processes go through rounds incrementally (1 to n): in each round, the process with the id corresponding to that round is the leader of the roundThe leader of a round decides its current proposal and broadcasts it to allA process that is not leader in a round waits (a) to deliver the proposal of the leader in that round to adopt it, or (b) to suspect the leader
52
Algorithm (consensus)
Implements: Consensus (cons).
Uses: BestEffortBroadcast (beb).PerfectFailureDetector (P).
upon event < Init > dodetected := â ; round := 1; currentProposal := nil;
53
Algorithm (consensus contÂŽd)
upon event < crash, pi > dodetected := detected U {pi};
upon event < Propose, v> and currentProposal= nil docurrentProposal := v;
upon event pround=self and currentProposalïżœnil dotrigger <Decide, currentProposal>; trigger <bebBroadcast, currentProposal>;
54
Algorithm (consensus contÂŽd)
upon event < bebDeliver, pround, value > docurrentProposal := value; round := round + 1;
upon event pround in detected do round := round + 1;
57
Algorithm (uCons)
Implements: UniformConsensus (uCons).
Uses:BestEffortBroadcast (beb).ReliableBroadcast(rb).PerfectFailureDetector (P).PerfectPointToPointLinks(pp2p);
58
Algorithm (uCons)
upon event < Init > dodetected := ackSet:= â ; round := 1; currentProposal := nil;
upon event < Propose, v> and currentProposal = nil docurrentProposal := v;
upon event < crash, pi > dodetected := detected U {pi};
59
Algorithm (uCons contÂŽd)
upon event pround=self and currentProposalïżœnil dotrigger <bebBroadcast, currentProposal>;
upon event < bebDeliver, pround, value > docurrentProposal := value;
trigger <pp2pSend, pround>; round := round + 1;
upon event pround in detected doround := round + 1;
60
Algorithm (uCons contÂŽd)
upon event < pp2pDeliver, pi, round > doackSet := ackSet U pi;
upon event ackSet âȘ detected = S dotrigger <rbBroadcast, Decided, currentProposal>;
upon event < rbDeliver, pi,Decided,v > dotrigger <Decide, v>;
61
Roadmap(1) Overview and basic abstractionsâ Overviewâ Broadcastâ Consensus
(2) Advanced abstractionsâ Motivation: database replicationâ Atomic commitâ Total order broadcast
62
Example: database replication
To show how to apply the abstractions presented before to build concrete applications
To introduce a number of relevant variants of the consensus abstractionâ Atomic commitâ Total order broadcast
63
Databases
Fundamental building block of todayÂŽsinformation systemsRepository of informationData accessed in the context of a transaction.
64
Transactions
Sequence of operations on data that is executed as an atomic unit.Operations:â Read itemâ Write item
Sequenced of operations bounded by begin_transaction and end_transactionoperators.
65
Transactions
Example: T1â begin_transaction
âą local_variable = Read (A);âą local_variable = local_variable *2âą Write (A)âą Write (B)
â end_transaction
66
Transactions: ACID properties
Atomicityâ All operations execute or none executes
Consistencyâ Transactions leave the database in a consistent
stateIsolationâ Transactions execute as if there is no other
transaction accessing the databaseDurabilityâ Transaction results persist across failures
67
Transactions: serializability
The result of a concurrent execution of transactions must be equivalent to the execution of those transactions in some serial order
Ta Tc Tx Td Th
68
Trasactions: serializability
Transactions that do not conflict may be executed concurrentlyOrder must be enforced on transactions that conflictâ i.e., that access the same data items
Serializability ensured by concurrency control mechanismâ Two-phase locking, timestamps, etc
69
Transactions: two-phase locking
When a transaction accesses a data item, it acquires a lockâ Read lockâ Write lock
If the item is already locked, the transaction waitsWhen the transaction terminates, all locks acquired are released
70
Transactions: termination
If the transaction concludes all its operations with success, the transaction commitsIf the transaction cannot conclude, the transaction aborts
71
Transactions: deadlockT1 Reads (A) T1 Writes (B)
A is locked! T1 waits for T2T2 Reads (B) T2 Writes (A)
B is locked! T2 waits for T1
Both transactions are waiting for each other!Usually resolved by timeout and by aborting one or both transactionsâ (it is also possible to execute a deadlock
detection algorithm)
72
Database replication
If there is a single copy of the database, updates may be lost if there is a failure in that copy:â Hardware failureâ Natural disasterâ Enemy attack
73
Database replication
Solution: database replication!â To keep multiple copies of the databaseâ Ensure that all updates are performed at
every copyâą Zero loss!
74
Database replication: pros and cons
Disadvantageâ Update operations are more expensive
âą Need to update all copiesAdvantagesâ Read operations are more efficient
âą Data is localâ Non-conflicting operations may potentially be
executed by different copiesFor certain loads, replication can improve the performance of the databaseâ In addition to provide fault-tolerance
75
Database replication: how to?
A protocol to ensure that all replicas are updatedâ Updates are not instantaneous
Protocol to ensure that replication is transparent to clientsâ Consistency protocolâ One-copy equivalence
76
NaĂŻve approach
Transform any write operation in a sequence of write operations (one for each copy of the item)â For instance:
âą Write (A) -> Write (A1); Write (A2); Write (A3)
Read any of the copies Use local concurrency control at each replica to detect conflicts
77
ExampleA1 = 1 (V=1)
B1 = 3 (V=1)
A1 = 1 (V=1)
B1 = 2 (V=2)begin_transaction Replica 1local_variable = Read (A);local_variable = local_variable *2Write (A)
A2 = 1 (V=1)
B2 = 3 (V=1)
A2 = 1 (V=1)
B2 = 2 (V=2)end_transaction Replica 2
local_variable = 0 1local_variable = 2local_variable =A3 = 1 (V=1)
B3 = 3 (V=1)
A3 = 1 (V=1)
B3 = 2 (V=2)Replica 3
78
Distributed transaction
Transaction updates items (more precisely replicas of items) at different copies.Transaction must be atomicâ All replicas are updated or no replica is
updatedDistributed atomic commitment!
79
Roadmap(1) Overview and basic abstractionsâ Overviewâ Broadcastâ Consensus
(2) Advanced abstractionsâ Motivation: database replicationâ Atomic commitâ Total order broadcast
80
Distributed atomic commitment
Variant of the consensus abstraction presented before.Processes propose one of the following values:â OKâ NOK
Result:â Commit if all nodes propose OKâ Abort it at least one node proposes NOK or if at
least one node fails
81
Non-Blocking Atomic Commit
NBAC1. Agreement: No two processes decide differently NBAC2. Termination: Every correct processeventually decidesNBAC3. Commit-Validity: 1 can only bedecided if all processes propose 1NBAC4. Abort-Validity: 0 can only bedecided if some process crashes or votes 0
82
Non-blocking atomic commitment
Propose: OK/NOK Decide: Commit/Abort
Atomic commitment
Propose: Commit/Abort Decide: Commit/Abort
Consensus
83
Algorithm (NBAC)
upon event < nbacPropose | v >trigger < bebBroadcast | v >
upon event < bebDeliver | p, NOK >trigger < ucPropose | Abort >
upon event < crash | p >trigger < ucPropose | Abort>
84
Algorithm (NBAC contÂŽd)
upon event < bebDeliver | p, OK >if all processes have sent OK then
trigger < ucPropose | Commit >
upon event < ucDecide | result >trigger < nbacDecide | result >
85
Non-blocking atomic commitment
decide (commit) decide (commit) decide (commit)
[ok,ok,ok][ok,ok,ok] [ok,ok,ok]
Propose (commit)Propose (commit)Propose (commit)
Consensus
86
Non-blocking atomic commitment
decide (abort) decide (abort) decide (abort)
[ok,ok,nok][ok,ok,nok] [ok,ok,nok]
Propose (abort)Propose (abort)Propose (abort)
Consensus
87
Non-blocking atomic commitment
decide (commit) decide (commit)
[ok,ok,ok] [ok,ok,?]
Propose (abort)
Propose (commit)
Consensus
88
Consensus based atomic commitment vs two-phase commit
Two-phase commit is a widely used protocol that implements atomic commitmentIt is coordinator basedIf the coordinator fails, unlike the solution above, two-phase commits blocks!
91
NBAC enough?
Read-one/ write-all + distributed atomic commitment can ensure correctness of replicated databaseHowever, this solution may suffer from performance problemsâ When multiple transactions try to access
the same data
93
Replication and deadlock
Even with a single data item, replication may cause a distributed deadlockThis may occur if different transactions lock different replicas in different ordersSolution:â Ensure that all replicas are accessed in the
same order!
94
Roadmap(1) Overview and basic abstractionsâ Overviewâ Broadcastâ Consensus
(2) Advanced abstractionsâ Motivation: database replicationâ Atomic commitâ Total order broadcast
95
Total order broadcast
Broadcast primitive that ensures that all processes receive the same set of messages in exactly the same orderAlso called atomic broadcastCan also be expressed as a variant of consensusâ Processes need to agree on the order they
will deliver messages
96
Total order
Validity: If pi and pj are correct, then every message broadcast by pi is eventually delivered by pj
No duplication: No message is delivered more thanonce
No creation: No message is delivered unless it wasbroadcast
(Uniform) Agreement: For any message m. If a correct (any) process delivers m, then every correct process delivers m
97
Total order (contâd)
(Uniform) Total order:
Let m and mâ be any two messages.
Let pi be any (correct) process that deliversm without having delivered mâ
Then no (correct) process delivers mâ beforem
98
Total order broadcast: algorithm
Each process keeps two sets of messages:â Unorderedâ Ordered
Messages are first disseminated using an (unordered) uniform reliable broadcast primitive:â This ensures that all processes receive the same
messages.â Messages received this way are placed in the
unordered set.
99
Total order broadcast: algorithm
How to move from the unordered to the ordered set?â Use a sequence of consensus
First instance of consensus decides which messages will take sequence number 1Second instace of consensus decides which messages will take sequence number 2Etc
100
Total order broadcast: algorithm
while (true) dowait until unordered set is not emptysn := sn + 1decision = propose (sn, unordered);deliver (decision) // in deterministic orderunordered := unordered \ decisionordered := ordered + decision
102
Algorithm (toBroadcast)
Implements: TotalOrder (to).
Uses:ReliableBroadcast (rb).
Consensus (cons);
upon event < Init > dounordered = delivered = empty;
wait := false;
sn := 1;
103
Algorithm (toBroadcast)
upon event < toBroadcast, m> dotrigger < rbBroadcast, m>;
upon event <rbDeliver,sm,m> and (m not in delivered) dounordered := unordered U {(sm,m)};
upon (unordered not empty) and not(wait) dowait := true: trigger < Propose, unordered>sn;
104
Algorithm (toBroadcast)
upon event <Decide,decided>sn dounordered := unordered \ decided;ordered := deterministicSort(decided);for all (sm,m) in ordered:
trigger < toDeliver,sm,m>; ÂŽ delivered := delivered U {m};
sn : = sn + 1; wait := false;
105
Replicated state-machine
The replicated state-machine approach is a fundamental technique to replicated deterministic serversA server is characterized by a state-machine if, at any point, it state depends only of:â Its initial stateâ The sequence of (deterministic) commands it has
executed
106
Replicated state-machine
A state-machine can be replicated by:â Running a replica of the state machine in a
different nodeâ Make all replicas have the same initial
stateâ Distribute all commands using a total order
broadcast
108
Database state-machine (DBSM)
Change the state machine scheduler to ensure deterministic processing of SQL commandsBroadcast SQL commands in total order to all replicas
109
Database state-machine (DBSM, middleware solution)
To avoid changing the databaseâŠIntroduce replicated proxy between clients and the database.Clients send SQL commands to proxies using atomic broadcastEach proxy submits write operations, one by one, to the attached database.
110
Disadvantage of DBSM
Requires the execution of a total order broadcast every time the client executes a SQL command.Excessive use of total order broadcast may limit the throughput of the system.
111
Optimistic approaches
Transactions are executed in a single replica without explicit synchronization with the remaining replicas until commit time.At commit time, control information is sent in a single atomic broadcast to all replicas.Based on the control information, all replicas apply a deterministic certification step.The certification step checks if the transaction is serializable.â Depending on the control information, the certification step
may require further communication.â The certification step decides the fate of the transaction
(commit or abort).
112
Optimistic approach (non-voting)
Transactions execute locally in a node called the delegate node.â Different transactions may execute in different delegate
nodes.â Locks are acquired locally at the delegate node.
When the transaction is ready to commit, read and write sets are sent to all replicas using atomic broadcast.Total order defines precedence order for concurrent transaction.â If read set is no longer valid, all replicas abort.â If read set is still valid, all replicas commit and apply the write
set.
113
Optimistic approach (voting)
(same as before)When the transaction is ready to commit, write set is sent to all replicas using atomic broadcast.Total order defines precedence order for concurrent transaction.Delegate node checks the read set.â If read set is no longer valid, decides abort.â If read set is still valid, decides commit.
Decision is broadcast to all replicas.
114
Optimistic approach
R1 R2 R34 45 5
Local execution
Commits/Aborts
Delegate certificates transaction
145
2 3 4 5Total OrderBroadcast Decision
115
Database replication in the real-world
Most commercial solutions only offer lazy-replicationâ Updates are applied in a master copy and
propagated later to other copiesâ Does not ensure strong consistencyâ Updates may be lost or become
unavailable
116
Database replication in the real-world
Due to the distributed deadlock problem, synchronous replication typically works as follows:â All updates are serialized in a single replica and
immediately copied to the backupsâ Two-phase commit is used
Disadvantages:â Suffer from the blocking problemâ To not take advantage of multiple nodes for
processing updates
117
Database replication in the real-world
Recent approaches take advantages of the abstractions presented in this tutorial:â Emic offers a commercial solution for
MySQL that implements the state machine approach
â INRIA has developed a middleware solution for clusters of databases using a similar approach
118
Concluding remark (1)
Some people tend to dismiss the use of distributed programming abstractions due to performance costs.In fact, some of the abstraction described in this tutorial are expensive.There are known lower bounds for the number of steps required to solve some problems:â The cost is not an artifact of the implementation
119
Concluding remark (1)
Therefore:â If you do not need an abstraction, donât use
it!Always use the weaker abstract needed by your application:â Check if it possible to weaken the
requirements.On the other handâŠ
120
Concluding remark (1)
If you do need an abstraction, use it!â The problem is not going to vanish just
because you decide to implement the solution entangled with the application code.
â Lower bounds will still be there.â Avoid the âdenial trapâ.
121
Concluding remark (1)
The implementation of these abstractions is difficultâ There are many subtle issues, some of
them have been addressed in this tutorial.The use of a modular approach simplifies the task of optimizing the performance of the systemâ By selecting the most appropriate
implementation of the abstraction.
122
Concluding remark (2)
Historically, abstractions have been firstexperimented as libraries ofprogramming languages and thenbecame, in some form, constructs ofthe languageâ Example: from concurrency libraries of
C++ to synchronized methods in Java
123
Concluding remark (2)
We provide a library or distributedprogramming abstractions: AppiaAppia is the descendent of Isis, Horus, Amoeba, Transis, Totem, Arjuna, Bast, PsyncWhat about a (reliable) distributedprogramming language?