Transcript


00:00
all right today's topic is distributed
00:03
transactions and these come in really two
00:15
implementation pieces and that's how
00:17
I'll cover them the first big piece is
00:20
concurrency control the second is atomic
00:28
commit and the reason why distributed
00:35
transactions come up is that it's very
00:36
frequent for people with large amounts
00:38
of data to end up splitting or sharding
00:41
the data over many different servers so
00:43
maybe if you're running a bank for
00:45
example the bank balances for half your
00:49
customers are one server and the bank
00:51
balances for the other half are on a
00:53
different server you'd do that to split
00:54
the load both the processing load and
00:56
the space requirements this comes up for
01:00
other things too maybe you're recording
01:02
vote counts on articles at a website you
01:05
know the maybe there's so many millions
01:07
millions of articles half the vote
01:08
counts are on one server and
01:11
half the vote counts are on another but some
01:15
operations require touching modifying or
01:18
reading data on multiple different
01:20
servers so if we're doing a bank
01:21
transfer from one customer into another
01:23
well their balances may be on different
01:25
servers and therefore in order to do the
01:27
transfer we have to modify data read and
01:29
write data on two different servers and
01:33
we'd really like to or one way building
01:37
these systems and we'll see others later
01:39
on in the course one way to build the
01:40
system is to try to hide the complexity
01:43
of splitting this data across multiple
01:46
servers try to hide it from the
01:47
application programmer and this is like
01:51
traditionally has been a database
01:53
concern for for many decades and so a
01:56
lot of today's material originated with
01:58
databases but the ideas have been used
02:00
much more widely in distributed systems
02:03
which you wouldn't necessarily call a
02:04
traditional database the way people sort
02:09
of usually package up concurrency
02:13
control plus
02:16
atomic commit is in an abstraction called a
02:23
transaction which we've seen before and
02:30
the idea is that the programmer you know
02:33
has a bunch of different operations may
02:36
be on different records in the database
02:37
they'd like all those operations to be
02:40
sort of a single unit and not split by
02:43
failures or by observation from other
02:45
activities and the transaction
02:50
processing system will require the
02:52
programmer to mark the beginning and the
02:53
end of that sequence of reading and
02:56
writing and updating operations in order
02:58
to mark the beginning and end of the
02:59
transaction and the transaction
03:00
processing system will
03:03
provide certain guarantees about what
03:05
happens between the beginning and the
03:07
end
03:07
so for example supposing we're running
03:11
our bank and we want to do a transfer
03:15
from account of user X to the account of
03:19
user Y now these balances for both
03:21
of them start out as 10 so initially
03:23
x equals 10 y equals 10 and x and y I
03:30
mean to be records in a database and we
03:35
want to transfer we will actually
03:38
imagine that there's two transactions
03:40
that might be running at the same time
03:41
one to transfer a dollar from account X
03:44
to account Y and the other transaction
03:47
to do an audit of of all the accounts at
03:49
the bank to make sure that the total
03:51
amount of money in the bank never
03:53
changes because after all if you do
03:54
transfers you know the total shouldn't
03:56
change even if you move money between
03:58
accounts in order to express this with
04:01
transactions we might have two
04:04
transactions the first transaction call
04:07
it t1 is the transfer well the
04:10
programmer is expected to mark the
04:12
beginning of it with the begin
04:13
transaction which I'll write at the
04:17
beginning and then the operations on the
04:21
two balances on the two records in the
04:23
database so we might add
04:27
one to the balance X and add
04:34
-1 to Y and then we need to mark the end
04:42
of the transaction concurrently we might
04:46
have a transaction that's going to check
04:49
all the balances do an audit of all the
04:50
balances find the sum or look at all the
04:52
balances make sure they add up to the
04:54
number that doesn't change despite
04:56
transfers so the second transaction I'm
04:59
thinking about the audit transaction
05:05
also we need to mark the beginning and
05:07
end this time we're just reading this is
05:13
a read-only transaction we need to get
05:16
the current balances of all the accounts
05:19
let's say there are just these two accounts
05:21
for now so we have two temporary
05:24
variables we're gonna read the first one
05:27
it's going to be the value of balance X
05:32
I'll just write get to mean we're reading
05:34
that record we also read Y and we print
05:40
them both and that's the end of the
05:48
transaction the question is what are
05:55
legal results from these two
05:57
transactions that's the first thing we
05:59
want to establish is what are you know
06:01
given the starting state namely the two
06:03
balances of ten dollars what could
06:06
be the final results after you've run
06:07
both these transactions maybe at the
06:09
same time so we need a notion of what
06:12
would be correct and once we know that
06:14
we need to be able to build machinery
06:17
that will actually be able to execute
06:18
these transactions and get only those
06:23
correct answers despite concurrency and
06:25
failures so first what's correctness
06:28
well databases usually have a notion of
06:33
correctness called ACID
06:38
and in ACID the A stands for atomic
06:44
and this means that a transaction that
06:48
has multiple steps
06:49
you know maybe writes multiple different
06:50
records if there's a failure despite
06:53
failures either all of the writes should
06:55
be done or none of them it shouldn't be
06:58
the case that a failure at an awkward
07:00
time in the middle of a transaction
07:01
should leave half the updates completed
07:04
and visible and half the updates never
07:06
done it's all or nothing so this is all or
07:16
nothing despite failures the C stands for
07:25
consistent it's actually we're not going
07:32
to worry about that that's usually meant
07:35
to refer to the fact that database will
07:38
enforce certain invariants declared by
07:40
the application it's not really our
07:43
concern today the I though it's quite
07:45
important it usually stands for isolated
07:49
and this is a really a property of
07:52
whether or not two transactions that run
07:54
at the same time can see each other's
07:56
changes before the transactions have
07:58
finished whether or not they can see
08:00
sort of intermediate updates and from
08:02
the middle of another transaction and
08:06
the goal is no and the sort of
08:11
technical specific thing that most
08:14
people generally mean by isolation is
08:17
that the transaction execution is
08:19
serializable and I'll explain what that
08:21
means in a bit but it boils down to
08:27
transactions can't see each other's
08:29
changes can't see intermediate states
08:32
but only complete transaction results
08:34
and the final D stands for durable
08:39
and this means that after a transaction
08:42
commits after the client or whatever
08:44
program that submitted the transaction
08:46
gets a reply back from the database
08:49
saying yes
08:49
you know we've executed your transaction
08:52
the D in acid means that the
08:56
transaction's modifications to the database
08:58
will be durable that they'll still be
08:59
there they won't be erased by some
09:02
sort of failure and in practice that
09:06
means that stuff has to be written into
09:08
some non-volatile storage persistent
09:10
storage like a disk and so today and
09:13
in fact for this whole course really our
09:16
concerns are going to revolve around
09:18
good behavior with respect to failure
09:21
good behavior with respect
09:25
to multiple parallel
09:28
activities and making sure that the data
09:31
is still there even if
09:35
something crashes so the most
09:40
interesting part of this for us is the
09:42
specific definition of isolated
09:44
or serializable so I'm going to lay that
09:48
out before talking about how it
09:51
actually applies to these transactions
09:52
so the I isolated usually means serializable and the
10:03
definition for this is if a set of
10:06
transactions executes you know
10:10
concurrently more or less at the same
10:12
time they yield a set of results and
10:16
here the results refer to both the new
10:19
database records created by any
10:21
modifications the transactions might do
10:23
and in addition any output that the
10:26
transactions produce so for our
10:28
transactions these two adds since they
10:30
change records those changed
10:32
records are part of the results and the
10:34
output of this print statement is part
10:35
of the results so the definition of
10:38
serializable says the results are
10:42
serializable
10:47
if there exists some order of execution
10:55
of the transactions so we're gonna say a
11:20
specific execution parallel concurrent
11:23
execution of transactions is
11:24
serializable if there exists some serial
11:28
order really emphasizing serial here a
11:31
serial order of execution of those same
11:34
transactions that yields the same result
11:37
as the actual execution and the
11:39
difference here is the actual
11:40
execution may have had a lot of
11:41
parallelism in it but it's required to
11:46
produce the same result as some one at a
11:48
time
11:49
execution of the same transactions and
11:52
so the way you check whether an
11:54
execution is serializable whether some
11:56
concurrent execution is serializable is
11:59
you look at the results and see if you
12:01
can find actually some one at a time
12:03
execution of the same transactions that
12:06
does produce the same results so for our
12:09
transaction up here there's only two
12:14
orders there's only two one at a time
12:16
serial orders available transaction 1
12:19
then transaction 2 or transaction 2 then
12:22
transaction 1 and so we can just look at
12:25
the results that they would produce if
12:27
executed one at a time in each of these
12:29
orders so if we execute t1 and then t2
12:34
then we get x equals 11
12:42
y equals 9 and this print statement
12:46
since t1 executed first this print
12:49
statement sees these two updated values
12:51
and so it will print the string 11 9 the
12:58
other possible order is that perhaps t2
13:01
ran first and then t1 and in that case
13:06
t2 will see the two records before they
13:09
were modified but the modifications will
13:11
still take place since t1 runs later so
13:14
the final results will again be x equals
13:16
11 y equals 9 but this time t2 saw the
13:22
before values and prints 10 10 so these are the two
13:27
legal results for serializability and if
13:33
we ever see anything else from running
13:34
these two transactions at the same time
13:36
we'll know that the database we're
13:38
running against does not provide
13:39
serializable execution it's doing
13:42
something else and so while we're
13:45
thinking through what would happen if
13:48
this or that we will always be checking
13:50
against these aha these are the only two
13:52
legal results we better be doing
13:55
something that produces one or the other
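As a rough illustration of the definition just given, the sketch below enumerates every one-at-a-time order of the two transactions and collects the results each order yields. The function names (`transfer`, `audit`, `legal_results`) and the dictionary used as the database are assumptions made for this example, not anything from the lecture; the transfer adds 1 to x and -1 to y exactly as written on the board.

```python
from itertools import permutations


def transfer(db, out):
    # T1 as stated in the lecture: add 1 to x, add -1 to y.
    db["x"] += 1
    db["y"] -= 1


def audit(db, out):
    # T2: read both balances and "print" them (recorded in out).
    out.append((db["x"], db["y"]))


def legal_results(transactions, initial):
    """Run every serial order and collect (final records, printed output)."""
    results = set()
    for order in permutations(transactions):
        db, out = dict(initial), []
        for txn in order:
            txn(db, out)
        results.add((tuple(sorted(db.items())), tuple(out)))
    return results


results = legal_results([transfer, audit], {"x": 10, "y": 10})
for final, printed in sorted(results):
    print(dict(final), printed)
```

Any concurrent execution whose final records and output are not in this set is, by the definition above, not serializable.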
13:56
it's interesting to note that there's
13:59
more than one possible result depending
14:02
on the actual order if you
14:04
submit these two transactions at the
14:06
same time you don't know whether it's
14:08
gonna be t1 t2 or t2 t1 so you have to
14:11
be willing to expect more than one
14:13
possible legal result and as you have
14:15
more transactions running
14:16
concurrently or more complicated ones there
14:18
may be many many possible different
14:20
correct results that are all
14:22
serializable because there are many many orders
14:25
here that could be used to fulfill this
14:28
requirement okay so now that we have a
14:34
definition of correctness and we even
14:35
know what all the possible results are
14:37
we can ask a few questions a few
14:42
what-if questions about how these could
14:44
execute so for example suppose that the
14:48
way the system actually executed this
14:50
was that it started transaction 2 and
14:53
got as far as
14:55
just after reading X and then
14:58
transaction one ran at this point and
15:03
then after transaction one finished
15:05
transaction 2 continued executing now it
15:11
turns out that with other
15:13
transactions than this that might
15:15
actually be legal but here we want to
15:18
know if it's legal so we're wondering
15:20
gosh if we actually executed that way
15:22
what results will we get and are they
15:23
the same as either of these two well if
15:27
we execute transaction 1 here then the temporary t1
15:29
is gonna see value 10 the temporary t2 is gonna see
15:32
the value after decrementing Y so t1
15:35
will be 10 t2 will be 9 and what this
15:38
print will show is 10 9 and that's neither of
15:42
these two outputs here so that means
15:45
executing in this way that I just drew
15:47
is not serializable it would not be
15:49
legal another interesting question is
15:55
what if we started executing transaction
15:57
1 and we got as far as just after the
15:59
first add and then at that point all of
16:02
transaction 2 executed right here so
16:08
that would mean at this point X has value
16:10
11 so transaction 2 would read 11 and 10 and
16:17
print 11 10 and 11 10 is not one of these
16:20
two legal values so this execution is
16:22
also not legal for these two
16:23
transactions
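The what-if questions above can be checked mechanically. The sketch below is only an illustration under assumed names: each transaction is modeled as a list of small step functions, every interleaving that preserves each transaction's internal order is executed, and the printed output is compared with the two serial outputs. Since both adds always run, the final records are x = 11 and y = 9 in every schedule, so only the printed output distinguishes legal from illegal schedules here.

```python
from itertools import combinations


def add_x(db, tmp, out):
    db["x"] += 1


def sub_y(db, tmp, out):
    db["y"] -= 1


def read_x(db, tmp, out):
    tmp["t1"] = db["x"]


def read_y(db, tmp, out):
    tmp["t2"] = db["y"]


def show(db, tmp, out):
    out.append((tmp["t1"], tmp["t2"]))


T1 = [add_x, sub_y]            # the transfer
T2 = [read_x, read_y, show]    # the audit


def run(schedule):
    """Execute the steps in order on a fresh database; return what was printed."""
    db, tmp, out = {"x": 10, "y": 10}, {}, []
    for step in schedule:
        step(db, tmp, out)
    return out[0]


# Outputs of the two serial orders: T1 then T2, and T2 then T1.
LEGAL = {run(T1 + T2), run(T2 + T1)}


def interleavings(a, b):
    """All merges of a and b that keep each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


for schedule in interleavings(T1, T2):
    printed = run(schedule)
    names = [step.__name__ for step in schedule]
    print(names, printed, "ok" if printed in LEGAL else "NOT serializable")
```

The two executions discussed above appear in the listing as `read_x, add_x, sub_y, read_y, show` and `add_x, read_x, read_y, show, sub_y`.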
16:35
so the reason why
16:39
serializability is a popular and useful
16:43
definition of what it means for
16:44
transactions to be correct for execution
16:47
of transactions to be correct is that
16:48
it's a very easy model for programmers
16:51
you can write complicated transactions
16:53
without having to worry about what else
16:56
may be running in the system there may
16:57
be lots of other transactions maybe
16:59
using the same data as you maybe
17:00
trying to read and write it at
17:02
the same time there might be failures
17:04
who knows but the guarantee here is that
17:10
it's safe to write your transactions as
17:12
if nothing else was happening because
17:15
the final results have to be as if your
17:19
transaction was executed by itself in
17:22
this one-at-a-time order which is a very
17:24
simple very nice programming model it's
17:28
also nice that this definition allows
17:31
truly parallel execution of transactions
17:34
as long as they don't use the same data
17:36
so we run into trouble here because
17:38
these two transactions are both reading
17:39
x and y but if they were using
17:41
completely disjoint database records
17:44
it turns out this definition
17:46
allows you to build a database system
17:48
that would execute transactions that use
17:51
disjoint data completely in parallel and
17:54
if you have a sharded system which is
17:56
what we're sort of working up to today
17:57
where different data is on
17:59
different machines you can get true
18:01
parallel speed-up because maybe one
18:02
transaction executes in the first
18:04
shard on the first machine and the other
18:06
in parallel on the second machine so
18:09
there are opportunities here for
18:12
good performance before I dig into how
18:17
to implement serializable transactions
18:21
there's one more small point I want to
18:24
bring up it turns out that one of the
18:29
things we need to be able to cope with
18:30
is that transactions may for one reason
18:33
or another
18:34
basically fail or decide to fail in the
18:39
middle of the transaction and this is
18:41
usually called an abort and you know for
18:47
many transaction systems we need to be
18:48
prepared to handle what should happen
18:50
if a transaction tries to access a
18:53
record that doesn't exist or divides by
18:56
zero or maybe you know since some
18:59
transaction implementation schemes use
19:01
locking maybe a transaction causes a
19:03
locking deadlock and the only way to
19:05
break the deadlock is to kill one
19:08
or more of the transactions
19:09
participating in the deadlock so one of
19:13
the things that's going to be kind of
19:14
hanging in the background and will come
19:16
up is the necessity of coping with
19:18
transactions that all of a sudden in the
19:20
middle decide they just cannot proceed
19:22
and you know maybe really in the middle
19:26
after they've done some work and started
19:28
modifying things we need to be able to
19:30
kind of back out of these transactions
19:33
and undo any modifications they've made
19:35
all right
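One common way to back out of a transaction is to remember the old value of every record before modifying it. The sketch below assumes an in-memory dictionary as the database; the `Txn` class, the `Abort` exception, and the undo list are hypothetical names chosen for this example, and a real system would also have to make the undo information survive crashes.

```python
class Abort(Exception):
    """Raised when a transaction decides it cannot proceed."""


class Txn:
    def __init__(self, db):
        self.db = db
        self.undo = []  # (key, old value) pairs, oldest first

    def add(self, key, delta):
        if key not in self.db:
            raise Abort(f"no such record: {key}")
        self.undo.append((key, self.db[key]))  # remember the old value first
        self.db[key] += delta

    def rollback(self):
        # Restore old values in reverse order of modification.
        for key, old in reversed(self.undo):
            self.db[key] = old
        self.undo.clear()


def run(db, body):
    """Run body as a transaction; return True on commit, False on abort."""
    txn = Txn(db)
    try:
        body(txn)
        return True
    except Abort:
        txn.rollback()
        return False


def transfer(txn):
    txn.add("x", 1)
    txn.add("y", -1)  # record y does not exist below, so this aborts


db = {"x": 10}
committed = run(db, transfer)
print(committed, db)
```

Here the transfer modifies x, then aborts because y is missing, and the rollback restores x to its original value.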
19:38
the implementation strategy for
19:40
transactions for these ACID
19:42
transactions I'm gonna split into two
19:46
big pieces and talk about both of
19:48
them the main topics in the lecture the
19:52
first big implementation topic is
19:54
concurrency control this is the main
20:05
tool we use to provide serializability
20:07
or isolation so concurrency
20:10
control
20:16
buys us isolation from other concurrent
20:19
transactions that might be trying to use
20:21
the same data and the other big piece as I
20:24
mentioned is atomic commit and this is
20:28
what's going to help us deal with the
20:31
possibility that oh yeah this
20:34
transaction's executing along and it's
20:35
maybe modified X and then all of a
20:38
sudden there's a failure in one of the
20:41
servers involved but other servers that
20:44
were maybe acting in other parts of
20:45
the transaction that is if x and y are
20:48
in different machines we need to be able
20:50
to recover even if there's a partial
20:52
failure of only some of the machines the
20:55
transaction's running on and the big
20:58
tool people use for that is this atomic
21:00
commit which we'll talk about all right so
21:03
first concurrency control there's really
21:08
two classes two major approaches to
21:11
concurrency control I'll talk about both
21:13
during the course these are just the main
21:20
strategies the first strategy is a
21:22
pessimistic usually called
21:29
pessimistic concurrency control and this
21:32
is usually locking we've all done
21:34
locking in the labs in the context of Go
21:36
programs so it turns out databases and
21:38
transaction processing systems also use
21:40
locking and the idea here is the
21:45
same as what you're quite familiar with
21:46
which is that before a transaction uses any
21:48
data it needs to acquire a lock on that
21:50
data and if some other transaction is
21:52
already using the data the lock will be
21:54
held and we'll have to wait before we
21:57
can acquire the lock wait for the other
21:58
transaction to finish and in pessimistic
22:02
systems if there's locking conflicts
22:04
somebody else has the lock it'll cause
22:06
delays so you're sort of trading
22:09
performance for correctness the other
22:14
main approach is optimistic approaches
22:21
the basic idea here is you don't worry
22:23
about whether maybe some other
22:25
transactions reading or writing the data
22:26
at the same time as you you just go
22:28
ahead and do whatever reads and writes
22:30
you're gonna do although typically into
22:32
some sort of temporary area and then
22:33
only at the end you go and check whether
22:37
actually maybe some other transaction
22:38
might have been interfering and if
22:40
there's no other transaction then you're
22:42
done and you never had to go through any
22:44
of the overhead or waiting of taking
22:46
out locks and locks are reasonably
22:47
expensive to manipulate but if somebody
22:51
else was modifying the data in a
22:54
conflicting way at the same time you
22:56
were then you have to abort that
22:58
transaction and retry and the
23:05
abbreviation for this is often
23:06
OCC for optimistic concurrency control it
23:10
turns out that under different
23:11
circumstances these two strategies one
23:12
can be faster than the other
23:15
if conflicts are very frequent you
23:17
probably actually want to use
23:18
pessimistic concurrency control
23:20
because if conflicts are frequent you're
23:22
gonna get a lot of aborts due to
23:23
conflicts for optimistic schemes if
23:25
conflicts are rare then optimistic
23:27
concurrency control can be faster
23:29
because it completely avoids locking
23:31
overhead today will be all about
23:33
pessimistic concurrency control and then
23:36
some later paper in particular FaRM in a
23:39
couple weeks we'll deal with an
23:41
optimistic scheme okay so today talking
23:48
about pessimistic schemes refers
23:51
basically to locking and in particular
23:53
for today the reading was about
23:54
two-phase locking which is the most
23:57
common type of locking
24:07
and the idea in two-phase locking for
24:10
transactions is that a transaction is gonna
24:12
use a bunch of records like X and Y in our
24:14
example the first rule is that you
24:19
acquire a lock before using any
24:25
piece of data before reading or writing
24:30
any record and the second rule for
24:35
transactions is that a transaction must
24:37
hold any locks it acquires until after
24:40
it commits or aborts you're not allowed
24:43
to give up locks in the middle of the
24:44
transaction you have to hold them all
24:46
you can only accumulate them until
24:48
you're done until after you're done
24:54
so this is two-phase
24:59
locking the phases are the phase in which
25:01
we acquire locks and then the phase in which
25:03
we just hold onto them until we're done
25:07
so for two phase locking to sort of see
25:11
why locking works your typical locking
25:15
systems well there's a lot of variation
25:17
typical locking systems associate a
25:19
separate lock with each record in the
25:21
database with each row in each table for
25:23
example although they can be more
25:25
coarse-grained these transactions start
25:28
out holding no locks let's say
25:29
transaction one starts out holding no
25:31
locks when it first uses X before it's
25:34
allowed to use it it has to acquire
25:35
the lock on X and it may have to wait
25:38
and when it first uses Y it acquires
25:41
another lock the lock on Y when it
25:43
finishes after it's done it releases them
25:45
both if we ran both these transactions
25:48
at the same time they're gonna basically
25:50
race to get the lock on X and whichever
25:53
of them manages to get the lock
25:56
on X first it will proceed and finish
25:59
and commit meantime the other
26:02
transaction that didn't manage to get
26:04
the lock on X first is going to sit there
26:05
waiting before it does
26:08
anything with X until it can acquire the
26:10
lock so if transaction 2 actually got the
26:13
lock first
26:14
it would get the value of X get the
26:16
value of y cuz transaction one hasn't
26:20
gotten at this point hasn't locked Y yet
26:22
it'll print and it will finish and
26:24
release its locks and only then
26:26
transaction one will be able to acquire
26:28
the lock on X and as you can see that
26:30
basically forces a serial order because
26:33
in this case it forced the
26:35
order T two and then when T two finishes
26:38
only then T 1 so it's explicitly
26:43
forcing an order which causes that
26:47
execution to follow the definition of
26:49
serializability that you know really is
26:51
executing T 2 to completion and only
26:54
then T 1 so we do get correct execution
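The two locking rules can be sketched with one lock per record and a transaction object that acquires a record's lock on first use and releases everything only at commit. This is a minimal illustration with assumed names (`Txn`, `run_once`), using Python threads in place of database clients; it has no abort handling and no deadlock detection, and it avoids deadlock only because both transactions happen to touch x before y.

```python
import threading


class Txn:
    def __init__(self, db, locks):
        self.db, self.locks = db, locks
        self.held = []  # keys whose locks this transaction holds

    def _lock(self, key):
        # Rule 1: acquire the record's lock before first using the record.
        if key not in self.held:
            self.locks[key].acquire()
            self.held.append(key)

    def get(self, key):
        self._lock(key)
        return self.db[key]

    def add(self, key, delta):
        self._lock(key)
        self.db[key] += delta

    def commit(self):
        # Rule 2: release locks only once the transaction is finished.
        for key in self.held:
            self.locks[key].release()
        self.held.clear()


def run_once():
    db = {"x": 10, "y": 10}
    locks = {key: threading.Lock() for key in db}
    printed = []

    def t1():  # the transfer
        txn = Txn(db, locks)
        txn.add("x", 1)
        txn.add("y", -1)
        txn.commit()

    def t2():  # the audit
        txn = Txn(db, locks)
        a = txn.get("x")
        b = txn.get("y")
        printed.append((a, b))
        txn.commit()

    threads = [threading.Thread(target=f) for f in (t1, t2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return printed[0], db


outputs = {run_once()[0] for _ in range(200)}
print(outputs)
```

Whichever thread wins the race for the lock on x runs to completion before the other can touch x, so every run should print one of the two serial outputs.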
27:06
all right so one question is why you
27:15
need to hold the locks until the
27:16
transactions completely finished you
27:19
might think that you could just hold a
27:22
lock while you are actually using the
27:24
data and that would be more efficient
27:25
and indeed it would that is you know
27:28
maybe only hold the lock for the period
27:31
of time in which t2 is actually looking
27:34
at record X or maybe only hold the lock
27:36
on X here for the duration of the add
27:39
operation and then immediately release
27:41
it and in that case if
27:43
transaction one immediately released its
27:45
lock on X thereby disobeying this
27:47
rule of course but if it immediately
27:48
released the lock on X then transaction
27:50
two might be able to start a little bit
27:51
earlier we get more concurrency
27:53
higher performance so this rule is
27:55
definitely you know bad for performance
27:57
so we want to make pretty sure that it's
28:00
actually required for
28:02
correctness
28:05
so what would happen if transactions did
28:08
actually release locks as early as
28:11
possible
28:12
so suppose t2 here reads X and then
28:15
immediately releases its lock on X that
28:20
would allow t1 since now at this
28:23
point in the execution t2 doesn't hold
28:26
any locks because it just
28:28
illegally released the lock on X since it
28:31
holds no locks that means t1 could
28:33
completely execute right here and we
28:36
already knew from from before that this
28:40
interleaving is not correct as it
28:42
doesn't produce either of these two outputs
28:45
similarly if t1 released its lock on
28:52
X after it finished adding one to X that
28:55
would allow all of t2 to slip in right
28:57
here and we know also from before that
28:59
that results in illegal results
29:07
there's an additional kind of problem
29:12
that can come up with releasing locks
29:14
after modifying data if t1 were to
29:18
release the lock on X it might allow t2
29:21
to see the modified version of X here to
29:24
see the X after adding 1 to it and to
29:26
print that output and then for t2 to
29:28
complete after printing the incremented
29:31
value of x if transaction one were to
29:33
abort after that point maybe because
29:36
bank balance Y doesn't exist or maybe
29:39
bank balance Y exists but its balance is
29:41
zero and you know we're not allowed to
29:43
decrement 0 for bank balances because
29:46
that's an overdraft so t1 might modify X
29:48
then abort and part of the abort has to
29:51
be undoing its update to X in order to
29:56
maintain atomicity and what that would
29:59
mean if it released the locks is that
30:00
transaction 2 would have seen this sort
30:03
of phantom value of 11 that went away
30:05
because t1 aborted it would have seen a
30:08
value that according to the rules never
30:10
existed right because if
30:13
transaction 1 aborts then it's as if it
30:16
never existed and so that means the
30:18
results from t2 had better be as if t2
30:21
ran by itself without t1 at all but if
30:24
it sees the increment then it's gonna
30:26
print 11 for X 11 10 actually which is
30:31
just doesn't correspond to any state in
30:34
the database given that t1 didn't really
30:37
complete okay so those are
30:42
two dangers two
30:45
violations of serializability that are
30:48
averted because transactions hold the
30:50
locks until they're done a further thing
30:56
to note about these rules is that it's
30:59
very easy for them to produce deadlock
31:01
so you know for example if we have two
31:06
transactions one of them reads record X
31:12
and reads record y
31:15
and the other transaction reads Y and
31:19
then X that's that's just a deadlock if
31:26
they run at the same time each of
31:28
them gets the lock on the record it
31:32
first read they don't release till the
31:34
transactions finish so they both sit
31:37
there waiting for the lock that's held
31:39
by the other transaction and unless the
31:41
database does something clever which it
31:42
will
31:44
they'll deadlock forever and in fact
31:46
transaction systems have various strategies
31:47
including tracing cycles or timeouts in
31:50
order to detect that they've gone into
31:53
this situation and the database will abort
31:54
one of these two transactions and undo
31:56
all its changes and act as if that
31:58
transaction never occurred okay so
32:02
that's concurrency control with
32:04
two-phase locking and this is just
32:12
completely standard database behavior so
32:16
far and it's the same in a single
32:22
machine databases as it will be in
32:24
distributed databases that are a little
32:26
more interesting to us but our next topic
32:30
is actually specific to
32:32
building databases or storage systems in
32:35
general that support transactions in a
32:39
distributed setting that is splitting
32:41
the data over multiple machines so now
32:45
the topic is how to build distributed
32:47
transactions and in
32:53
particular how to cope with failures and
32:56
more specifically the kind of partial
32:58
failures of just one of many servers
33:00
that you often see in distributed
33:02
systems so with distributed
33:04
transactions we're worried about how
33:07
they behave we want to make sure they're
33:09
serializable and also have sort of
33:13
all-or-nothing atomicity even in the
33:15
face of failures so
33:21
you know the way this
33:24
looks like is that we may have two
33:26
servers and we got server one and maybe
33:30
it stores record X in our bank and we
33:33
have server two and maybe it stores
33:35
record Y so they both start out with
33:37
value 10 and we need to run these two
33:41
transactions and transaction 1 of
33:44
course modifies both x and y so now we
33:48
need to send messages to the database servers
33:49
saying oh please increment X please
33:51
decrement Y but it would be easy if we
33:55
weren't careful to get into a situation
33:56
where we had told server 1 to increase
33:59
the balance for X but then something
34:01
failed maybe the client sending the
34:03
requests or maybe server 2
34:05
that's holding Y fails or something and
34:07
we never managed to do the second update
34:10
right so that's one problem a failure
34:14
somewhere may sort of cut the
34:16
transaction in half and if we're not
34:19
careful cause only half of the
34:20
transaction to actually take effect
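The half-applied transfer can be reproduced with a toy model. The sketch below is an illustration only: `Server` is a hypothetical stand-in for a machine holding some records, a `ConnectionError` stands in for a crash or lost message, and `naive_transfer` simply sends the two updates one after the other with no atomic commit protocol.

```python
class Server:
    def __init__(self, name, records, up=True):
        self.name = name
        self.records = dict(records)
        self.up = up

    def add(self, key, delta):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.records[key] += delta


def naive_transfer(s1, s2):
    # No coordination: the first update takes effect before we learn
    # whether the second one can be done at all.
    s1.add("x", 1)
    s2.add("y", -1)


s1 = Server("server1", {"x": 10})
s2 = Server("server2", {"y": 10}, up=False)  # server 2 has failed

try:
    naive_transfer(s1, s2)
except ConnectionError as err:
    print("transfer failed:", err)

total = s1.records["x"] + s2.records["y"]
print(s1.records, s2.records, "total =", total)
```

After the failure x has been incremented but y has not been decremented, so the total is 21 rather than 20, which is exactly the result an audit must never see.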
34:34
this can happen even without crashes if
34:36
server 1 does its part in the transaction it
34:39
could be that over on server 2 server
34:40
2 actually gets the request to
34:42
decrement bank account y but maybe
34:46
server 2 discovers this bank account
34:47
doesn't exist or maybe it does exist and
34:50
its balance is already 0 and it can't
34:52
be decreased and so it can't do its part
34:53
of the transaction but server 1 has
34:56
already done its part of the transaction
34:58
so that's a problem that needs to be
35:00
dealt with so the property we want
35:09
as I mentioned before is that all the
35:11
pieces of the system either all the
35:13
pieces of the system should do their
35:15
part of the transaction or none right so
35:18
you know the thing we
35:20
violated here is atomicity against
35:25
crashes or failures where atomicity
35:33
is all or nothing all parts of the
35:40
transaction that we're trying to execute
35:42
or none of them and the
35:50
kind of solution we're going to be
35:51
looking at is atomic commit or atomic
35:54
commit protocols and the general kind of
35:59
flavor of atomic commit protocols is
36:01
that you have a bunch of computers
36:02
they're all doing different parts of
36:04
some larger task and the atomic commit
36:08
protocol is gonna help the computers
36:10
decide that either they're all going to
36:12
do it they're all capable of doing
36:13
their part and they're actually gonna do
36:15
it or something has gone wrong and
36:17
they're all going to agree that oh
36:19
actually none of them are gonna
36:21
do their part of whatever the
36:23
overall task is and the big challenges
36:26
are of course how to cope with various
36:28
failures machine failures loss of
36:29
messages and it'll turn out that
36:32
performance is also a little bit
36:35
difficult to do a good job with the
36:39
specific protocol we're gonna look at
36:40
and it's the protocol explained in the
36:42
reading for today is two-phase commit
36:52
this is an atomic commitment protocol
36:58
and this is used both by distributed
37:00
databases and also by all kinds of other
37:02
distributed systems that might not at
37:05
first look like traditional databases
37:07
the general setting is we assume
37:10
that in one way or another the task we
37:13
need to perform is split up over
37:15
multiple servers each of which needs to
37:16
do some part a different part each one
37:19
of them so for example in the setup
37:22
I showed here it's
37:24
really the data that's split up and so the
37:26
tasks being split up are incrementing X
37:28
and decrementing Y. We're going to
37:34
assume that there's one computer that's
37:38
driving the transaction called the
37:40
transaction coordinator there's lots of
37:55
ways of arranging how the transaction
37:57
coordinator steps in but we'll just
37:59
imagine it as a computer that is
38:00
actually running the transaction there's
38:03
one computer the transaction coordinator
38:04
that's executing the sort of code
38:06
for the transaction like the puts and
38:08
the gets and the adds and it sends
38:11
messages to the computers that hold the
38:14
different pieces of data that need to
38:16
actually execute the different parts so
38:18
for our setup we're going to have one
38:21
computer, the transaction coordinator,
38:23
and then there are server one
38:28
and server two that hold X and Y. The transaction
38:33
coordinator will send a message to
38:34
server one saying oh please increment X
38:36
send a message to server two saying oh
38:38
please decrement Y and then there'll be
38:40
more messages in order to make sure that
38:42
either they both do it or neither of them
38:44
does it and that's where two-phase commit
38:46
steps in. Something to keep in the back
38:50
of your mind is that in the full system
38:52
there may be many different transactions
38:53
running concurrently and many
38:55
transaction coordinators
38:57
sort of executing their own transactions
39:00
and so the various parties here need to
39:03
keep track of oh you know this is a
39:04
message for such-and-such a transaction
39:06
and where they keep state, like it
39:09
turns out these servers are going to
39:10
maintain tables of locks for example,
39:12
when they keep state like that they need
39:14
to keep track of oh this is a lock
39:15
that's being held for transaction 17 so
39:18
there's a notion of transaction IDs and
39:28
I'm just gonna assume although you know
39:31
I'm not actually going to show it, that every
39:33
message in the system is tagged with the
39:35
unique transaction
39:37
ID of the transaction it applies to and
39:39
these IDs are chosen by the transaction
39:41
coordinator when the transaction starts
39:43
the transaction coordinator will send
39:44
out oh this is a message for transaction
39:47
95 and all the state it keeps here
39:51
about the transaction will be tagged
39:52
with 95 and the various tables in the
39:57
different participants in the
39:59
transaction will be tagged with the
40:01
transaction IDs and so that's another
40:04
piece of terminology we got the
40:05
transaction coordinator and then the
40:07
other servers that are doing parts of
40:11
the transaction are called participants
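As a rough sketch of the transaction-ID tagging just described, here is one way the messages and a participant's lock table could look. The names `Message`, `Coordinator.begin`, and `Participant.handle` are made up for illustration and are not from the lecture; a real participant would make a conflicting transaction wait rather than raise an error.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Message:
    """Every message carries the ID of the transaction it belongs to."""
    tid: int                      # transaction ID, chosen by the coordinator
    kind: str                     # e.g. "get", "put", "prepare", "commit"
    key: Optional[str] = None
    value: Optional[int] = None


class Coordinator:
    """Hypothetical coordinator that hands out unique transaction IDs."""

    def __init__(self) -> None:
        self._tids = count(1)

    def begin(self) -> int:
        return next(self._tids)


class Participant:
    """Hypothetical participant whose lock table is tagged with transaction IDs."""

    def __init__(self) -> None:
        self.locks = {}           # record key -> tid that holds the lock

    def handle(self, msg: Message) -> None:
        if msg.kind == "put":
            holder = self.locks.setdefault(msg.key, msg.tid)
            if holder != msg.tid:
                # Simplification: a real system would block here instead.
                raise RuntimeError(f"{msg.key} is locked by transaction {holder}")


tc = Coordinator()
t1, t2 = tc.begin(), tc.begin()
server1 = Participant()
server1.handle(Message(tid=t1, kind="put", key="x", value=11))
```

Because the lock table records which transaction holds each lock, the participant can later release exactly the locks that belong to the transaction named in a commit or abort message.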
40:20
all right
40:21
so let me draw out the two-phase commit
40:24
protocol example execution so I'll
40:28
abbreviate this to 2PC for two-phase
40:32
commit the parties involved are the
40:37
transaction coordinator and we'll just
40:40
say there's two participants that is you
40:42
know maybe we're executing the
40:43
transaction I've shown and X and Y
40:44
are on different servers, maybe we've got
40:48
participant a and participant B these
40:53
are two different servers holding data
40:57
so the transaction coordinator it's
40:59
running the whole transaction, it's
41:01
gonna send puts and gets to a and B to
41:03
tell them to you know read the value of
41:06
x or y or add one to X so we're going to
41:09
see at the beginning of the
41:11
transaction that the transaction coordinator
41:12
is sending for example maybe a get
41:15
request to participant A and it
41:19
gets a reply and then maybe it sends
41:21
a put for whatever. We might see a long
41:27
sequence of these if there's a
41:29
complicated transaction then when
41:33
transaction coordinator gets to the end
41:35
of the transaction and wants to commit
41:38
it and be able to you know release all
41:40
those locks and make the transactions
41:42
results visible to the outside world and
41:44
maybe reply to a client or a human user
41:47
so we're assuming there's a sort of
41:49
external client or human that said oh
41:52
please run this transaction and it's
41:54
waiting for a response before we can do
41:56
any of that the transaction
41:59
coordinator has to make sure that all
42:02
the different participants can actually
42:04
do their part of the transaction and in
42:07
particular if there were any puts in the
42:08
transaction we need to make sure that
42:11
the participants who are doing those
42:14
puts well are actually still capable of
42:16
doing the puts so in order to find that
42:19
out the transaction coordinator sends
42:22
prepare messages to all of the
42:32
participants so we're going to send prepare
42:35
messages to both A and B
42:41
and when A or B receives a prepare
42:44
message you know they know the
42:45
transaction is nearing completion but
42:47
not over yet
42:49
they look at their state and decide
42:51
whether they are actually able to
42:52
complete the transaction you know maybe
42:54
they needed to abort it to break a deadlock
42:56
or maybe they've crashed and restarted
42:58
in between, you know, when they did the
43:02
last operation and now and they've
43:04
completely forgotten about the
43:05
transaction and can't complete it so a
43:07
and B you know look at their state and
43:08
say oh I'm going to be able to or I'm
43:10
not gonna be able to do this transaction
43:11
and they respond with either yes or no
43:24
so the transaction coordinator is
43:28
waiting for these yes or no votes from
43:31
each of the participants if they all say
43:35
yes then the transaction can commit
43:42
nothing goes wrong the transaction can
43:45
commit and the transaction coordinator
43:47
sends out a commit message to each of
43:57
the participants and then the
44:02
participants usually reply with an
44:05
acknowledgement saying yes we now know
44:07
the outcome; this is called the acknowledgement
44:10
all right so the transaction
44:14
coordinator sends out prepares; if all the
44:17
participants say yes it can commit; if
44:19
any of them, even a single one,
44:21
says no actually I cannot complete this
44:24
transaction because I had a failure or
44:27
there was an inconsistency like a
44:29
missing record and I have to abort, if even
44:32
a single participant says no at this
44:34
point then the transaction coordinator
44:36
won't commit it'll send out a round of
44:38
abort messages saying oops please
44:41
retract this transaction. Either way,
44:47
after the commit sort of two things
44:51
happen of interest to us
44:52
one is the transaction coordinator will
44:54
send whatever the transaction's output is
44:57
to the client or human that requested it
44:59
and say look oh yes the transaction's
45:00
finished and so now if it didn't abort it's
45:03
committed, it's durable
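The failure-free exchange described so far (prepare, yes/no votes, then commit or abort to everyone) can be sketched as below. This is only an illustration under simplifying assumptions: messages are plain method calls, nothing crashes, and the class and function names are invented here rather than taken from the lecture.

```python
class Participant:
    """Hypothetical participant that votes and then records the outcome."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # False models e.g. a missing record
        self.outcome = {}              # tid -> "commit" or "abort"

    def prepare(self, tid):
        return "yes" if self.can_commit else "no"

    def commit(self, tid):
        self.outcome[tid] = "commit"
        return "ack"

    def abort(self, tid):
        self.outcome[tid] = "abort"
        return "ack"


def run_two_phase_commit(tid, participants):
    """Coordinator side: commit only if every participant votes yes."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare(tid) for p in participants]
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    # Phase 2: tell every participant the single decision.
    for p in participants:
        if decision == "commit":
            p.commit(tid)
        else:
            p.abort(tid)
    return decision


a, b = Participant("A"), Participant("B")
first = run_two_phase_commit(1, [a, b])

b_cannot = Participant("B", can_commit=False)
second = run_two_phase_commit(2, [a, b_cannot])
```

In the second run a single no vote is enough to make the coordinator abort, and A aborts as well even though it voted yes.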
45:04
the other interesting thing is that in
45:07
order to obey these locking rules the
45:12
participants unlock when they see either
45:15
commit or an abort and indeed in order
45:23
to obey the two phase locking rule each
45:27
participant locked any data that it read
45:32
as part of doing its part of the
45:33
transaction so we're imagining that in
45:35
each participant there's a table of the
45:37
locks associated with the data stored at
45:40
that participant and the participant
45:43
sort of lock things in those tables
45:44
remember oh this is you know this piece
45:47
of data this record is locked for
45:49
transaction 29 and when finally
45:51
the commit or abort comes back for
45:52
transaction 29 the participant
45:55
unlocks that data and then other
45:57
transactions can use it so we may have to
45:59
wait here and this unlock may unblock
46:02
other transactions that's really part of
46:06
the serializability machinery so you
46:13
know so far the reason why this is
46:14
correct basically is that the if
46:19
everybody's following this protocol
46:20
and there are no failures then the two
46:23
participants only commit if both of them
46:25
commit and if either of them can't commit, if
46:29
either of them has to abort, then they both
46:30
abort so we get that either they all do
46:33
it or none of them do it result that we
46:36
wanted the atomicity result with this
46:40
protocol so far without without thinking
46:43
about failures and so now our job is to
46:47
think through in our head all sort of
46:49
the different kinds of failures that
46:50
might occur and figure out whether the
46:53
protocol still provides atomicity either
46:56
both do it or neither do it in the face
46:59
of these failures and how we have to
47:01
adjust or extend the protocol in order
47:05
to cause it to do the right thing so the
47:07
first thing I want to
47:08
consider is what if B crashes and
47:11
restarts
47:11
I mean power failure or something, B
47:15
just suddenly stops executing and
47:17
then power's restored and it's brought
47:20
back to life and runs maybe some
47:22
sort of recovery software as part of the
47:26
transaction processing system well
47:28
there's really two scenarios we have to
47:32
worry about one is B might have crashed
47:35
before sending its yes message back
47:41
so if B crashed before sending its yes
47:44
message back then it never said yes so
47:48
the transaction coordinator couldn't
47:50
possibly have committed or be about to
47:53
commit because it has to wait for a yes
47:55
from all participants so if B can
47:57
convince itself that it could not
47:58
possibly have sent a yes back, that is, it
48:02
crashed before sending the yes, then B is
48:04
entitled to unilaterally abort the
48:06
transaction itself and forget about it
48:09
because it knows the transaction
48:11
coordinator can't possibly commit it so
48:15
there's you know a number of ways of
48:19
implementing this one possibility is
48:21
that all of B's information about
48:23
transactions that haven't reached this
48:25
point is in memory and is simply lost if
48:27
B crashes and reboots so B just won't
48:30
know anything about transactions that
48:31
haven't sent yes back yet and
48:35
then if the transaction coordinator
48:37
sends a prepare message to a participant
48:39
that doesn't know anything about the
48:41
transaction because it crashed before
48:42
sending yes, the participant will say no
48:45
no I cannot possibly agree to that you
48:47
know please abort
48:51
okay but of course maybe B crashed after
48:55
sending a yes back so that's a little
49:00
more tricky so in this
49:02
case B gets a prepare, it's
49:05
happy it says yes I'm going to commit
49:07
and then it crashes before it gets the
49:09
commit message from the transaction
49:12
coordinator well now
49:14
we're in a totally different situation B
49:16
has promised to commit if told to do so
49:19
because it sent a yes back and for all it
49:21
knows and indeed the most likely thing
49:23
that's happening is the transaction
49:24
coordinator got yeses from A and B and
49:26
sent a commit message to A so that A
49:28
actually will do its part of the
49:31
transaction and make it permanent and
49:32
release locks and in that case in order
49:35
to honor all or nothing we're absolutely
49:37
required, if B should crash at this point,
49:39
that on recovery it be still
49:42
prepared to complete its part of the
49:44
transaction
49:45
it doesn't actually know at that point
49:46
whether you know because it hasn't
49:48
received the commit yet, whether
49:50
it should commit or not but it must
49:51
still be prepared to commit and what
49:53
that means the fact that we can't lose
49:57
the state for a transaction across
49:59
crashes and reboots
50:01
is that before B replies to a prepare it
50:07
must make the transaction state this
50:13
sort of intermediate transaction state
50:14
the memory of all of the changes that's
50:16
made which may have to be undone if
50:17
there's an abort plus the record of all
50:20
the locks the transaction held
50:22
it must make that durable on disk and
50:26
it's almost always in a log on
50:28
disk so before B replies yes before B
50:33
sends the yes in reply to a prepare
50:35
message it first must write to disk in
50:39
its log all the information required to
50:42
commit that transaction that is all the
50:44
new values produced by put plus a full
50:48
list of locks on the disk or some other
50:51
persistent memory before replying with
50:53
yes and then if
50:55
B should crash after sending yes then as
50:58
part of recovery when it restarts it'll
50:59
look at its log and say oh gosh I
51:01
was in the middle of a transaction I had
51:03
replied yes for transaction 92 I mean
51:06
you know here's all the modifications I
51:07
should make if committed and all the
51:09
locks I held
51:10
I better restore that state and then
51:13
when it finally gets a commit or an
51:15
abort it'll know from having read its
51:17
log how to actually finish its part of
51:20
the transaction so so this is an
51:23
important thing I left out of the
51:24
original laying out of this protocol is
51:29
that B must write to its disk at this
51:32
point
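The participant-side rule just described, log first and only then say yes, might be sketched like this. It is an illustration under stated assumptions: a Python list called `disk_log` stands in for a durable on-disk log, the record layout is invented, and locks are simply the keys the transaction wrote.

```python
class Participant:
    """Hypothetical participant that logs before voting yes."""

    def __init__(self, disk_log=None):
        # disk_log stands in for a log file that survives crashes.
        self.disk_log = [] if disk_log is None else disk_log
        self.pending = {}    # volatile: tid -> {key: new value}; lost on crash
        self.prepared = {}   # tid -> log record for transactions we voted yes on
        # Recovery: rebuild the prepared transactions from the durable log.
        for record in self.disk_log:
            self.prepared[record["tid"]] = record

    def put(self, tid, key, value):
        self.pending.setdefault(tid, {})[key] = value

    def on_prepare(self, tid):
        if tid in self.prepared:
            return "yes"             # a re-sent prepare for a logged transaction
        writes = self.pending.pop(tid, None)
        if writes is None:
            return "no"              # unknown transaction, e.g. lost in a crash
        record = {"tid": tid, "writes": writes, "locks": sorted(writes)}
        self.disk_log.append(record)  # must be durable BEFORE the yes is sent
        self.prepared[tid] = record
        return "yes"


disk = []
b = Participant(disk)
b.put(92, "y", 9)
vote_92 = b.on_prepare(92)      # logged, then yes
b.put(93, "z", 5)               # transaction 93 never reaches prepare

restarted = Participant(disk)   # simulate a crash and reboot of B
vote_93 = restarted.on_prepare(93)
```

After the simulated restart, transaction 92 is still prepared with its new values and locks, while transaction 93, which only lived in memory, is gone and gets a no vote.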
51:34
and this is part of what makes two-phase
51:36
commit a little bit slow is that there's
51:39
this necessary persisting of
51:41
information here okay so we also have to
51:47
worry about okay and you know the final
51:50
place I guess where you might crash is
51:51
B might crash after
51:54
receiving the commit, or
51:58
it might crash after actually
52:00
processing the commit, but in that
52:02
case it's made modifications that the
52:06
transaction means to make permanent in
52:08
its database presumably also on disk
52:12
after it received a commit
52:15
message and in that case there's maybe
52:16
not anything to do if it restarts
52:18
because the transaction is finished so
52:20
when B receives the commit message it
52:23
probably copies the
52:28
modifications from its log on to its
52:29
permanent storage, releases its locks,
52:32
erases the information about the
52:34
transaction from its log and then
52:36
replies and of course we have to worry
52:38
about you know what if it receives a
52:40
commit message twice probably the right
52:43
thing to do is either for B to remember
52:45
about the transaction, but that takes memory,
52:48
so it turns out that if B simply forgets
52:51
about committed transactions that it's
52:53
made durable on disk it can reply to a
52:56
repeated commit message if it doesn't
52:58
know anything about that transaction by
53:00
simply acknowledging it again and
53:03
that'll be important a little bit
53:04
later on ok so that's the story if one
53:08
of the participants crashes at various
53:10
awkward points what about the
53:12
transaction coordinator it's also just a
53:14
single computer so you know if it
53:16
fails it might be a problem okay so again
53:26
the critical where things start getting
53:29
critical is if any party might have
53:32
committed then we cannot forget about
53:36
that. If either of these participants
53:39
might have committed or if the
53:41
transaction coordinator might have
53:43
replied to the client then we cannot
53:47
have that transaction go away right if A
53:50
has committed, maybe the transaction
53:52
coordinator sent out a commit
53:54
message to A but hadn't gotten around to
53:56
sending a commit message to B and crashes
53:58
at that point the transaction
53:59
coordinator must be prepared on restart
54:02
to resend the commit messages to make
54:05
sure that both parties know that the
54:08
transaction is committed so okay so you
54:14
know whether that matters depends on
54:16
where the transaction coordinator
54:17
crashes if it crashes before sending
54:20
commit messages it doesn't really matter
54:21
neither party if you know since the
54:24
transaction coordinator didn't send
54:26
commit messages before crashing it can
54:29
just abort the transaction and if either
54:33
participant asks about that transaction
54:35
because they you know see it's in their
54:36
log but they never got a commit message
54:38
the transaction coordinator can say I
54:40
don't know anything about that
54:41
transaction it must have been aborted
54:43
possibly due to a crash so that's what
54:46
happens if the transaction coordinator
54:47
crashes before the commit but if it
54:50
crashes after sending one or more
54:52
commit messages then the
54:59
transaction coordinator can't be allowed to
55:02
forget about the transaction and what
55:05
that means is that at this point,
55:08
after the transaction coordinator
55:09
has made its commit versus abort
55:11
decision on the basis of these yes/no
55:13
votes before sending out any commit
55:16
messages it must first write information
55:20
about the transaction to its log in
55:22
persistent storage like a disk that will
55:26
still be there if it crashes and
55:27
restarts so the transaction coordinator
55:30
after it receives a full set of yeses or
55:32
noes writes the outcome and the
55:35
transaction ID to its log on disk and
55:38
only then it starts to send out commit
55:40
messages and that way if it crashes at
55:41
any point maybe before it sent the first
55:45
commit message or after it sent one or
55:47
maybe even after it sent all of them if it
55:49
crashes at that point its recovery software
55:51
will see in the log aha I was in the
55:53
middle of a transaction the transaction
55:55
was either known to have been committed
55:57
or aborted
55:59
and as part of recovery it will resend
56:01
commit messages to all the participants
56:04
or abort messages whatever the decision
56:06
was in case it hadn't sent them before
56:10
it crashed and that's one reason why the
56:12
participants have to be prepared to
56:14
receive duplicated commit messages okay
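The coordinator-side rule, write the decision to the log before sending any commit message and resend after a restart, could look roughly like this. As before, a Python list stands in for the on-disk log and all names are hypothetical; answering "abort" for an unknown transaction follows the reasoning above that the coordinator cannot have committed something it has no log record for.

```python
class Coordinator:
    """Hypothetical coordinator that logs its decision before announcing it."""

    def __init__(self, disk_log=None):
        # disk_log stands in for persistent storage that survives a crash.
        self.disk_log = [] if disk_log is None else disk_log

    def decide(self, tid, votes):
        outcome = "commit" if votes and all(v == "yes" for v in votes) else "abort"
        # The decision must be durable BEFORE any commit message is sent.
        self.disk_log.append((tid, outcome))
        return outcome

    def outcomes_to_resend(self):
        """On restart: every logged decision gets (re)sent to the participants."""
        return dict(self.disk_log)

    def query(self, tid):
        """A participant asks about tid; no log record means it never committed."""
        return dict(self.disk_log).get(tid, "abort")


def participant_on_outcome(prepared, tid):
    """Participant side: duplicates are harmless, just acknowledge again."""
    prepared.pop(tid, None)    # finish and forget the transaction if still known
    return "ack"


disk = []
tc = Coordinator(disk)
tc.decide(95, ["yes", "yes"])

recovered = Coordinator(disk)           # simulate a coordinator crash and restart
resend = recovered.outcomes_to_resend()

prepared_at_b = {95: {"writes": {"y": 9}}}
acks = [participant_on_outcome(prepared_at_b, 95),   # first commit message
        participant_on_outcome(prepared_at_b, 95)]   # duplicate after recovery
```

The second call to `participant_on_outcome` finds nothing left to do and simply acknowledges again, which is what makes resending safe.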
56:27
so there's some other so those are the
56:31
main crash stories we also have to worry
56:34
about what happens if messages are lost
56:35
in the network you might send a message
56:37
maybe the message never got there you
56:39
might send a message and be waiting for
56:40
a reply maybe the reply was sent but the
56:44
reply was dropped so any one of these
56:45
messages may be dropped and we need to
56:47
think through what to actually do in
56:52
each of these cases so for example
56:56
supposing the transaction coordinator
56:57
sent out prepare messages but hasn't
57:00
gotten some of the yes or no replies
57:02
from participants what are the
57:04
transaction coordinators options at that
57:06
point well one thing it could do is send
57:08
out a new set of prepare messages saying
57:11
you know I didn't get your answer please
57:13
tell me your answer yes or no and you
57:15
know it could keep on doing that for a
57:17
while but if one of the participants is
57:20
down for a long time we don't want to
57:21
sit there waiting with locks held right
57:24
because you know supposing a is
57:27
unresponsive but B is up, but because
57:30
we haven't committed or aborted B
57:31
is still holding locks and that may
57:33
cause other transactions to be waiting
57:35
so we don't want to wait forever if we
57:37
can possibly avoid it so if the
57:39
transaction coordinator hasn't gotten
57:41
yes or no responses after some amount of
57:43
time from the participants then it can
57:47
simply unilaterally decide we're gonna
57:49
abort this transaction because it knows
57:52
since it didn't get a full set of yes or
57:54
no messages that of course it can't
57:55
possibly have sent a commit yet so no
57:57
participant could have committed so it's
58:00
always valid to abort if the transaction
58:03
coordinator hasn't yet committed so the
58:05
transaction coordinator times out
58:07
waiting for yeses or noes, maybe messages
58:09
were lost or somebody crashed or
58:11
something
58:12
it can just decide alright we're
58:13
aborting this transaction we'll send out
58:15
a round of abort messages and if some
58:17
participant comes back to life and says
58:19
oh you know I didn't hear back from you
58:21
about transaction 95 the transaction
58:25
coordinator will say oh well I don't
58:26
know anything about transaction 95
58:28
because it aborted it and erased its
58:30
State for that transaction and it will
58:32
tell the participant you know you should
58:35
abort this transaction too similarly if
58:42
one of the participants times out
58:44
waiting for the prepare here then you
58:47
know if a participant hasn't received a
58:49
prepare that means it hasn't sent a yes
58:51
message back and that means the
58:53
coordinator can't possibly have sent any
58:54
commit messages
58:55
so if a participant times out here
58:58
waiting for the prepare it's also
58:59
always allowed to just bail out and
59:03
decide to abort the transaction and if
59:05
at some future time the transaction
59:07
coordinator comes back to life and sends
59:09
out prepare messages then B will say no
59:11
I don't know anything about that
59:12
transaction so I'm voting no and that's
59:15
okay because it can't possibly have
59:16
started to commit anywhere so
59:19
again if something goes wrong with the
59:21
network or the transaction coordinator
59:22
is down for a while
59:24
and the participants are still waiting
59:26
for prepares it's always valid for
59:29
participants to abort and thereby
59:31
release the locks that other
59:32
transactions may be waiting for and that
59:34
can be very important in a busy system
59:39
so that's the good news about if the
59:44
participants or the transaction
59:45
coordinators time out waiting for
59:47
messages from the other parties however
59:52
suppose participant B has received a
59:56
prepare and sent its yes and so is
60:00
somewhere around here but it hasn't
60:01
received a commit and it's waiting and
60:03
waiting and it hasn't gotten the commit
60:05
back maybe something's wrong with the
60:06
network maybe the transaction
60:08
coordinator's network connection
60:10
has fallen out or its power's failed or
60:13
something but for whatever reason B has
60:14
waited a long time and it still hasn't
60:15
heard a commit now B is sitting
60:18
there holding locks, it's still holding on
60:19
to those locks for all the records that
60:21
were used in its part of the
60:22
transaction and that means other
60:24
transactions may be also
60:25
blocked waiting for those locks to be
60:27
released so we're like pretty eager to
60:30
abort if we possibly can and release the
60:32
locks and so the question is if B has
60:35
received prepare and replied with yes
60:37
is it entitled to unilaterally abort
60:40
after it's waited say you know 10
60:42
seconds or 10 minutes or something to
60:45
get the commit message and the answer to
60:48
that unfortunately is no in this region
60:54
after receiving the prepare, or
60:56
really after sending the yes, and before
60:58
getting the commit, if you time out
61:01
waiting for the commit you're not
61:06
allowed to abort you must keep waiting
61:08
you must, as it's usually called, block so in this
61:12
region of the protocol if you don't
61:14
receive the commit you have to wait
61:15
indefinitely and the reason is that
61:17
since B sent back a yes that means the
61:21
transaction coordinator may have
61:22
received the yes it may have received
61:24
yes from all of the participants and it
61:26
may have started sending out commit
61:28
messages to some of the participants and
61:30
that means that a may have actually seen
61:32
the commit message and committed and
61:34
made its changes permanent and unlocked
61:35
and shown the changes to other
61:37
transactions and since that could be the
61:39
case for all B knows in this region of
61:42
the protocol B cannot unilaterally
61:44
decide to abort if it times out it must
61:47
wait indefinitely to hear from the
61:49
transaction coordinator as long as it
61:51
takes some human may have to come and
61:54
repair the transaction coordinator and
61:56
finally get it started again and have it
61:57
read its log and see oh yes I
62:00
committed that transaction and finally
62:02
send the long-delayed commit messages and
62:13
similarly if on a timeout you
62:23
can't unilaterally abort it turns out
62:25
you can't unilaterally commit either
62:27
because for all B knows a might have
62:29
voted no but B just hasn't got the
62:31
abort message yet so in
62:33
this region you can neither abort nor
62:35
commit
62:36
on a timeout and so this actually this
62:44
this blocking behavior is sort of
62:47
critical property of two-phase commit
62:51
and it's not a happy property
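The timeout rules discussed above can be collected into two small functions. This is a sketch with invented state names (`WORKING`, `PREPARED`, `DONE`), not terminology from the lecture: a participant that has not yet voted yes may abort on its own, a participant that has voted yes must block, and a coordinator that has not yet logged a decision may abort.

```python
from enum import Enum


class ParticipantState(Enum):
    WORKING = "working"     # has not yet replied yes to a prepare
    PREPARED = "prepared"   # replied yes, waiting for commit or abort
    DONE = "done"           # saw the outcome and finished its part


def participant_on_timeout(state: ParticipantState) -> str:
    """What a participant may do when it gives up waiting for the coordinator."""
    if state is ParticipantState.WORKING:
        # No yes was sent, so the coordinator cannot have committed.
        return "abort"
    if state is ParticipantState.PREPARED:
        # The coordinator may already have told others to commit (or to abort),
        # so neither a unilateral abort nor a unilateral commit is safe.
        return "block"
    return "nothing"


def coordinator_on_timeout(decision_logged: bool) -> str:
    """What the coordinator may do when replies are missing."""
    # Before a decision is logged no commit message was sent, so abort is safe.
    # After a decision is logged it can only keep resending that decision.
    return "resend_decision" if decision_logged else "abort"


before_vote = participant_on_timeout(ParticipantState.WORKING)
after_vote = participant_on_timeout(ParticipantState.PREPARED)
```

The `"block"` case is the unhappy one: the participant keeps its locks until the coordinator's decision finally arrives.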
62:53
it means if things go wrong you can
62:56
easily be in the situation where you
62:58
have to wait for a long time with locks
62:59
held and holding up other transactions
63:01
and so among other things people try
63:05
really hard to make this part of
63:08
two-phase commit as fast as humanly
63:10
possible so that the window of time in
63:13
which a failure might cause you to block
63:17
with locks held for a long time is as
63:20
small as possible so they try to make
63:22
this part of the protocol very
63:23
lightweight or even have variants of the
63:26
protocols that for certain special cases
63:27
may not have to wait at all okay so
63:33
that's the basic protocol one thing to
63:37
notice about this that is a fundamental
63:41
part of why we're able to get to
63:44
actually build a protocol that allows a
63:46
and B to sort of both you know they both
63:49
commit or they both abort one
63:53
reason for that is that really the
63:54
decision is made by a single entity it's
63:56
made by the transaction coordinator
63:58
alone. Neither of them, you
64:01
know, except that they may vote no, neither A
64:05
nor B is deciding whether to commit or
64:09
not and they certainly are not engaged
64:11
in a conversation with each other to try
64:13
to reach agreement about what is the
64:15
other thinking, are they thinking commit,
64:17
maybe I'll commit too. Instead we have
64:19
this quite sort of fundamentally
64:22
simple protocol in which only the
64:25
transaction coordinator makes the
64:27
decision a single entity and it just
64:29
tells the other parties here's my decision
64:31
please go do it the penalty for that for
64:38
having the transaction coordinator
64:39
really the single entity make the final
64:42
decision again is the fact that you have
64:45
to block there's some points in which
64:46
you have to block waiting for the
64:47
transaction coordinator to
64:49
tell you what the decision
64:50
was one further question is that we know
64:58
the transaction coordinator must
64:59
remember information about transactions
65:02
in its log in case it crashes and so
65:05
one question is when the transaction
65:06
coordinator can forget about information
65:10
in its log about transactions and the
65:11
answer to that is that if it manages to
65:14
get a full set of acknowledgments from
65:16
the participants then it knows that all
65:18
the participants know that that
65:19
transaction committed or aborted, that
65:22
all the participants
65:24
know the fate of that transaction and
65:25
have done their part in it and will
65:27
never need to know that information
65:29
right as they both acknowledged it so
65:31
when the transaction coordinator gets
65:33
acknowledgements it can erase all
65:35
information all memory of the transaction
65:39
similarly participants once they
65:42
received a commit or abort message and
65:44
done their part of the transaction and
65:46
made their updates permanent and
65:48
released their locks at that point the
65:50
participants also can completely forget
65:53
about that transaction after they send
65:57
their acknowledgment back to the
65:59
transaction coordinator now of course
66:01
the transaction coordinator may not get
66:03
their acknowledgement and
66:05
may therefore decide to resend the
66:07
commit message on the theory that maybe
66:09
it was lost and in that case a
66:11
participant if it receives a commit
66:13
message for a transaction which it knows
66:14
nothing about because it's forgotten
66:16
about it then the participant can just
66:21
send another acknowledgement back
66:22
because it knows that if it gets a commit
66:25
message for an unknown transaction it
66:27
must be because it had forgotten about
66:28
it because it already knew whether it
66:30
committed or aborted okay so that's
66:37
two-phase commit for atomic commitment
66:41
for a little perspective two-phase
66:44
commit is used in a lot of sharded
66:47
databases that have split up their data
66:50
among multiple servers and it's used
66:54
specifically in databases or storage
66:58
systems that need to support
67:00
transactions in which
67:03
multiple
67:03
records may be read or written there are
67:06
some more specialized storage
67:09
systems that don't allow you to have
67:12
transactions on multiple records and for
67:15
them you don't need this
67:17
kind of thing, you don't need two-phase commit
67:18
if the storage system doesn't allow
67:22
multi record transactions but if you
67:24
have multi record transactions and you
67:26
shard the data across multiple servers
67:28
then you need to support
67:30
two-phase
67:31
commit if you want to get ACID
67:34
transactions
67:36
however two-phase commit has an evil
67:39
reputation one reason is it's slow due
67:43
to multiple rounds of messages there's a
67:45
lot of chitchat here in order to get a
67:48
transaction that involves multiple
67:50
participants to finish there's in
67:53
addition a lot of disk writes both A and
67:55
B have to not just write data to their
67:58
disk between the prepare and the sending
68:01
of the yes they have to wait for that
68:02
disk write to finish so certainly if
68:04
you're using a mechanical Drive that
68:06
takes 10 milliseconds to append to the
68:09
log that puts a real serious limit on
68:11
how fast participants can process
68:14
transactions you know 10 milliseconds a
68:16
pop means you know without some cleverness
68:19
you're limited to 100 transactions per
68:21
second which is pretty slow and in
68:23
addition the transaction coordinator
68:25
also has a point in which it must after
68:28
it receives the last yes it must first
68:30
write to its log make sure the data is
68:33
safe on disk and only then is it
68:35
allowed to send the commit messages and
68:38
that's another 10 milliseconds and both
68:41
of these are 10 millisecond periods in
68:43
which locks are held in the participants
68:45
and other transactions are slowed up and
68:47
I keep mentioning that but it's very
68:48
important because in a busy transaction
68:51
processing system there's lots and lots
68:53
of transactions and many of them may be
68:55
waiting for the same data and we'd
68:57
really prefer not to hold locks over
69:01
long periods of time in which there's
69:02
lots of messages going back and forth
69:04
and we have to wait for long disk writes
69:06
but two-phase commit forces us to do
69:09
those waits
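The arithmetic behind those numbers can be sketched as follows. This is only a back-of-the-envelope model: the 10 millisecond log append is the figure from the lecture, while the function names and the 0.5 millisecond one-way message delay are assumptions made up for illustration.

```python
def max_serial_tx_per_second(log_append_ms: float) -> float:
    """Upper bound on transactions per second when each transaction must
    wait for one synchronous log append and nothing is batched."""
    return 1000.0 / log_append_ms


def min_lock_hold_ms(participant_append_ms: float,
                     coordinator_append_ms: float,
                     one_way_message_ms: float) -> float:
    """Lower bound on how long a participant holds its locks after PREPARE:
    its own log append, the YES reply travelling to the coordinator, the
    coordinator's log append, and the COMMIT message travelling back."""
    return (participant_append_ms + one_way_message_ms
            + coordinator_append_ms + one_way_message_ms)


# 10 ms per append is the lecture's mechanical-drive figure.
print(max_serial_tx_per_second(10.0))
# 0.5 ms one-way delay is an assumed value for a single machine room.
print(min_lock_hold_ms(10.0, 10.0, 0.5))
```

Under these assumptions the two disk writes dominate the time locks are held, which is why the lecture singles them out.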
69:13
and a further problem with it is that if
69:16
anything goes wrong messages are lost
69:18
something crashes then if you're not if
69:21
you're a little bit unlucky then the
69:23
participants have to wait for long times
69:25
with locks held
69:26
so therefore two-phase commit you really
69:30
only see it within relatively small
69:32
domains within a single machine room
69:34
within a single organization you don't
69:36
see it for example for transfers
69:39
between banks between different banks
69:42
you might possibly see it within a bank
69:44
if it's sharded its database but you
69:47
would never see two-phase commit run
69:48
between distinct organizations that were
69:52
maybe physically separate because of
69:53
this blocking business you don't want to
69:56
put the fate of you know your database
69:58
and whether it's operational in the
70:00
hands of some other organization where
70:02
they crash at the wrong time you're
70:04
forced your database is forced to hold
70:07
locks for a long time and because it's
70:12
so slow also there's a lot a lot of
70:15
research has gone into either making it
70:19
fast or relaxing the rules in various
70:21
ways to allow it to be faster or
70:24
specializing two-phase commit for very
70:27
specific situations in which you know
70:31
you can shave a message or write to the
70:33
disk or something off it because you
70:34
know you're only supporting a certain
70:36
limited kind of transaction so well
70:39
we'll see a fair amount of this in the
70:40
rest of the course one question that
70:45
comes up a lot this exchange here where
70:51
you have a leader essentially and it
70:53
sends these messages to the followers
70:56
and you know we can only go forward if
71:00
the leader can only proceed if it
71:02
receives you know acknowledgments
71:04
replies from enough of the followers
71:07
this looks a lot like raft this
71:11
construction looks a lot like raft
71:13
however the properties of the protocol
71:17
and what you get out of it turn out to
71:18
be quite different from what we get out
71:20
of raft they solve very different
71:24
problems
71:25
so the way to think about it is that you
71:28
use raft to get high-availability by
71:31
replicating data on multiple
71:34
participants on multiple peers that is
71:37
the point of raft is to be able to
71:39
operate even though some of the servers
71:42
involved have crashed or are not
71:44
reachable and you can do this in raft
71:47
raft can do this because all the servers
71:49
are doing the same thing they're doing
71:51
the same thing so we don't need all of
71:53
them to participate we only need a
71:55
majority in two-phase commit however the
72:00
participants are not at all doing the
72:02
same thing the participants are each
72:04
doing a different part of the
72:05
transaction you know A may be
72:07
incrementing record X and B may be
72:10
decrementing record Y so two-phase
72:13
commit all the participants
72:17
they all have to do their part in order
72:20
for the transaction to finish you really
72:22
need to wait for every single one of the
72:24
participants to do their thing so okay
72:31
so we got you know raft is replicating
72:34
doesn't need everybody to do their thing
72:35
two-phase commit
72:37
everybody's doing something different
72:39
that has to get done two-phase commit
72:42
does not help at all with availability
72:44
you know raft is all about availability
72:46
you can go on even if some of the
72:48
participants are not responding
72:50
two-phase commit is actually not at all
72:54
available it's not highly available at
72:56
all if anything goes wrong we risk
72:58
having to wait until that's repaired if
73:00
the transaction coordinator crashes at
73:02
the wrong time we simply have to wait
73:03
for it to come up and read its log and send
73:05
out the commit messages right if if one
73:08
of these participants you know crashes
73:11
at the wrong time you know if we're
73:12
lucky we simply have to abort if we're
73:15
not lucky we have to say did you finish
73:16
that did you finish that so two-phase
73:19
commit is not at all about high
73:21
availability in fact it's it's a it's
73:23
quite low availability as such things go
73:25
any crash can hold up the whole system
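The difference between the two progress rules can be sketched in a few lines. This is an illustrative model only, not code from raft or from any real two-phase commit implementation; the function names and the vote encoding (True for yes, False for no, None for no reply yet) are assumptions.

```python
from typing import Dict, Optional


def raft_can_proceed(acks: int, cluster_size: int) -> bool:
    """Every raft server does the same thing, so the leader only needs a
    majority (counting itself) to have the log entry."""
    return acks > cluster_size // 2


def two_phase_commit_decision(votes: Dict[str, Optional[bool]]) -> str:
    """Every participant does a different part of the transaction, so the
    coordinator needs a yes from all of them before it may commit."""
    if any(v is False for v in votes.values()):
        return "abort"
    if all(v is True for v in votes.values()):
        return "commit"
    # Someone has not answered PREPARE yet: keep waiting. In this model a
    # timeout would turn the missing vote into a "no".
    return "wait"


print(raft_can_proceed(acks=2, cluster_size=3))
print(two_phase_commit_decision({"A": True, "B": None}))
```

With three raft servers one can be silent and the leader still proceeds, while a single silent two-phase commit participant stalls the whole transaction.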
73:28
and of course raft doesn't ensure that
73:33
all the participants do whatever the
73:36
operation is it only requires a majority
73:38
there may be
73:39
minority that totally didn't do the
73:40
operation at all and that's how the fact
73:42
that in raft all the participants do the
73:44
same thing we don't have to wait for all
73:46
of them is why raft gets high
73:47
availability so these are quite
73:51
different protocols um it is however
73:55
possible to to usefully combine them
73:58
like two-phase commit is you know really
74:01
vulnerable to failures it's correct with
74:04
failures but it's not available with
74:06
failures so the question is could you
74:08
build some sort of combined system that
74:12
has the high availability of raft
74:14
replication but has two-phase commit's
74:19
ability to cause various different
74:21
parties each to do their part of the
74:23
transaction and the construction you
74:25
want actually is to use raft or paxos or
74:27
some other protocol like that to
74:31
individually replicate each of the
74:33
different parties so then we would for
74:37
this set up we would have like three
74:39
different clusters the transaction
74:41
coordinator would actually be a replicated
74:43
service with you know three servers and
74:50
you know we'd run raft on these three
74:53
servers one will be elected as leader
74:54
they'd have replicated state they'd have
74:56
a log that helped them replicate we
74:58
would only have to wait for a majority
75:00
the leader we'd only have to have a
75:02
majority of these to be up in order for
75:04
the transaction coordinator to do its
75:06
work and of course they would all
75:08
you know sort of execute through the
75:10
various stages of the transaction and
75:12
the two-phase commit protocol by
75:16
basically by appending relevant records
75:19
to their logs and then each of the
75:21
participants would also be a cluster of
75:25
a raft replicated cluster
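A rough model of that setup: the coordinator and each participant become a raft group, each group keeps working as long as a majority of its servers is up, and a transaction can move forward only if every group it touches is available. The class and function names below are hypothetical, and three servers per participant group is an assumption carried over from the coordinator example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RaftGroup:
    name: str
    size: int
    crashed: int = 0

    def available(self) -> bool:
        """A group can elect a leader and append to its log only while a
        majority of its servers is still up."""
        return (self.size - self.crashed) > self.size // 2


def transaction_can_make_progress(groups: Iterable[RaftGroup]) -> bool:
    """Two-phase commit still needs every party, but each party is now a
    replicated group instead of a single machine."""
    return all(g.available() for g in groups)


# One crashed server in every group: all three groups still have a majority.
groups = [RaftGroup("coordinator", 3, crashed=1),
          RaftGroup("A", 3, crashed=1),
          RaftGroup("B", 3, crashed=1)]
print(transaction_can_make_progress(groups))
```

In this model a transaction stalls only when some group loses its majority, rather than whenever any single server crashes.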
75:40
so we would end up and they would
75:43
exchange messages back and forth you
75:46
know we'd send a commit message from the
75:49
replicated transaction coordinator
75:51
service to the replicated a server and
75:53
the replicated B server and this is you
75:58
know this is admittedly somewhat
75:59
elaborate but it does show you that you
76:01
can combine these ideas to get the
76:03
combination of high availability because
76:05
any one of these servers can crash and
76:07
the remaining two keep operating
76:09
plus we get this atomic commitment where
76:12
A and B are doing completely different
76:14
parts of the same transaction and we can
76:17
use two-phase commit to have the
76:19
transaction coordinator ensure that you
76:21
know that they either both commit the whole
76:22
thing or they both abort their parts of
76:25
the transaction you'll actually build
76:30
something very much like this as part of
76:33
lab 4 in which you will indeed build a
76:35
sharded database where each shard is
76:37
replicated in this form and there's a
76:40
basically a configuration manager which
76:42
will allow essentially transactional
76:45
shifting of chunks of shards of data
76:48
from one raft cluster to another under
76:52
the control of something that looks a
76:55
lot like a transaction coordinator so
77:00
lab 4 is like this and in addition in a
77:05
little bit we'll be reading a paper
77:07
called Spanner which describes a
77:08
real-life database used by Google that
77:11
also uses this construction in
77:14
order to do transactional writes to a
77:16
database all right thank you