Let's imagine three servers with logs that look like this, where the numbers I'm writing are the term numbers of the commands in those log entries; we don't really care what the actual commands are. I've also numbered the log slots. Presumably the next term is term 6 — you can't actually tell that from the evidence on the board, but it must be at least 6. Let's imagine that server S3 is chosen as the leader for term 6, and at some point S3, the new leader, is going to want to send out a new log entry — say its first log entry for term 6. So we're thinking about the AppendEntries RPCs that the leader is going to send out to carry the first log entry for term 6, which really should go in slot 13.

The rules in Figure 2 say that an AppendEntries RPC carries, as well as the command that the client sent to the leader and that we want to replicate into the followers' logs, a prevLogIndex field and a prevLogTerm field. When the leader sends out an AppendEntries, it's supposed to include information about the previous slot — the slot just before the new information it's sending out. In this case the index of the previous entry is 12, and the term of the command in the leader's log at that previous entry is 5. So the leader sends that information out to the followers.

Before the followers accept an AppendEntries, they're supposed to check it: they know they've received an AppendEntries for some log entries that start here, and the first thing each receiving follower does is check that its own previous log entry matches the previous-entry information that the leader sent. For server S2 it doesn't match: S2 has an entry at slot 12, all right, but it's an entry from term 4, not from term 5, so S2 is going to reject this AppendEntries and send a false reply back to the leader. S1 doesn't even have anything at slot 12, so S1 is also going to reject the AppendEntries. So far so good: the terrible thing that has been averted at this point — the thing we absolutely don't want to see — is S2 actually sticking the new log entry in at slot 13, which would break the inductive argument that the Figure 2 scheme relies on and hide the fact that S2 actually had a different log. So instead of accepting the log entry, S2 rejects the RPC.

The leader sees the two rejections. The leader maintains a nextIndex field, one for each follower, so it has a nextIndex for S2 and a nextIndex for S1. I should have said this before: if the leader is sending out information about slot 13, that must mean its nextIndex for both of the other servers started out as 13, and that would be the case if this leader had just restarted, because the Figure 2 rules say that nextIndex starts out at the end of the new leader's log. In response to the errors, the leader is supposed to decrement its nextIndex, so it does that for both followers — it got errors from both, decrements both — and resends. This time it sends out AppendEntries with prevLogIndex = 11 and prevLogTerm = 3. This new AppendEntries has a different prevLogIndex, and the entries it carries this time include all of the leader's entries after that new prevLogIndex. S2 now looks at prevLogIndex 11 in its own log and sees: aha, the term there is 3, the same as what the leader is sending me. So S2 is actually going to accept this AppendEntries, and the Figure 2 rules say that if you accept an AppendEntries you're supposed to delete everything in your log after the point where the AppendEntries starts and replace it with whatever's in the AppendEntries. S2 does that, so its log now ends with the 5 and the 6.

S1 still has a problem, because it has nothing at slot 11, so it will return another error. The leader will now back up its nextIndex for S1 to 11, and it'll send out its log starting there, with the previous index and term now referring to slot 10. This one is actually acceptable to S1, so it will accept the new log entries and send a positive response back to the leader, and now they're all caught up. And presumably, when the leader sees that a follower accepted an AppendEntries that carried a certain number of log entries, it increments that follower's nextIndex — here, to 14.

So the net effect of all this backing up is that the leader has used the backup mechanism to detect the latest point at which each follower's log was still equal to the leader's, and then sent each follower, starting from that point, the complete remainder of the leader's log after that last point at which they were equal. Any questions? All right.
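To make the mechanics concrete, here is a rough Go sketch of the follower-side consistency check and the leader-side one-entry-at-a-time nextIndex backup described above. The field names follow the Figure 2 vocabulary, but the trimmed-down Raft struct is an assumption for illustration — term checks, locking, and most other bookkeeping from the lab are omitted.

```go
package raft

// Minimal types for the sketch; the real lab code has many more fields.
type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Raft struct {
	log        []LogEntry // log[0] is a dummy entry, so slot i is log[i]
	nextIndex  []int
	matchIndex []int
}

// Follower side: the Figure 2 consistency check. Reject unless our log has
// an entry at PrevLogIndex whose term equals PrevLogTerm.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false // force the leader to back up nextIndex
		return
	}
	// Accepted: discard everything after PrevLogIndex and take the leader's entries.
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)
	reply.Success = true
}

// Leader side: on success advance nextIndex/matchIndex; on rejection, back up
// nextIndex by one and retry (the slow scheme; see fast backup later).
func (rf *Raft) handleAppendEntriesReply(server int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	if reply.Success {
		rf.matchIndex[server] = args.PrevLogIndex + len(args.Entries)
		rf.nextIndex[server] = rf.matchIndex[server] + 1
	} else if rf.nextIndex[server] > 1 {
		rf.nextIndex[server]--
	}
}
```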
Just to repeat a discussion we've had before, and will probably have again: you'll notice that we erased some log entries here, which I've now erased from the board — I forget exactly what they were. Mostly, remember, we erased this log entry here, which used to say term 4, on server S2. The question is: why was it OK for the system to forget about this client command? The thing we erased corresponds to some client command which we're now throwing away. I talked about this yesterday — what's the rationale? Yeah: it's not on a majority of the servers, and therefore whatever previous leader sent it out couldn't have gotten acknowledgments from a majority of servers. Therefore that previous leader couldn't have decided it was committed, couldn't have executed it and applied it to the application state, and could never have sent a positive reply back to the client. Because this entry isn't on a majority of servers, we know the client who sent it in has no reason to believe it was executed — it couldn't have gotten a reply, because one of the rules is that the leader only sends a reply to a client after it commits and executes. So the client has no reason to believe it was even received by any server, and the rules of Figure 2 basically say that if the client gets no response after a while, it's supposed to resend the request. So whatever request this was that we threw away, it was never executed, never included in any state, and the client is going to resend it by and by.

Yes? Well, it's always deleting a suffix of the follower's log. In the end, the backstop answer to this is that the leader has a complete log, so if all else fails it can just send its complete log to the follower. And indeed, if you've just started up the system and something very strange happened right at the beginning — maybe in some of the tests for Lab 2 — you may end up backing up to the very first entry and having the leader essentially send the whole log. But because the leader has the whole log, it has all the information required to fill in everybody's logs if it needs to.

OK. So in this example, which I guess I've now erased, we elected S3 as the leader, and the question is: who are we allowed to elect as leader? If you read the paper, you know the answer is not just anyone. It turns out it matters a lot for the correctness of the system that we don't allow just anyone to be the leader — for example, the first node whose election timer goes off may in fact not be an acceptable leader — so Raft has some rules it applies about whether or not you can be leader. To see why this is true, let's set up a straw-man proposal: maybe Raft should use the server with the longest log as the leader. In some alternate universe that could be true, and it actually is true in systems with different designs — just not in Raft. So the question we're investigating is: why not use the server with the longest log as leader? This would involve changing the voting rules in Raft to have voters only vote for nodes that have longer logs.

Here's an example that's going to be convenient for showing why this is a bad idea. Let's imagine we have three servers again, and now the logs are set up so that server S1 has entries from terms 5, 6, and 7, server S2 has entries from terms 5 and 8, and server S3 also has entries from terms 5 and 8. The first question, of course — to avoid spending our time scratching our heads about utter nonsense — is to convince ourselves that this configuration could actually arise, because if it couldn't possibly arise, it may be a waste of time to figure out what would happen if it did. So does anybody want to propose a sequence of events whereby this set of logs could have arisen? Or how about an argument that it couldn't have arisen?
Well, maybe we'll come back to that. All right: server S1 wins the election at this point, and it's in term 6. It receives a client request and sends out the first AppendEntries — and that's fine, actually everything's fine so far, nothing's wrong. A good bet for all these scenarios is that then it crashes: it receives the client request in term 6, appends the client request to its own log (which it does first), and it's about to send out AppendEntries, but it crashes, so it never sends out any AppendEntries. Then it restarts very quickly, there's a new election, and gosh, S1 is elected again as the new leader, now in term 7. It receives a client request, appends it to its log, and then it crashes again. After that crash we have a new election; maybe S2 gets elected this time, with S1 down now, so off the table. If S2 is elected at this point, and suppose S1 is still dead, what term does S2 use?

Yeah, 8 is the right answer. So why 8 and not — remember, this is now gone from the board — why 8 and not 6? That's absolutely right: it's not written on the board, but in order for S1 to have been elected it must have gotten votes from a majority of nodes, which includes at least one of S2 and S3. If you look at the RequestVote handling code in Figure 2, if you vote for somebody you're supposed to record the term in persistent storage. That means that S2 or S3 — or both — knew about term 6, and in fact about term 7. Therefore, when S1 dies and they need to elect a new leader, at least one of them knows that the current term is already 7. If only one of them knows about term 7, only that one could win an election, because it has the higher term number; if they both know about term 7, then either one of them could try to become leader, and either way the attempt happens in term 8. So the fact that the next term must be term 8 is ensured by the property that majorities must overlap, together with the fact that currentTerm is updated by RequestVote, is persistent, and is guaranteed not to be lost even if there were some crashes here.

So the next term is going to be 8, and S2 or S3 will win the leadership election. Let's just imagine that whichever one it is sends out AppendEntries for a new client request, and the other one receives it, so now we have this configuration. That was a bit of a detour; we're back to our original question: in this configuration, suppose S1 revives and we have an election. Would it be OK to use S1 — would it be OK to have the rule be that the longest log wins, that the longest log gets to be the leader?
Yeah — obviously not, right? Because S1 as leader is going to force its log onto the two followers by the AppendEntries machinery we just talked about a few minutes ago. If we let S1 be the leader, it's going to send out AppendEntries, back up as needed, and overwrite these 8s — tell the followers to erase their log entries from term 8 and overwrite them with its term 6 and term 7 entries — and then proceed with logs identical to S1's. So why are we upset about this? Yeah, exactly: that term-8 entry was already committed. It's on a majority of servers, so it has quite possibly been committed, probably executed, quite possibly with a reply already sent to a client. We're not entitled to delete it, and therefore S1 cannot be allowed to become leader and force its log onto S2 and S3. Everybody see why that would be a bad idea for Raft? Because of that, longest-log can't possibly be the rule for elections. Of course, shortest log wouldn't work too well either.

In fact, if you read forward to section 5.4.1, Raft has a slightly more sophisticated election restriction that the RequestVote RPC handling code is supposed to check before it votes yes for another peer. The rule is: you vote yes for a candidate that sent you a RequestVote only if the candidate has a higher term in its last log entry than you do, or the same term in its last log entry and a log whose length is greater than or equal to that of the server that received the vote request.

If we apply this here: if S2 gets a vote request from S1, S1 will send a RequestVote with a last-entry term of 7, and S2's last-entry term is 8. So the first clause isn't satisfied — S2 didn't get a request from somebody with a higher term in its last entry — and the last-entry terms aren't the same either, so the second clause doesn't apply. So neither S2 nor S3 is going to vote for S1, and even if S1 sends out its vote requests first because it happens to have a shorter election timeout, nobody is going to vote for it except itself; that's one vote, not a majority. If either S2 or S3 becomes a candidate, then each will accept the other, because they have the same last-entry term and their logs are each greater than or equal in length to the other's, so either of them will vote for the other. Will S1 vote for either of them? Yes, because either S2 or S3 has a higher term number in its last entry than S1 does.

What this rule is doing is preferring candidates that have log entries from higher terms — that is, it prefers candidates that are more likely to have been receiving log entries from the most recent leader. And the second clause says: if we were all listening to the previous leader, then we vote for the server that saw more requests from that very last leader. Any questions about the election restriction?
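Here is a minimal Go sketch of that "at least as up-to-date" check as the RequestVote handler might apply it, reusing the Raft and LogEntry types from the earlier sketch. Term comparisons, the votedFor bookkeeping, and locking are deliberately left out; the RequestVoteArgs fields follow Figure 2, but the helper name is made up.

```go
// RequestVoteArgs carries the candidate's last log entry, per Figure 2.
type RequestVoteArgs struct {
	Term         int
	CandidateId  int
	LastLogIndex int
	LastLogTerm  int
}

// candidateUpToDate implements the election restriction from section 5.4.1:
// grant a vote only if the candidate's log is at least as up-to-date as ours.
func (rf *Raft) candidateUpToDate(args *RequestVoteArgs) bool {
	myLastIndex := len(rf.log) - 1
	myLastTerm := rf.log[myLastIndex].Term
	if args.LastLogTerm != myLastTerm {
		return args.LastLogTerm > myLastTerm // higher term in the last entry wins
	}
	return args.LastLogIndex >= myLastIndex // same last term: longer (or equal) log wins
}
```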
OK, a final thing about sending out log entries: this rollback scheme, at least as I described it and as it's described in Figure 2, rolls back one log entry at a time, and probably a lot of the time that's OK. But there are situations — maybe in the real world, and definitely in the lab tests — where backing up one entry at a time is going to take a long, long time. A real-world situation where that might be true is if a follower has been down for a long time and missed a lot of AppendEntries, and then the leader restarts. If you follow the pseudocode in Figure 2, when a leader restarts it's supposed to set its nextIndex to the end of its own log, so if the follower has been down and missed the last thousand log entries and the leader reboots, the leader is going to have to walk back, one entry at a time, one RPC at a time, over all thousand of those log entries that the follower missed. There's no particular reason why this would never happen in real life; it could easily happen. A somewhat more contrived situation, which the tests definitely explore, is this: say we have five servers and there's a leader, but the leader gets trapped with one follower in a network partition. The leader doesn't know it's not leader anymore, and it keeps sending AppendEntries to its one follower, none of which are committed, while in the other, majority partition the system continues as usual. The ex-leader and the follower in that minority partition could end up putting essentially unlimited numbers of log entries for a stale term into their logs — entries that will never be committed and will need to be deleted and overwritten eventually when they rejoin the main group. That's maybe a little less likely in the real world, but you'll see it happen in the test setup.

So, in order to be able to back up faster, the paper has a somewhat vague description of a faster scheme towards the end of section 5.3. It's a little bit hard to interpret, so I'm going to try to explain its idea about how to back up faster a little bit better. The general idea is to have the follower send enough information to the leader that the leader can jump back over an entire term's worth of entries that have to be deleted per AppendEntries, so the leader may only have to send one AppendEntries per term in which the leader and follower disagree, instead of one per entry.

There are three cases I think are important — and you can probably think of many different log-backup acceleration strategies; here's one. I'm going to divide the kinds of situations you might see into three cases, and I'm only going to talk about one follower and the leader, not the other nodes: we have two servers, S1, which is the follower, and S2, which is the leader. Case 1 is where we need to back up over a term that is entirely missing from the leader's log. Case 2 is where we need to back up over some entries, but they're entries from a term that the leader actually knows about — apparently this follower saw a couple of the very last AppendEntries sent out by a leader that was about to crash, but the new leader didn't see them, and we still need to back up over them. Case 3 is where the follower and the leader agree, but the follower is entirely missing the end of the leader's log.

I believe you can take care of all three of these with three pieces of extra information in the reply that a follower sends back to the leader when it rejects an AppendEntries because the logs don't agree. I'll call them XTerm, XIndex, and XLen. XTerm is the term of the conflicting entry: remember, the leader sent prevLogTerm, and if the follower rejects because it has something at that slot but the term is wrong, it puts the follower's term for the conflicting entry in XTerm — or -1 or something if it doesn't have anything in the log there. The follower also sends back XIndex, the index of the first entry in its log with that conflicting term. And finally, if there wasn't any log entry there at all, the follower sends back XLen, the length of the follower's log.
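Here's a rough Go sketch of the follower side of that scheme — filling in XTerm, XIndex, and XLen when it rejects. It reuses the Raft, LogEntry, and AppendEntriesArgs types from the earlier sketch and replaces the earlier reply type with an extended one; this is just one possible encoding of the idea, not the paper's or the lab's exact interface.

```go
// Extended AppendEntries reply carrying fast-backup hints.
type AppendEntriesReply struct {
	Term    int
	Success bool
	XTerm   int // term of the conflicting entry, or -1 if no entry at PrevLogIndex
	XIndex  int // index of the first entry in the follower's log with term XTerm
	XLen    int // length of the follower's log (including the dummy entry at 0)
}

// rejectWithBackupInfo is a hypothetical helper the follower would call
// instead of just setting Success=false.
func (rf *Raft) rejectWithBackupInfo(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Success = false
	reply.XLen = len(rf.log)
	if args.PrevLogIndex >= len(rf.log) {
		// Case 3: the follower's log is too short; there's no entry at PrevLogIndex.
		reply.XTerm = -1
		reply.XIndex = -1
		return
	}
	// Cases 1 and 2: there is an entry at PrevLogIndex, but its term conflicts.
	reply.XTerm = rf.log[args.PrevLogIndex].Term
	i := args.PrevLogIndex
	for i > 1 && rf.log[i-1].Term == reply.XTerm {
		i-- // walk back to the first entry of the conflicting term
	}
	reply.XIndex = i
}
```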
For case 1, the way this helps is that the leader sees it doesn't have any entry with term XTerm at all in its log — that's the case where the leader didn't have term 5 — and then the leader can simply back the follower up to the beginning of the follower's run of entries with XTerm. That is, the leader sets its nextIndex for that follower to XIndex, the first entry in the follower's run of term-5 entries. So if the leader doesn't have XTerm at all, it should back the follower up to XIndex. The second case the leader can detect is when XTerm is valid and the leader actually does have log entries from term XTerm — that's the case where the disagreement is here, but the leader has some entries from that term. In that case the leader should back up to the last entry it has with the follower's conflicting term — the last entry the leader has for term 4, in this case. And if neither of those two cases holds — that is, the follower indicates, maybe by setting XTerm to -1, that it didn't have anything whatsoever at the conflicting index because its log is too short — then the leader should back up its nextIndex to the end of the follower's log, the last entry the follower has at all, and start sending from there.

I'm telling you this because it'll be useful for doing the lab, and if you missed some of my description, it's in the lecture notes. Any questions about this backing-up business?

Yeah, I think that's true — maybe binary search. I'm not ruling out other solutions. After reading the paper's thin description of how to do it, I cooked this up, and there are probably other ways — probably better and faster ways. I'm sure that if you're willing to send back more information, or use a more sophisticated strategy like binary search, you can do a better job.

Do you need it for the labs? Well, you almost certainly need to do something; experience suggests that in order to pass the tests you'll need to do something like this. Although that's not quite true — one of the solutions I've written over the years actually does the slow thing and still passes the tests. But one of the unfortunate, inevitable things about the tests we give you is that they have a bit of a real-time requirement: the tests are not willing to wait forever for your solution to produce an answer. So it is possible to have a solution that's technically correct but takes so long that the tester gives up, and unfortunately the tester will fail you if your solution doesn't finish within whatever the time limit is. Therefore you do actually have to pay some attention to performance: your solution has to be both correct and fast enough to finish before the tester gets bored and times out on you, which is something like ten minutes. And unfortunately this stuff is complex enough that it's not that hard to write a correct solution that isn't fast enough.

How does the leader tell the cases apart? The follower is supposed to send back the term number it sees in the conflicting entry. We have case 1 if the leader does not have that term in its log. So here the follower will set XTerm to 5, because this is going to be the conflicting entry; the leader observes, oh, I do not have term 5 in my log, therefore this is case 1, and it should back up to the beginning of the follower's run of term-5 entries — the leader has none of them, so it should get rid of all of them in the follower by backing up to XIndex. And does the follower actually delete them? Yeah — the leader is going to back up its nextIndex to here and then send an AppendEntries that starts here, and the rules in Figure 2 say the follower then has to replace its log, so it is going to get rid of the 5s. OK.
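And here's the matching leader side, as a Go sketch of the three cases, reusing the types from the sketches above; the helper name and the exact backup targets are one reasonable reading of the scheme described here, not a definitive implementation (for instance, backing up one slot less or more in case 2 also works, just with different efficiency).

```go
// backupNextIndex uses the fast-backup hints from a rejection to move
// nextIndex back a whole term (or to the end of a short follower log)
// in one step, instead of one entry per RPC.
func (rf *Raft) backupNextIndex(server int, reply *AppendEntriesReply) {
	if reply.XTerm == -1 {
		// Case 3: follower's log is too short; resume just past its last entry.
		rf.nextIndex[server] = reply.XLen
		return
	}
	// Look for the leader's last entry with the follower's conflicting term.
	last := -1
	for i := len(rf.log) - 1; i > 0; i-- {
		if rf.log[i].Term == reply.XTerm {
			last = i
			break
		}
	}
	if last == -1 {
		// Case 1: leader has no entries from XTerm at all; skip the follower's
		// whole run of XTerm entries by jumping to their first index.
		rf.nextIndex[server] = reply.XIndex
	} else {
		// Case 2: leader knows XTerm; back up to its last entry for that term.
		rf.nextIndex[server] = last
	}
}
```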
All right, the next thing I want to talk about is persistence. You'll notice in Figure 2 that the state in the upper left-hand corner is divided: some items are marked persistent and some are marked volatile. What's going on here is that the distinction between persistent and volatile only matters if a server reboots — crashes and restarts. What "persistent" means is that if you change one of the items marked persistent, the server is supposed to write it to disk or to some other non-volatile storage — an SSD, or battery-backed memory, or whatever — that will ensure that if the server restarts, it will be able to find that information and reload it into memory. That's what allows servers to pick up where they left off if they crash and restart.

Now, you might think it would be sufficient, and simpler, to say: if a server crashes, we just throw it away and replace it with a brand-new empty server and bring it up to speed. And of course it is vital to be able to do that, because if some server suffers a catastrophic failure — its disk melts or something — you absolutely need to be able to replace it, and you cannot count on getting anything useful off its disk if something bad happened to the disk. So we absolutely need to be able to completely replace servers with ones that have no state whatsoever. You might think that's sufficient to handle any difficulty, but it's actually not. It turns out that another common failure mode is power failure of the entire cluster, where all the servers stop executing at the same time. In that case we can't handle the failure by simply throwing away the servers and replacing them with new hardware we buy from Dell; we actually have to be able to get off the ground again — we need to be able to get a copy of the state back in order to keep executing, if we want our service to be fault tolerant. Therefore, at least in order to handle simultaneous power failure, we have to have a way for the servers to save their state somewhere where it will be available when the power returns. That's one way of viewing what's going on with persistence: it's the state required to get a server going again, after either a single server failure or a power failure of the entire cluster.

All right, so in Figure 2 only three items are persistent: the log (all the log entries), currentTerm, and votedFor. By the way, when a server reboots it actually has to make an explicit check that these data are valid on its disk before it rejoins the Raft cluster; it has to have some way of saying, "yes, I actually do have saved persistent state," as opposed to a bunch of zeros that aren't valid.

The reason the log has to be persisted is that, at least according to Figure 2, it's the only record of the application state. Figure 2 does not say that we have to persist the application state: if we're running a database, or a test-and-set service like the one for VMware FT, the actual database — or the actual value of the test-and-set flag — isn't persisted according to Figure 2; only the log entries are. So when the server restarts, the only information available to reconstruct the application state is the sequence of commands in the log, and so that has to be persisted. What about currentTerm — why does currentTerm have to be persistent?
Yeah — both currentTerm and votedFor are about ensuring that each term has at most one leader. For votedFor, the specific potentially damaging case is this: a server receives a vote request and votes for server 1, and then it crashes. If it didn't persist the identity of who it had voted for, it might crash, restart, get another vote request for the same term from server 2, and say, "gosh, I haven't voted for anybody, because my votedFor is blank — I'm going to vote for server 2." Now our server has voted for both server 1 and server 2 in the same term, and that might allow two leaders: since server 1 and server 2 each voted for themselves, they both may think they have a majority out of three, and they're both going to become leader. Now we have two simultaneous leaders for the same term. That's why votedFor has to be persistent.

currentTerm is a little more subtle, but we talked before about how, again, we don't want more than one leader per term, and if we don't know what the current term number is, it may be hard to ensure that there's only one leader per term. In this example: if server 1 was down and servers 2 and 3 were going to try to elect a new leader, they need evidence that the correct next term number is 8 and not 6. If they forgot about currentTerm, and it was just servers 2 and 3 voting for each other with only their logs to look at, they might think the next term should be term 6; they'd start producing entries for term 6, and now there's going to be a lot of confusion, because we have two different term sixes. So that's why currentTerm has to be persistent: to preserve evidence about term numbers that have already been used.

These have to be persisted pretty much every time you change them. Certainly the safe thing to do is: every time you add an entry to the log, or change currentTerm, or set votedFor, you persist it, and in a real Raft server that would mean writing it to disk — you'd have some set of files that record this stuff. You may be able to cut some corners if you observe that you don't need to persist these things until you communicate with the outside world, so there may be some opportunity for a little bit of batching: we don't have to persist anything until we're about to reply to an RPC or about to send out an RPC. That may allow you to avoid a few persist operations.
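As a concrete picture of what "persist the three Figure 2 items" could look like, here's a minimal Go sketch that serializes them with encoding/gob and writes them to a file. The file path, the saveState helper, the write-a-whole-temp-file-then-rename scheme, and the string commands are assumptions for illustration; the lab provides its own Persister object instead, and a real server would be more careful and more efficient.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"log"
	"os"
)

// The three items Figure 2 marks persistent.
type PersistentState struct {
	CurrentTerm int
	VotedFor    int
	Log         []LogEntry
}

// String commands for simplicity; the lab's entries hold interface{} commands.
type LogEntry struct {
	Term    int
	Command string
}

// saveState is a hypothetical helper: encode the state, write it to a temp
// file, fsync it, and atomically rename it over the previous state file.
func saveState(path string, st PersistentState) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(st); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(".", "raft-state-")
	if err != nil {
		return err
	}
	if _, err := tmp.Write(buf.Bytes()); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // don't return until the bytes are on the media
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	st := PersistentState{
		CurrentTerm: 8,
		VotedFor:    2,
		Log:         []LogEntry{{Term: 0}, {Term: 5, Command: "x=1"}},
	}
	if err := saveState("raft-state", st); err != nil {
		log.Fatal(err)
	}
}
```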
The reason that matters is that writing stuff to disk can be very expensive. If it's a mechanical hard drive we're talking about, and the way we're persisting is by writing files on the disk, then writing anything costs you about 10 milliseconds: you either have to wait for the point you want to write to spin under the head — and the disk only rotates about once every 10 milliseconds — or worse, you may actually have to seek, moving the arm to the right track. So these persists can be terribly expensive, and for any kind of straightforward design they're likely to be the limiting factor in performance, because they mean that doing anything whatsoever on these Raft servers takes 10 milliseconds a pop, and 10 milliseconds is far longer than it takes to, say, send an RPC or do almost anything else. Ten milliseconds per operation means that if you persist data to a mechanical drive, you can just never build a Raft service that serves more than about 100 requests per second, because that's what you get at 10 milliseconds per operation.

This is really all about the cost of synchronous disk updates, and it comes up in many systems. Think of file systems: the designers of the file systems running on your laptops spend a huge amount of time trying to navigate around the performance problems of synchronous disk writes, because in order to update the file system on your laptop's disk safely, the file system has to be careful about how it writes and sometimes has to wait for the disk to finish writing. So this is a cross-cutting issue in all kinds of systems, and it certainly comes up in Raft.

If you want to build a system that can serve more than a hundred requests per second, there are a bunch of options. One is to use a solid-state drive, or some kind of flash: solid-state drives can do a write to flash memory in maybe a tenth of a millisecond, so that's a factor of a hundred for you. Or, if you're even more sophisticated, maybe you can build yourself battery-backed DRAM, do the persistence into the battery-backed DRAM, and then if the server reboots, hope that the reboot took less time than the battery lasts and that the stuff you persisted is still in the RAM. If you have the money and the sophistication, the reason to favor that is that you can write DRAM millions of times per second, so it's probably not going to be a performance bottleneck at all. So this problem is why the marking of persistent versus volatile in Figure 2 has a lot of significance for performance, as well as for crash recovery and correctness. Any questions about persisting? Yeah?

So your question is basically: you're writing code — say, Go code for your Raft implementation — or you're trying to write a real Raft implementation, and you actually want to make sure that when you persist an update to the log or the current term or whatever, it will in fact be there after a crash and reboot. What's the recipe for making sure it's there? And your observation is right: on Unix or Linux or a Mac, if you call write — the write system call is how you write to a disk file — it is not the case that after write returns, the data is safe on disk and will survive a reboot; it almost certainly is not on disk yet.
The particular piece of magic you need, on Unix at any rate, is this: you write to some file you've opened that's going to contain the stuff you want to persist, and then you have to call fsync. On most systems the guarantee is that fsync doesn't return until all the data you've previously written to that file is safely on the media, in a place where it will still be there if there's a crash. fsync is an expensive call, and that's why it's separate — that's why write doesn't write the disk and only fsync does — because it's so expensive you would never want to do it unless you really wanted to persist some data.

OK, so you can use more expensive disk hardware. The other trick people play a lot is batching: if you have a lot of client requests coming in, maybe you should accept a lot of them and not reply to any of them for a little bit, let a lot of them accumulate, and then persist, say, a hundred log entries at a time from your hundred clients, and only then send out the AppendEntries. Because you do actually have to persist this stuff to disk: if you receive a client request, you have to persist the new entry to disk before you send the AppendEntries RPCs to the followers, because the leader is essentially promising to commit that request and can't forget about it. And indeed, the followers have to persist the new log entry to their disks before they reply to the AppendEntries, because their reply to the AppendEntries is also a promise to preserve and eventually commit that log entry, so they can't be allowed to forget about it if they crash. Other questions about persistence? All right.
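To make the write-then-fsync recipe above concrete, here's a tiny Go sketch; in Go, (*os.File).Sync is the call that issues fsync. The file name and its contents are made up for illustration.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Open (or create) the file that holds the persistent state.
	f, err := os.OpenFile("raft-state", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("currentTerm=8 votedFor=2 ...")); err != nil {
		log.Fatal(err)
	}

	// Write() alone does not mean the bytes are on the disk; they are likely
	// sitting in kernel buffers. Sync() (fsync) does not return until the data
	// has been pushed to the storage device, so it survives a crash or reboot.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```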
One final little detail about persistence is that some of the state in Figure 2 is not persistent, so it's worth scratching your head a little about why it's fair game for commitIndex, lastApplied, nextIndex, and matchIndex to simply be thrown away if the server crashes and restarts. Why wasn't, say, commitIndex or lastApplied persisted? Gosh, lastApplied is the record of how much we've executed; if we throw that away, aren't we going to execute log entries twice, and is that correct? Why is it safe to throw away lastApplied?

Yes — we're all about simplicity and safety here with Raft, so that's exactly correct. The reason those other fields can be volatile and thrown away is that the leader can reconstruct what's been committed by inspecting its own log and by looking at the results of the AppendEntries it sends out to the followers. Initially — if everybody restarts because they experienced a power failure — the leader does not know what's committed or what's been executed, but as it sends out AppendEntries it gathers back information from the followers about how much of their logs match the leader's, and therefore how much must have been committed before the crash.

Another thing about the Figure 2 world — which is not the real world — is that Figure 2 assumes the application state is destroyed and thrown away if there's a crash and a restart. The Figure 2 world assumes that while the log is persistent, the application state is absolutely not; it isn't required to be, because in Figure 2 the log is preserved — persisted — from the very beginning of the system. So what's going to happen, if you play out the various rules in Figure 2 after a leader restarts, is that Raft will eventually re-execute every single log entry: after a reboot, Raft hands the application every log entry starting from entry one, so after a restart the application completely reconstructs its state from scratch by a replay, from the beginning of time, of the entire log. Again, that's a straightforward, elegant plan, but obviously potentially very slow.
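As a toy picture of that replay-from-scratch plan, here's a small Go sketch that rebuilds a key/value table by re-applying every entry of the persisted log in order. It reuses the LogEntry type from the earlier sketches; the PutCommand type and the in-memory map are made up for illustration — the lab instead delivers entries to the service over an apply channel.

```go
// A toy command type for the sketch; the real service defines its own.
type PutCommand struct{ Key, Value string }

// replay rebuilds the application state from the whole log after a restart.
func replay(logEntries []LogEntry) map[string]string {
	kv := make(map[string]string)
	for _, e := range logEntries[1:] { // skip the dummy entry at index 0
		if put, ok := e.Command.(PutCommand); ok {
			kv[put.Key] = put.Value // re-apply every write, in log order
		}
	}
	return kv
}
```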
Which brings us to the next topic: log compaction and snapshots. This has a lot to do with Lab 3B — you'll see log compaction and snapshots in Lab 3B. The problem that log compaction and snapshotting solve in Raft is that, for a long-running system that's been going for weeks or months or years, if we just follow the Figure 2 rules the log keeps on growing; it may end up millions and millions of entries long. That requires a lot of memory to store; if you persist it, every persist of the log uses up a huge amount of space on disk; and if a server ever restarts, it has to reconstruct its state by replaying those millions of log entries from the very beginning, which could take hours. All of which is kind of wasted work, because before it crashed, it already had the application state.

In order to cope with this, Raft has the idea of snapshots. The idea behind snapshots is to ask the application to save a copy of its state as of a particular log entry. We've been mostly ignoring the application, but suppose we're building a key/value store on top of Raft: the log is going to contain a bunch of puts and gets, or read and write requests. Maybe the log contains a put where some client wants to set x to 1, then another one that sets x to 2, then y = 7, or whatever. If there are no crashes, as Raft executes along, there's this application above Raft — if it's a key/value store or a database, it's maintaining this table — and as Raft hands it one command after the next, the application updates its table: after the first command it sets x to 1 in the table, after the second command it updates the table again, and so on.

One interesting fact is that for most applications, the application state is likely to be much smaller than the corresponding log. At some level we know that the log and the state as of some point in the log are kind of interchangeable — they both imply the same thing about the state of the application — but the log may contain a lot of repeated assignments to x that use up a lot of space in the log yet effectively compact down to a single entry in the table, and that's pretty typical of these replicated applications. The point is that instead of storing the log, which may grow to be huge, we have the option of storing the table instead, which might be a lot smaller. That's what snapshots are doing.

So when Raft feels that its log has gotten too large — more than a megabyte, or ten megabytes, or whatever arbitrary limit — Raft asks the application to make a snapshot of the application state as of a certain point in the log. When Raft asks the application for a snapshot, it picks the point in the log that the snapshot refers to and requires the application to produce a snapshot as of exactly that point. This is extremely critical, because what we're about to do is throw away everything before that point: if there isn't a well-defined point that the snapshot corresponds to, then we can't safely throw away the log before that point. So Raft asks for the snapshot — and the snapshot is basically just the table, if this is a database server — and we also need to annotate the snapshot with the log index it corresponds to. If the entries are 1, 2, 3, this snapshot corresponds to the state just after applying log index 3.

With the snapshot in hand — once Raft persists it to disk — Raft never again needs that earlier part of the log, and it can simply throw it away. As long as it persists a snapshot as of a certain log index, plus the log after that index, we're never going to need the log before that point. So this is what Raft does: it asks the application for a snapshot, saves it to disk together with the log after that point, and just throws away the earlier log. So the persistence story now really operates on pairs: a snapshot, plus the log after the point associated with that snapshot. Does everyone see this? Yes?

No — you should still think of it as one log. There are these sort of phantom entries — one, two, three — that we can view as being there in principle; since we never need to look at them, because we have the snapshot, the fact that they happen not to be stored anywhere is neither here nor there. You should think of it as still the same log; we just threw away the early entries. That's maybe a little too glib an answer, because Figure 2 talks about the log in ways that, if you follow it literally, sometimes still need those earlier entries, so you'll have to reinterpret Figure 2 a little bit in light of the fact that it sometimes says "blah blah blah, a log entry" where the log entry no longer exists.

OK, so what happens on a restart? The restart story is a little more complicated than it used to be with just a log. What happens on a restart is that there needs to be a way for Raft to find the latest snapshot/log pair on its disk and hand the snapshot to the application, because we're no longer able to replay all the log entries — there must be some other way to initialize the application. So not only does the application have to be able to produce a snapshot of the application state, it also has to be able to absorb a previously made snapshot and reconstruct its table in memory from it. And even though Raft is managing this whole snapshotting business, the snapshot contents are really the property of the application: Raft doesn't even understand what's in there, only the application does, because it's full of application-specific information. So after a restart, the application has to be able to absorb the latest snapshot that Raft found.

If it were just this, it would be simple. Unfortunately, this snapshotting — in particular, the idea that the leader might throw away part of its log — introduces a major piece of complexity: if there's some follower out there whose log ends before the point at which the leader's log starts, then unless we invent something new — namely InstallSnapshot — that follower can never get up to date. If there's a follower whose log contains only the first two entries, we no longer have log entry 3, which is what we'd need to send that follower in an AppendEntries RPC to allow its log to catch up to the leader's.

Now, we could avoid this problem by having the leader never drop a part of its log if there's any follower that hasn't caught up to the point at which the leader is thinking about taking a snapshot. The leader knows — well, actually the leader doesn't really know, but it could know in principle, through nextIndex — how far each follower has gotten, and the leader could say: I'm just never going to drop the part of my log before the end of the follower with the shortest log.
That would be OK — it might actually just be a good idea, period. The reason it's maybe not such a great idea is that, of course, if a follower is shut down for a week, it's not going to be acknowledging log entries, and that means the leader can't reduce its memory use by snapshotting. So the way the Raft design chooses to go is that the leader is allowed to throw away parts of its log that would be needed by some follower, and we need some scheme other than AppendEntries to deal with the gap between the end of some follower's log and the beginning of the leader's log. That solution is the InstallSnapshot RPC. The deal is this: when there's some follower whose log is short — say it just powered on — the leader is going to send it AppendEntries, be forced to back up, and at some point the rejected AppendEntries calls will cause the leader to realize it has reached the beginning of the log it actually stores. At that point, instead of sending an AppendEntries, the leader will send its current snapshot to the follower, and then presumably immediately follow it with an AppendEntries that carries the leader's current log.

Questions? Yeah — the sad truth is that this adds significant complexity in Lab 3, partially because of the kind of cooperation required between Raft and the application. This is a little bit of a violation of modularity; it requires a good deal of cooperation. For example, when an InstallSnapshot comes in, it's delivered to Raft, but Raft really requires the application to absorb the snapshot, so they have to talk to each other more than they otherwise might.

Yes — the question is whether the way the snapshot is created depends on the application. It absolutely does: the snapshot creation function is part of the application, part of, say, the key/value server. Raft has to somehow call up to the application and say, "gee, I'd really like a snapshot right now," because only the application understands what its state is. And the inverse function, by which the application reconstructs its state from a snapshot file, is also totally application-dependent. There's intertwining there too, because of course every snapshot has to be labeled with the point in the log that it corresponds to.
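Here's a rough Go sketch of the bookkeeping this implies on the Raft side: remember the index and term of the last entry the snapshot covers, and trim the in-memory log up to that point. The lastIncludedIndex/lastIncludedTerm names follow the paper's InstallSnapshot description, but the Snapshot helper, the slicing scheme, and the trimmed-down struct are assumptions for illustration rather than the lab's actual interface.

```go
// A trimmed-down Raft struct for this sketch; reuses LogEntry from earlier.
// rf.log[0] is kept as a dummy entry holding the term of the last entry that
// the snapshot covers, so prevLogTerm lookups at the boundary still work.
type Raft struct {
	log               []LogEntry
	lastIncludedIndex int    // absolute index of the last entry covered by the snapshot
	lastIncludedTerm  int    // term of that entry
	snapshot          []byte // opaque application state; Raft never looks inside
}

// Snapshot is called (e.g. by the key/value server) once it has serialized its
// table as of absolute log index `index`; Raft can then discard entries <= index.
// Assumes index refers to an entry Raft still holds in memory.
func (rf *Raft) Snapshot(index int, snap []byte) {
	if index <= rf.lastIncludedIndex {
		return // already covered by an older snapshot
	}
	rel := index - rf.lastIncludedIndex // position within the in-memory slice
	rf.lastIncludedTerm = rf.log[rel].Term
	rf.log = append([]LogEntry{{Term: rf.lastIncludedTerm}}, rf.log[rel+1:]...)
	rf.lastIncludedIndex = index
	rf.snapshot = snap
	// A real implementation would now persist the snapshot together with the
	// remaining log suffix, as described above.
}
```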
You're talking about rule 6 in Figure 13? OK, so the question here — and you will be faced with this in Lab 3 — is that because the RPC system isn't perfectly reliable or perfectly sequenced, RPCs can arrive out of order, or not at all; or you may send an RPC, get no response, and think it was lost, when actually it was delivered and the reply was lost. All these things happen, including to InstallSnapshot RPCs, and the leader is almost certainly sending out many RPCs concurrently, both AppendEntries and InstallSnapshots. That means you can get things like InstallSnapshot RPCs from deep in the past, or almost anything else, and therefore the follower has to think carefully about an InstallSnapshot that arrives. I think the specific thing you're asking is: if a follower receives an InstallSnapshot that appears to be completely redundant — that is, it contains information older than the information the follower already has — what should the follower do? Rule 6 in Figure 13 says something, but I think an equally valid response is that the follower can simply ignore a snapshot that is clearly from the past. I don't really understand that rule 6.

OK, I want to move on to a somewhat more conceptual topic for a bit. So far we haven't really tried to nail down anything about what it means to be correct — what it means for a replicated service, or really any other kind of service, to be behaving correctly. For most of my life I've managed to get by without worrying too much about precise definitions of correctness, but the fact is that if you're trying to optimize something, or you're trying to think through some weird corner case, it's often handy to have a more or less formal way of deciding whether a behavior is correct or not. Here, what we're talking about is clients sending requests to our replicated service over RPC — maybe the service has crashed and is restarting and loading snapshots, or whatever — and the client sends in a request and gets a response. Is that response correct? How are we supposed to tell whether response A would be correct, or response B? We need a pretty formal notion for distinguishing "that's OK" from "that would be a wrong answer." For this lab, our notion of correctness is linearizability. I've mentioned strong consistency in connection with some of the papers; strong consistency is basically equivalent to linearizability. Linearizability is, more or less, a formalization of the behavior you would expect if there were just one server, it never crashed, it executed client requests one at a time, and nothing funny ever happened.

It has a definition; I'll write out the definition and then talk about it. An execution history — a sequence of client operations, maybe many requests from many clients — is linearizable (this is in the notes) if there exists a total order of the operations in the history that matches the real-time order of the requests: if one client sends out a request and gets a response, and then later in time another client sends out a request and gets a response, those two requests are ordered, because one of them started after the other one finished. So a history is linearizable if there exists an order of the operations in the history that matches real time for non-
64:12 Okay, I want to move on to a somewhat more conceptual topic for a bit. So far we haven't really tried to nail down anything about what it means to be correct — what it means for a replicated service, or really any other kind of service, to behave correctly. For most of my life I managed to get by without worrying too much about precise definitions of correctness, but the fact is that if you're trying to optimize something, or you're trying to think through some weird corner case, it's often handy to have a more or less formal way of deciding whether a given behavior is correct or not. 65:00 So here, what we're talking about is clients sending requests in to our replicated service by RPC; maybe the service has crashed and is restarting and loading snapshots, or whatever; the client sends in a request and gets a response. Is that response correct? How are we supposed to tell whether response A would be correct, or response B? We need a fairly formal notion for distinguishing "that's okay" from "that would be a wrong answer." 65:33 For this lab, our notion of correctness is linearizability. Some of the papers mention strong consistency; strong consistency is basically equivalent to linearizability. Linearizability is a formalization, more or less, of the behavior you would expect if there were just one server, it didn't crash, it executed client requests one at a time, and nothing funny ever happened. 66:09 It has a definition; I'll write out the definition and then talk about it. An execution history — a sequence of client requests, maybe many requests from many clients — is linearizable (this is in the notes) if there exists a total order of the operations in the history such that, first, it matches the real-time order of non-concurrent requests, that is, requests that didn't overlap in time: if one client sends a request and gets a response, and then later in time another client sends a request and gets a response, those two requests are ordered, because one of them started after the other finished; and second, each read sees the value from the most recent preceding write to the same piece of data in the order. 68:08 That's the definition.
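Because the definition only refers to observable invocation times, response times, and values, it can be checked mechanically. Here is a small, self-contained Go sketch (not anything from the labs): it represents each operation with its start and end times and brute-forces all orders of the history, accepting the history if some order respects real-time precedence and makes every read return the most recent preceding write to its key. It also assumes every value read was written by some operation in the history, and it is only practical for tiny histories like the ones on the board.

```go
package main

import "fmt"

// Op is one client operation as observed from outside: when it was invoked,
// when its response arrived, and what it did. Times are abstract integers.
type Op struct {
	Start, End int // invocation and response times
	IsWrite    bool
	Key        string
	Value      int // value written, or value the read returned
}

// validOrder reports whether a candidate total order (a permutation of the
// operation indices) satisfies the two requirements of the definition.
func validOrder(ops []Op, perm []int) bool {
	// 1. Real-time order: if a finished before b started, a must precede b.
	pos := make([]int, len(ops))
	for i, idx := range perm {
		pos[idx] = i
	}
	for a := range ops {
		for b := range ops {
			if ops[a].End < ops[b].Start && pos[a] > pos[b] {
				return false
			}
		}
	}
	// 2. Each read sees the value of the most recent preceding write to its key.
	last := map[string]int{} // latest written value per key, so far in the order
	for _, idx := range perm {
		op := ops[idx]
		if op.IsWrite {
			last[op.Key] = op.Value
		} else if v, ok := last[op.Key]; !ok || v != op.Value {
			return false
		}
	}
	return true
}

// linearizable tries every permutation of the history, looking for one
// satisfying order; its existence is exactly what the definition asks for.
func linearizable(ops []Op) bool {
	perm := make([]int, len(ops))
	for i := range perm {
		perm[i] = i
	}
	var try func(k int) bool
	try = func(k int) bool {
		if k == len(perm) {
			return validOrder(ops, perm)
		}
		for i := k; i < len(perm); i++ {
			perm[k], perm[i] = perm[i], perm[k]
			if try(k + 1) {
				return true
			}
			perm[k], perm[i] = perm[i], perm[k]
		}
		return false
	}
	return try(0)
}
```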
68:10 Let me illustrate what it means by running through an example. First of all, the history is a record of client operations, so this is a definition you can apply from outside: it doesn't appeal in any way to what happens inside the implementation or how the implementation works. If we see a system operating and we can watch the messages that go in and out, we can answer the question of whether the execution we observed was linearizable. 68:44 So let me write out a history and talk about why it is or isn't linearizable. 68:53 Here's an example. Linearizability talks about operations that start at one point and end at another, and that corresponds to the time at which a client sends a request and the later time at which it receives the reply. Let's suppose our history says that at some particular time a client sent a write request for the data item named x, asking for it to be set to 1; time passed, and at a second point that client got a reply. Then later in time that client — or some other client, it doesn't really matter — sends another write request, for item x with value 2, and gets a response to that write. Meanwhile some client sends a read of x and gets value 2 back, and there's another request we observed as part of the history: a read of x that got value 1 back. 70:12 When we have a history like this, the question to ask is: is this a linearizable history? That is, did the machinery, the service, the system that produced this history produce a linearizable history? If this history is not linearizable, then, say in lab 3, we know we have a problem: there must be some bug. 70:38 Okay, so we need to analyze this to figure out whether it's linearizable. Linearizability requires us to produce an order — a one-by-one order of the four operations in the history. So we're looking for an order, and there are two constraints on that order. One is that if one operation finished before another started, then the one that finished first has to come first in the order. The other is that if some read sees a particular written value, then the read must come after that write in the order. 71:20 So we want to produce an order with four entries — the two writes and the two reads — and I'm going to draw, with arrows, the constraints implied by those two rules; our order is going to have to obey those constraints. 71:36 One constraint is that the write of x=1 finished before the write of x=2 started, and therefore the write of x=1 must appear in the total order before the write of x=2. One read saw the value 2, so in the total order the write of x=2 must be the most recent write before it: the read must come after the write of x=2. That means in the total order we must see the write of x=2 and then, after it, the read of x that yields 2. And for the read of x that yields 1 — if we assume x didn't already have the value 1 — there must be the same kind of relationship: that read must come after the write of x=1, and it must also come before the write of x=2 (and maybe there are some other restrictions too). 72:35 Anyway, we can take this set of arrows and flatten it out into an order, and that actually works. The total order that demonstrates this history is linearizable is: first the write of x=1, then the read of x yielding 1, then the write of x=2, then the read of x yielding 2. 73:03 So the fact that there is an order obeying the ordering constraints shows that this history is linearizable; if we're worried about whether the system that produced this history is linearizable, this particular example doesn't contradict the presumption that it is. Any questions about what I just did? 73:29 [Question.] Each read — say a read of x — must see the value written by the most recent preceding write in the order. In this case we're totally okay with this order, because for each read the value it saw is indeed the value written by the most recent write to x in the order. Informally, reads should not yield stale data: if I write something and read it back, gosh, I should see the value I wrote, and this is a formalization of that notion. 74:27 Oh yes — all right, let me write up an example that's not linearizable. Example two: let's suppose our history has a write of x with value 1 that finishes before a write of x with value 2 starts, a read of x that yields 2, and a read of x that yields 1 which starts only after the read yielding 2 has finished. 75:14 For this one we also want to write out the arrows, so we know the constraints on any total order we might find. The write of x=1, because it finished in real time before the write of x=2 started, must come before it in any satisfying order we produce. The write of x=2 has to come before the read of x that yields 2. The read of x that yields 2 finished before the read of x that yields 1 started, so we have that arrow. And the read of x that yields 1, because it saw value 1, has to come after the write of x=1 and, more crucially, before the write of x=2 — we can't have this read yielding 1 if it's immediately preceded by the write of x=2 — so we also have an arrow like that. 76:18 And because there's a cycle in these constraints, there's no order that can obey all of them, and therefore this history is not linearizable, and so the system that produced it is not a linearizable system.
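Continuing the sketch above, the two histories from the board can be written down as data and fed to the checker. The concrete start/end times below are illustrative placements consistent with the overlaps and constraints described, not values from the lecture.

```go
func main() {
	// Example 1: W(x,1), W(x,2), a read that returned 2, a read that returned 1,
	// with the reads overlapping the writes; expected result: linearizable.
	ex1 := []Op{
		{Start: 0, End: 10, IsWrite: true, Key: "x", Value: 1},
		{Start: 20, End: 30, IsWrite: true, Key: "x", Value: 2},
		{Start: 15, End: 35, Key: "x", Value: 2}, // read that returned 2
		{Start: 5, End: 25, Key: "x", Value: 1},  // read that returned 1
	}
	// Example 2: same writes, but the read that returns 1 starts only after the
	// read that returned 2 has finished; expected result: not linearizable.
	ex2 := []Op{
		{Start: 0, End: 10, IsWrite: true, Key: "x", Value: 1},
		{Start: 20, End: 30, IsWrite: true, Key: "x", Value: 2},
		{Start: 15, End: 24, Key: "x", Value: 2}, // read that returned 2
		{Start: 25, End: 35, Key: "x", Value: 1}, // read that returned 1
	}
	fmt.Println(linearizable(ex1), linearizable(ex2)) // prints: true false
}
```

The second history fails for exactly the reason sketched with the arrows: any order must place the write of x=2 before the read that returned 2, that read before the read that returned 1, and that last read before the write of x=2, which is a cycle.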
76:42 It would have been linearizable if the history had been missing any one of those three constraints; that would break the cycle. 76:47 Yes? Maybe — I'm not sure, because I don't know how to incorporate very strange things like somebody reading 27: if there's no write of 27, a read of 27 doesn't fit in, at least the way I've written out the rules — well, there may be some sort of anti-dependency you could construct. 77:29 Okay, I will continue this discussion next week.