Transcript

00:00 All right, let's get started. Today and tomorrow I'm going to talk about Raft, both because I hope it will be helpful to you in implementing the labs, and because it's a good case study in the details of getting state machine replication correct.

00:28 As an introduction to the problem: you may have noticed a pattern in the fault-tolerant systems we've looked at so far. MapReduce replicates computation, but the replication — really the whole computation — is controlled by a single master. GFS replicates data: it has a primary/backup scheme for replicating the actual contents of files, but it relies on a single master to choose who the primary is for every piece of data. VMware FT replicates computation on a primary virtual machine and a backup virtual machine, but in order to figure out what to do if one of them seems to have failed, it relies on a single test-and-set server to help ensure that exactly one of the primary and the backup takes over. So in all three cases there was a replication system, but tucked away in a corner of it there was some scheme in which a single entity was required to make the critical decision about who the primary was.

01:47 The very nice thing about having a single entity decide who the primary is going to be is that it can't disagree with itself: there's only one of it, so whatever decision it makes is the decision. The bad thing is that the single entity is itself a single point of failure. You can view these systems as pushing the real heart of the fault-tolerance machinery into a little corner — the single entity that decides who's going to be the primary if there's a failure.

02:23 This whole topic is about how to avoid split brain. The reason we have to be extremely careful about deciding who should be the primary after a failure is that otherwise we risk split brain, and just to make this point super clear, I'm going to remind you what the problem is and why it's serious.

02:47 Suppose, for example, we want to build ourselves a replicated test-and-set server — say we're worried that VMware FT relies on a single test-and-set server to choose the primary, so let's replicate it. The design I'm about to sketch is going to be broken; it's just an illustration of why it's difficult to get the split-brain problem right. Imagine we have a network, two servers that are supposed to be replicas of our test-and-set service, and two clients that need to know who the primary is right now — or actually, in the VMware FT case, the two clients would themselves be the primary and the backup.
03:36 If it's a test-and-set service, both servers start out with the flag in their state set to zero, and the one operation clients can send is test-and-set, which is supposed to set the flag of the replicated service to one — so it should set both copies — and return the old value. Essentially it acts as a kind of simplified lock server.

04:02 The problem situation where we worry about split brain arises when a client can talk to one of the servers but not the other. Let's assume the protocol is that a client ordinarily sends every request to both servers; then we need to think through what the client should do if one of the servers doesn't respond, or what the system should do if one of the servers seems to have become unresponsive. So imagine client one can contact server one but not server two. How should the system react? One possibility: we certainly don't want the client to talk only to server one, because that would leave the second replica inconsistent — we'd set the flag to one on server one but not on server two. So maybe the rule should be that a client is always required to talk to both servers for any operation and is never allowed to talk to just one of them. Why is that the wrong answer? If the rule is that clients must always talk to both replicas in order to make any progress, it's actually worse than a single server: now the system is stuck if either server crashes or is unreachable, whereas with a non-replicated service you depend on only one server. So we can't possibly require the client to wait for both servers to respond — if we want fault tolerance, it has to be able to proceed without one of them.

05:52 Another obvious answer is that if the client can't talk to both, it just talks to the one it can reach and figures the other one is dead. Why is that also not the right answer? The troubling scenario is that the other server is actually alive. Suppose the real problem isn't that server two crashed — that would be good for us — but the much worse situation that something went wrong with the network, so client one can talk to server one but not server two, and there's some other client out there that can talk to server two but not server one. If we make the rule that a client should ordinarily talk to both servers, but, in order to be fault tolerant, it's okay to talk to just one, then here's what's inevitably going to happen: the cable breaks, cutting the network in half.
06:48 Client one sends a test-and-set request to server one. Server one sets its flag to one and returns the previous value of zero, so client one thinks it holds the lock — and if this is the VMware FT test-and-set server, client one thinks it can take over as primary — while server two's replica of the flag still has zero in it. Now client two, who also sends its test-and-set request to both servers, sees that server one appears to be down, follows the rule that says you may talk to just the one server you can reach, and it too thinks it has acquired the lock. If this test-and-set server were being used with VMware FT, both of those VMware machines would now think they could be primary by themselves, without consulting the other server. That's a complete failure.

07:42 So with this setup and two servers, it seemed like we had to choose: either you wait for both and you're not fault tolerant, or you wait for just one and you're not correct. The incorrect version is what's often called split brain.
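To make that failure concrete, here is a minimal, self-contained sketch — invented for illustration, not taken from any real system — that simulates the broken "settle for whichever server answers" rule. Two clients on opposite sides of a partition each reach a different replica, and both are told the old value was zero, so both believe they hold the lock:

```go
package main

import "fmt"

// tsServer is one replica of the (broken) replicated test-and-set service.
type tsServer struct{ flag int }

// testAndSet sets the flag to 1 and returns the old value.
func (s *tsServer) testAndSet() int {
	old := s.flag
	s.flag = 1
	return old
}

// brokenAcquire is the flawed client rule from the lecture: send the request
// to every server you can reach, and if some don't answer, settle for
// whichever ones did. reachable[i] models whether the network lets this
// client talk to server i.
func brokenAcquire(servers []*tsServer, reachable []bool) (int, bool) {
	old, got := 0, false
	for i, srv := range servers {
		if !reachable[i] {
			continue // looks dead to this client, so skip it
		}
		old = srv.testAndSet()
		got = true
	}
	return old, got
}

func main() {
	s1, s2 := &tsServer{}, &tsServer{}
	servers := []*tsServer{s1, s2}

	// The network partitions: client 1 reaches only s1, client 2 only s2.
	old1, _ := brokenAcquire(servers, []bool{true, false})
	old2, _ := brokenAcquire(servers, []bool{false, true})

	// Both clients see old value 0, so both believe they hold the lock:
	// that's split brain.
	fmt.Println("client 1 got", old1, "client 2 got", old2)
}
```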
08:09 This was basically where things stood until the late 1980s. People did want to build replicated systems — the computers that control telephone switches, or the computers that run banks, places willing to spend a huge amount of money for reliable service — so they would build replicated systems, but they would try to rule out split brain using a couple of techniques. One is that they would build a network that cannot fail. You actually use networks that essentially cannot fail all the time: the wires inside your laptop connecting the CPU to the DRAM are, under reasonable assumptions, a network that cannot fail. With enough money and a carefully controlled physical setup — no cable snaking across the floor for somebody to step on — you can get quite close to a network that cannot fail. It's a bit of an assumption, but if the network cannot fail, then the fact that a client can't talk to server two means server two must be down, because it can't have been the network malfunctioning. That was one way people built replication systems that didn't suffer from split brain.

09:42 Another possibility was to have human beings sort out the problem. Don't do anything automatically: by default, clients always have to wait for both replicas to respond and are never allowed to proceed with just one of them; instead, somebody's pager goes off, a human goes to the machine room, looks at the two replicas, and either turns one off to make sure it's definitely dead or verifies that one of them has indeed crashed while the other is alive. You're essentially using the human as the tiebreaker — and the human, if they were a computer, would be a single point of failure themselves. For a long time people used one or the other of these schemes to build replicated systems. They can be made to work, but humans don't respond very quickly and a network that cannot fail is expensive.

10:38 It turned out, though, that you can actually build automated failover systems that work correctly in the face of flaky networks — networks that can fail, that can partition. A split of the network in half, where the two sides are up but can't talk to each other, is usually called a partition. The big insight that let people build automated replication systems that don't suffer from split brain is the idea of majority vote. This concept shows up in practically every other sentence of the Raft paper; it's the fundamental way of proceeding.

11:28 The first step is to have an odd number of servers instead of an even number. One flaw in the two-server picture is that it's too symmetric: the two sides of the split look the same, they run the same software, so they're going to do the same thing, and that's not good. With an odd number of servers it's not symmetric any more: a single network split will presumably put two servers on one side and one server on the other, and that asymmetry is part of what majority-voting schemes appeal to.

12:04 So the basic idea is: you have an odd number of servers, and in order to make progress of any kind — in Raft, to elect a leader or to cause a log entry to be committed — at each step you have to assemble a majority, more than half of all the servers, to approve that step: vote for a leader, or accept a new log entry and commit it. In the most straightforward setup, that means two out of three servers are required to do anything.

12:40 One reason this works is that if there's a partition, there can't be more than one partition containing a majority of the servers. A partition can have one server in it, which isn't a majority; it can have two, but then the other partition has only one server and will never be able to assemble a majority or make progress.

13:07 And just to be totally clear: when we talk about a majority, it's always a majority out of all of the servers, not just the live servers. This is a point that confused me for a long time. If you have a system with three servers and some of them have failed, the majority you need to assemble is still two out of three, even if you know one has failed — the majority is always counted out of the total number of servers.
13:30 There's a more general formulation of this. A majority-voting system in which two out of three servers are required to make progress can survive the failure of one server, since any two servers are enough to make progress. If you're worried about how reliable your servers are, you can build systems with more servers, and the general formulation is: with 2F+1 servers you can withstand F failures and keep going. With three servers, F is one, so the system can tolerate one failure. These are often called quorum systems, because the two out of three is sometimes called a quorum.

14:29 One property I've already mentioned is that at most one partition can have a majority, so if the network partitions we can't have both halves making progress. Another, more subtle, thing going on here is that if you always need a majority of the servers to proceed, and you go through a succession of operations each of which assembled a majority — say, votes for successive leaders — then the majority you assemble for each step must contain at least one server that was in the previous step's majority. That is, any two majorities overlap in at least one server. It's really that property, more than anything else, that Raft relies on to avoid split brain. For example, when there's a successful leader election and the new leader assembles votes from a majority, that majority is guaranteed to overlap with the previous leader's majority, so the new leader is guaranteed to learn the term number used by the previous leader, because everybody in the previous leader's majority knew the previous leader's term number. Similarly, anything the previous leader could have committed must be present on a majority of the servers, and therefore any new leader's majority must overlap, in at least one server, with every committed entry from the previous leader. This is a big part of why Raft is correct.
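A quick illustration of that arithmetic (the function names here are just for illustration):

```go
package main

import "fmt"

// majority returns the minimum number of servers that constitutes a
// majority of a cluster of n servers: more than half, counted out of
// all n servers, live or not.
func majority(n int) int { return n/2 + 1 }

// tolerableFailures returns how many simultaneous server failures a
// cluster of n servers can survive while still assembling a majority:
// with n = 2F+1 servers, that's F.
func tolerableFailures(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("n=%d: majority=%d, tolerates %d failures\n",
			n, majority(n), tolerableFailures(n))
	}
	// Note that 2*majority(n) > n for every n, which is why any two
	// majorities of the same n servers must overlap in at least one server.
}
```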
16:14 Any questions about the general concept of majority-voting systems?

16:27 A question about whether the set of servers can change: yes — section 6 of the paper explains how to add servers or otherwise change the set of servers, and in a long-running system you will need to do it. If you run your system for five or ten years, you're going to need to replace servers after a while: one of them fails permanently, or you upgrade, or you move to a different machine room. It certainly doesn't happen every day, but it's a critical part of the long-term maintainability of these systems, and the Raft authors pat themselves on the back a bit for having a scheme that deals with it — as well they might, because it's complex.

17:21 So, using this idea, around 1990 there were two systems proposed at about the same time that realized you could use majority voting to get around the apparent impossibility of avoiding split brain — basically by using three servers instead of two and taking majority votes. One of these very early systems was Paxos, which the Raft paper talks about a lot; the other was Viewstamped Replication, abbreviated VSR, which was invented by people at MIT. Even though Paxos is by far the more widely known of the two, Raft is actually closer in design to Viewstamped Replication. So there's a multi-decade history to these systems, but they only really came to the forefront and started being used heavily in big deployed distributed systems about fifteen years ago — a good fifteen years after they were originally invented.

18:39 Okay, so let me talk about Raft now. Raft takes the form of a library intended to be included in some service application. If you have a replicated service, each replica of the service is some application code — which receives RPCs or whatever — plus a Raft library, and the Raft libraries cooperate with each other to maintain the replication.

19:06 The software overview of a single Raft replica is that at the top we have the application code — for lab 3 it might be a key/value server — and the application has state that Raft is helping it manage as replicated state; for a key/value server that's a table of keys and values. The next layer down is the Raft layer. The key/value server makes function calls into Raft, they chitchat back and forth a little bit, and Raft keeps a little bit of state of its own — you can see it in Figure 2 — of which the most critical piece, for our purposes, is a log of operations. In a system with three replicas we'll actually have three servers with exactly this same structure, and hopefully the very same data sitting at both layers.

20:32 Outside of this there are going to be clients — client 1, client 2, a whole bunch of clients. The clients are just external code that needs to be able to use the service, and the hope is that the clients won't really need to be aware that they're talking to a replicated service: to the clients it should look almost as if they're talking to a single server. The clients send their requests to the application layer of the current leader — the replica that's currently the Raft leader.
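As a rough sketch of how these layers might fit together in the labs — the field names here are illustrative stand-ins, not the labs' exact code:

```go
package kvexample

import "sync"

// ApplyMsg is a rough stand-in for what the Raft layer sends up to the
// application layer when a log entry has been committed.
type ApplyMsg struct {
	CommandIndex int
	Command      interface{}
}

// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Term    int         // term in which the entry was created
	Command interface{} // the client operation, e.g. a Put or a Get
}

// Raft stands in for the Raft library layer below the application
// (a tiny subset of the Figure 2 state).
type Raft struct {
	peers       []int      // ids of all the servers in the cluster
	log         []LogEntry // the log of operations
	commitIndex int        // highest log index known to be committed
}

// KVServer is the application layer: a key/value server whose state
// (the key/value table) Raft keeps consistent across the replicas.
type KVServer struct {
	mu      sync.Mutex
	rf      *Raft             // the Raft layer beneath this replica
	applyCh chan ApplyMsg     // Raft delivers committed operations on this
	data    map[string]string // the replicated state: the key/value table
}
```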
21:11 These are going to be application-level requests. For a database or a key/value server they might be Put and Get requests: Put takes a key and a value and updates the table, and Get asks the service for the current value corresponding to some key. This part has nothing to do with Raft; it's just the client/server interaction for whatever service we're building.

21:38 But once one of these requests gets sent from a client to the server, what happens is not what you'd see on a non-replicated server, where the application code would simply execute the request, update the table, and respond. In a Raft-replicated service — assuming the client sends the request to the leader — the application layer instead passes the client's request down into the Raft layer and says: here's a request, please get it committed into the replicated log and tell me when you're done. At that point the Raft layers chitchat with each other until all the replicas, or rather a majority of the replicas, get this new operation into their logs — that is, it's replicated. Only when the leader knows that replicas have copies of it does Raft send a notification back up to the key/value layer: that operation you gave me has now been committed, it's safely replicated, and it's okay to execute it. So when the client sends a request, the key/value layer does not execute it right away, because it hasn't been replicated yet; only once it's in the logs of enough replicas does Raft notify the leader, and only then does the leader actually execute the operation — for a Put, updating the value; for a Get, reading the current value out of the table — and finally send the reply back to the client. That's the ordinary flow of a request.

23:34 Again, an operation is committed when it's in a majority of the logs — and the reason it can't be all of them is that if we want to build a fault-tolerant system, it has to be able to make progress even if some of the servers have failed.

24:08 In addition, when an operation is finally committed, each of the replicas' Raft layers sends the operation up to its local application layer, and the local application layer applies that operation to its state. So all the replicas see the same stream of operations, delivered in these up-calls in the same order, and they apply them to their state in the same order. Assuming the operations are deterministic — which they had better be — the replicated state will evolve identically on all the replicas. This table of keys and values is typically what the paper means when it talks about "state".
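Continuing the earlier sketch, here is roughly what that apply loop on each replica's key/value layer might look like (Op is an invented operation type, and in this sketch every committed command is assumed to be an Op):

```go
// Op is one client operation as it travels through the log.
type Op struct {
	Kind  string // "Put" or "Get"
	Key   string
	Value string
}

// applyLoop reads committed operations from Raft, in log order, and applies
// them to the local table. Because every replica applies the same
// deterministic operations in the same order, their tables stay identical.
func (kv *KVServer) applyLoop() {
	for msg := range kv.applyCh {
		op := msg.Command.(Op)
		kv.mu.Lock()
		if op.Kind == "Put" {
			kv.data[op.Key] = op.Value // deterministic state update
		}
		// For a Get there's nothing to change in the table; the leader
		// reads the value when it replies to the client.
		kv.mu.Unlock()
	}
}
```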
25:02 A different way of viewing this interaction — a kind of notation that will come up a lot in this course — is a time diagram of how the messages work. Imagine we have a client, server one (the leader), server two, and server three, with time flowing downward. The client sends the original request — say a Put — to server one. Server one's Raft layer then sends an AppendEntries RPC to each of the two other replicas and waits for the replies. As soon as positive replies arrive back from a majority — and that majority includes the leader itself, so in a system with only three servers the leader only has to wait for one other replica to respond positively to the AppendEntries — the leader executes the command, figures out what the answer is (say, for a Get), and sends the reply back to the client. If server three is actually alive it will eventually send back its response too, but we're not waiting for it (though the leader does make use of it, as you can see in Figure 2). That's the ordinary, no-failure operation of the system.

26:51 Oh gosh — I left out an important step. At this point the leader knows: I got a majority to put this entry in their logs, so I can go ahead and execute it and reply to the client, because it's committed. But server two doesn't know any of that yet; it just knows it got this request from the leader, and it doesn't know whether the request is committed — that depends, for example, on whether its reply ever got back to the leader. For all server two knows, its reply was dropped by the network, the leader never heard it, and the leader never decided to commit the request. So there's actually another stage: once the leader realizes a request is committed, it needs to tell the other replicas that fact. In Raft there isn't an explicit commit message; instead the information is piggybacked inside the next AppendEntries the leader sends out, for whatever reason — there's a leaderCommit field filled in in that RPC. The next time the leader needs to send a heartbeat, or needs to send out a new client request because some different client asked for something, it will send out the new, higher leaderCommit value, and at that point the replicas will execute the operation and apply it to their state.
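Here is a sketch of that leader-side bookkeeping, continuing the Raft sketch above (the function name is mine; commitIndex and the majority test follow Figure 2):

```go
// maybeAdvanceCommit is called after the leader has heard back about the
// entry at `index`: positiveReplies is how many followers acknowledged it.
func (rf *Raft) maybeAdvanceCommit(index int, positiveReplies int) {
	// +1 counts the leader's own copy of the entry.
	if positiveReplies+1 > len(rf.peers)/2 && index > rf.commitIndex {
		rf.commitIndex = index
		// There is no separate commit message: rf.commitIndex is sent to
		// the followers as the leaderCommit field of the next AppendEntries
		// (heartbeat or otherwise), and each replica then applies entries
		// up through that index to its state.
	}
}
```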
28:39 Yes — this is a protocol with quite a bit of chitchat in it, and it's not super fast. The client sends a request, the request has to get to the server, the server talks to at least one other replica, it has to wait for the responses and then send something back, so there are a bunch of message round-trip times embedded here.

29:10 Exactly when the leader sends out the updated commit index is up to you as the implementer. If client requests arrive only occasionally, the leader may want to send out a heartbeat or a special AppendEntries message. If client requests come frequently — say a thousand arrive per second — it doesn't matter, because another one will be along very soon and you can piggyback the information on the next message you were going to send anyway, rather than generating an extra message, which is somewhat expensive. In fact I don't think the time at which the replicas execute the request is critical, because nobody is waiting for it — at least if there are no failures, the replicas executing the request isn't on the critical path; the client is only waiting for the leader to execute. So exactly how this gets staged may not affect client-perceived latency.

30:37 One question you should ask is: why is the system so focused on logs? What are the logs doing? It's worth trying to come up with explicit answers. One answer is that the log is the mechanism by which the leader orders operations. It's vital for these replicated state machines that all the replicas apply not just the same client operations but the same operations in the same order, and the log — among many other things — is part of the machinery by which the leader assigns an order to the incoming client operations. If ten clients send operations to the leader at the same time, the leader has to pick an order and make sure all the replicas obey it, and the fact that the log has numbered slots is part of how the leader expresses the order it chose.

31:52 Another use of the log: between the moment a follower receives an operation and the moment it learns the operation is committed, it cannot execute it — it has to set the operation aside somewhere until the increased leaderCommit value comes in. So on the followers, the log is the place where a follower sets aside operations that are still tentative: they've arrived, but they're not yet known to be committed, and they may have to be thrown away, as we'll see.

32:29 A related use, on the leader's side, is that the leader needs to remember operations in its log because it may need to retransmit them to followers. If some follower is briefly offline, or something happens to its network connection and it misses some messages, the leader needs to be able to resend the log entries that follower missed — so the leader needs a place to keep copies of client requests, even ones it has already executed, in order to resend them to the replicas that missed them.

33:05 And a final reason for all of them to keep the log is that, at least in the world of Figure 2, if a server crashes and restarts and wants to rejoin — and you really do want a crashed server to be able to restart and rejoin the Raft cluster, because otherwise you're operating with only two out of three servers and can't survive any more failures; we need to reincorporate failed and rebooted servers — the log is what a rebooted server uses to recover. One of the rules is that each Raft server must write its log to disk, where it will still be after a crash and restart, and the rebooted server replays the operations in that log from the beginning to recreate its state as of when it crashed, and then carries on from there. So the log is also part of the persistence plan: a sequence of commands for rebuilding the state.
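A small sketch of those roles, again continuing the earlier sketch and assuming a currentTerm field from Figure 2 (index arithmetic simplified):

```go
// The leader assigns the next free slot number to each incoming client
// operation; the slot number is the order every replica must follow.
func (rf *Raft) leaderAppend(cmd interface{}) int {
	rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: cmd})
	return len(rf.log) - 1 // the index the leader chose for this operation
}

// On every server, entries at or below commitIndex are safe to execute.
// Entries beyond commitIndex are the tentative ones a follower has set
// aside, which might later be overwritten. The leader also keeps executed
// entries in rf.log so it can resend them to followers that missed an
// AppendEntries.
func (rf *Raft) tentativeSuffix() []LogEntry {
	return rf.log[rf.commitIndex+1:]
}
```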
34:16 Okay, so the question is: suppose the leader is capable of executing a thousand client commands per second, but the followers are only capable of executing a hundred client commands per second, as a sustained rate. One thing to note is that the followers acknowledge commands before they execute them, so the rate at which they acknowledge and accumulate entries in their logs isn't limited by execution — maybe they can acknowledge a thousand requests per second. But if they do that forever, they will build up logs of unbounded size, because their execution rate falls an unbounded amount behind the rate at which the leader is feeding them entries, and that means they will eventually run out of memory: after they fall a billion log entries behind, they'll call the memory allocator for space for a new log entry and it will fail. Raft doesn't have the flow control that's required to cope with this. In a real system you would probably need some kind of additional communication — it could be piggybacked and doesn't need to be real-time — that says "here's how far I've gotten in execution", so the leader can tell it has gotten many thousands of requests ahead of the point the followers have executed. In a production system that you're trying to push to the absolute maximum, you might well need an extra message to throttle the leader if it gets too far ahead.
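Raft itself has no such mechanism; this is only a sketch of the kind of extra signal just speculated about, with an invented field and an arbitrary threshold:

```go
// Suppose each AppendEntries reply also carried the follower's lastApplied
// index; the leader could then hold off on accepting new client operations
// when any follower's execution falls too far behind.

const maxLagEntries = 10000 // arbitrary example threshold

func leaderShouldThrottle(leaderLastIndex int, followerLastApplied []int) bool {
	for _, applied := range followerLastApplied {
		if leaderLastIndex-applied > maxLagEntries {
			return true // some follower's execution is too far behind
		}
	}
	return false
}
```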
36:31 Okay, so the question is what happens if one of these servers crashes. It has the log that it persisted to disk — that's one of the rules of Figure 2 — so the server will be able to read its log back from disk. But of course it doesn't know how far it had gotten in executing the log, and, at least when it first reboots, under the rules of Figure 2 it doesn't even know how much of the log is committed. So the first answer is that immediately after a restart — after a server crashes, restarts, and reads its log — it is not allowed to do anything with the log, because it does not know how far the system has committed: for all it knows it has a thousand uncommitted entries and zero committed ones.

37:24 If the leader also died, that by itself doesn't help either — but let's suppose they've all crashed. This is getting a bit ahead of me, but suppose all they have is the state that was marked as non-volatile in Figure 2, which includes the log and the latest term. None of them initially knows how far it had executed before the crash. What happens is leader election: one of them gets picked as leader, and if you trace through what Figure 2 says about how AppendEntries is supposed to work, the leader will figure out, as a byproduct of sending out AppendEntries — really of sending out the first heartbeat — what the latest point is at which a majority of the replicas agree in their logs, because that's the commit point. Another way of looking at it is that once you choose a leader, the AppendEntries mechanism forces all of the other replicas to have logs identical to the leader's. At that point — plus a little bit extra that the paper explains — since the leader knows it has forced all the replicas to have logs identical to its own, it knows those log entries must be held on a majority of replicas and must therefore be committed. At that point the leader's AppendEntries code described in Figure 2 will advance the leader's commit point, and everybody can execute the entire log from the beginning and recreate their state from scratch — possibly extremely laboriously.

39:29 That's what Figure 2 says. Obviously re-executing everything from scratch is not very attractive, but it's what the basic protocol does; we'll see tomorrow that the more efficient version of this uses checkpoints, and we'll talk about that then.
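A sketch of that replay step, continuing the earlier key/value sketches (assuming 1-based log indices with an unused slot 0, as in the paper):

```go
// rebuildFromLog recreates the key/value table by re-applying the persisted
// log in order, up through the commit point the new leader has established.
// (Tomorrow's lecture replaces "replay from the beginning" with snapshots.)
func (kv *KVServer) rebuildFromLog(log []LogEntry, commitIndex int) {
	kv.data = make(map[string]string)
	for i := 1; i <= commitIndex && i < len(log); i++ {
		if op, ok := log[i].Command.(Op); ok && op.Kind == "Put" {
			kv.data[op.Key] = op.Value
		}
	}
}
```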
39:50 Okay, so that was the sequence of events in ordinary, non-failure operation. Another thing I want to briefly mention is what the interface looks like; you've probably all seen a little bit of it from working on the labs. Roughly speaking, between the key/value layer, with its state, and the Raft layer underneath it, on each replica there are really two main pieces of interface. The first is the method by which the key/value layer relays a client request: when a client sends in a request, the key/value layer gives it to Raft and says, please fit this request into the log somewhere. That's the Start function you'll see in raft.go, and it really takes just one argument, the client command — the key/value layer saying "please get this command into the log and tell me when it's committed."

40:50 The other piece of the interface is that, by and by, the Raft layer will notify the key/value layer that the operation handed to it in a Start call a while ago — which may well not be the most recent Start; a hundred client commands could come in and cause calls to Start before any of them are committed — has been committed. This upward communication takes the form of a message on a Go channel that the Raft library sends on and the key/value layer is supposed to read from: the apply channel, on which Raft sends apply messages.

41:37 Of course the key/value layer needs to be able to match up a message it receives on the apply channel with the call to Start that it made, and the Start function returns enough information for that match-up to happen: it returns the index in the log where this command will be committed if it is committed at all — it might not be, but if it is, it will be at this index — and I think it also returns the current term, plus some other stuff we don't care about very much. The apply message, in turn, contains the index and the command.

42:26 All the replicas will get these apply messages, so they all know "I should apply this command": they figure out what the command means and apply it to their local state. They also get the index, but the index is really only useful on the leader, so it can figure out which client request the message corresponds to.
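Concretely, Start in the labs' raft.go looks roughly like `func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool)`, and the apply message carries the command and its index. One way the key/value layer might do the matching, continuing the sketches above (waiters is an invented field, a map from log index to a channel, protected by kv.mu):

```go
// submit hands the operation to Raft and remembers which log index Raft
// says the command will occupy if it is ever committed.
func (kv *KVServer) submit(op Op) (index int, isLeader bool) {
	index, _, isLeader = kv.rf.Start(op)
	return
}

// waitForCommit blocks until the apply loop has seen an ApplyMsg whose
// CommandIndex equals index — i.e. until Raft reports that this operation
// (or whatever ended up at that index) has been committed and applied.
func (kv *KVServer) waitForCommit(index int) {
	ch := make(chan struct{})
	kv.mu.Lock()
	kv.waiters[index] = ch // the apply loop closes ch when it applies `index`
	kv.mu.Unlock()
	<-ch
}
```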
43:14 Let me answer a slightly different question. Suppose the client sends in a request — a Put or a Get, it doesn't really matter, say a Get — and waits for a response. The point at which the leader will send a response at all is after the leader knows that command is committed; only then will it send the Get reply, so before that the client doesn't see anything back. In terms of the actual software stack, that means the RPC arrives, the key/value layer calls the Start function, Start returns — but the key/value layer does not yet reply to the client, because it hasn't actually executed the client's request, and it doesn't even know whether it ever will, because it's not sure the request is going to be committed. A situation in which it may never be committed: the key/value layer gets the request, calls Start, and the server crashes immediately after Start returns — it certainly hasn't sent out its AppendEntries messages, so nothing will be committed. So the sequence is: Start returns, time passes, the relevant apply message corresponding to that client request appears to the key/value server on the apply channel, and only then does the key/value server execute the request and send the reply.

44:58 All of this is very important once there are failures. It doesn't really matter when everything goes well, but we're now at the point where we start worrying about failures, and we're extremely interested in what the client saw if there was a failure.

45:13 One thing that does come up — and all of you should be familiar with this — is that, at least initially, the logs may not be identical. There is a whole set of situations in which, at least for brief periods of time, the ends of the different replicas' logs diverge. For example, if a leader starts to send out a round of AppendEntries but crashes before it's able to send them all, the replicas that got the AppendEntries will have appended the new log entry and the ones that didn't get the RPC won't have — so it's easy to see that the logs will sometimes diverge. The good news is that the way Raft works ends up forcing the logs to be identical after a while: there may be transient differences, but in the long run all the logs will be modified by the leader until the leader ensures they are all identical, and only then are the entries executed.

46:24 Okay, so there are really two big topics to talk about for Raft: one is how leader election works, which is lab 2, and the other is how the leader deals with the different replicas' logs, particularly after a failure. First I want to talk about leader election.

46:44 A question to ask is: why does the system even have a leader? Part of the answer is that you do not need a leader to build a system like this. It is possible to build an agreement system in which a cluster of servers agrees on the sequence of entries in a log without any kind of designated leader — indeed the original Paxos, which the paper refers to, did not have a leader. The reason Raft has a leader — there are probably a lot of reasons, but one of the foremost — is that you can build a more efficient system for the common case in which the servers don't fail. With a designated leader, everybody knows who the leader is, and you can basically get agreement on a request with one round of messages per request, whereas leaderless systems have more the flavor of needing a first round to agree on a kind of temporary leader and then a second round to actually send out the request. So the use of a leader probably speeds the system up by a factor of two, and it also makes it easier to think about what's going on.

48:04 Raft goes through a sequence of leaders, and it uses term numbers to disambiguate which leader we're talking about.
48:14 It turns out followers don't really need to know the identity of the leader; they really just need to know what the current term number is. Each term has at most one leader — that's a critical property: for every term there might be no leader, or there might be one leader, but there cannot be two leaders during the same term.

48:42 How do leaders get created in the first place? Every Raft server keeps an election timer — basically just a time it has recorded at which it will do something — and the something it does is this: if an entire election-timeout period expires without the server having heard any message from the current leader, the server assumes the current leader is probably dead and starts an election. So we have this election timer, and if it expires, we start an election.

49:28 What it means to start an election is basically that you increment the term. The server that has decided to become a candidate and force a new election first increments the term, because it wants there to be a new leader — namely itself — and a term can't have more than one leader, so we have to start a new term in order to have a new leader. Then it sends out a full round of RequestVote RPCs — though you may only have to send out n−1 of them, because one of the rules is that a new candidate always votes for itself in the election.
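A sketch of that timer-and-candidacy logic, continuing the Raft sketch above and assuming the Figure 2 fields currentTerm and votedFor plus invented lastHeard, timeout, and me fields and a sendRequestVote helper (locking omitted; uses the time package):

```go
// electionTicker polls the election timer; if a full election timeout passes
// without hearing from a leader, this server assumes the leader is dead and
// becomes a candidate.
func (rf *Raft) electionTicker() {
	for {
		time.Sleep(10 * time.Millisecond)
		if time.Since(rf.lastHeard) > rf.timeout {
			rf.currentTerm++    // a new leader needs a new term
			rf.votedFor = rf.me // a candidate always votes for itself
			rf.lastHeard = time.Now()
			for peer := range rf.peers {
				if peer != rf.me {
					go rf.sendRequestVote(peer, rf.currentTerm) // the other n-1 servers
				}
			}
		}
	}
}
```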
50:13 One thing to note: it's true that if the leader fails we will get an election — assuming any other server is up, some other server's election timer will eventually go off — but it's not quite the case that if the leader didn't fail we won't get one. If the network is slow, or drops a few heartbeats, election timers may go off and we may end up with a new election even though there was a perfectly good leader. We have to keep that in mind when we're thinking about correctness. And what that in turn means is that if there's a new election, it could easily be the case that the old leader is still hanging around and still thinks it's the leader. If there's a network partition, for example, and the old leader is alive and well in a minority partition, the majority partition may run an election — indeed a successful election — and choose a new leader, all totally unknown to the previous leader. So we also have to worry about what that previous leader is going to do, since it does not know there was a new election.

51:42 Okay, so the question is whether there can be pathological cases — for example, one-way network communication — that prevent the system from making progress. I believe the answer is yes. For example, if the current leader's network somehow half-fails, in a way where the leader can send out heartbeats but can't receive any client requests, then the heartbeats it sends out — which are delivered, because its outgoing network connection works — will suppress any other server from starting an election, while its apparently broken incoming link prevents it from hearing and executing any client commands. It's absolutely the case that Raft is not proof against all the crazy network problems that can come up. The ones I've thought about I believe are fixable: we could solve this one by requiring a kind of two-way heartbeat, in which followers are required to reply to heartbeats — I guess they already are — and if the leader stops seeing replies to its heartbeats, then after some amount of time in which it sees no replies, the leader decides to step down. That specific issue can be fixed, and many others can too, but you're absolutely right that very strange things can happen to networks, including some that the protocol is not prepared for.

53:28 Okay, so we have these leader elections, and we need to ensure that there is at most one leader per term. How does Raft do that? Raft requires that, in order to be elected for a term, a candidate must get yes votes from a majority of the servers, and each server will cast only one yes vote per term — in any given term, each server votes only once, for only one candidate. So two candidates can't both get a majority of votes, because everybody votes only once; the majority rule causes there to be at most one winning candidate, and so at most one leader elected per term.

54:24 In addition — critically — the majority rule means you can get elected even if some servers have crashed. If a minority of servers are crashed or unreachable, or there are network problems, we can still elect a leader. If more than half have crashed, are unreachable, or are in another partition, then the system will just sit there trying again and again to elect a leader, and it will never elect one, because it cannot in fact assemble a majority of live servers.
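A sketch of the voter-side rule that enforces one vote per term (argument and reply shapes simplified, currentTerm and votedFor assumed as in Figure 2; the log up-to-dateness check from the paper is omitted here since we haven't gotten to it):

```go
type RequestVoteArgs struct {
	Term        int
	CandidateID int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// RequestVote grants at most one yes vote per term, which is what makes it
// impossible for two candidates to both assemble a majority in the same term.
func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term // learn about the newer term
		rf.votedFor = -1           // haven't voted in this new term yet
	}
	reply.Term = rf.currentTerm
	// Grant the vote only if it's for the current term and we haven't
	// already voted for somebody else in this term.
	if args.Term == rf.currentTerm &&
		(rf.votedFor == -1 || rf.votedFor == args.CandidateID) {
		rf.votedFor = args.CandidateID
		reply.VoteGranted = true
	}
}
```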
heartbeat append 55:41 entries doesn't explicitly say I won the 55:45 election you know I'm a leader for term 55:47 23 it's a little more subtle than that 55:51 the the way the information is 55:53 communicated is that no one is allowed 55:57 to send out an append entries unless 56:00 they're a leader for that term so the 56:02 fact that I I'm a you know I'm a server 56:05 and I saw oh there's an election for 56:07 term 19 and then by-and-by I sent an 56:09 append entries whose term is 19 that 56:12 tells me that somebody I don't know who 56:15 but somebody won the election so that's 56:18 how the other servers knows they were 56:19 receiving append entries for that term 56:21 and that append entries also has the 56:24 effect of resetting everybody's election 56:27 time timer so as long as the leader is 56:30 up and it sends out heartbeat messages 56:32 or append entries at least you know at 56:34 the rate that's supposed to every time a 56:36 server receives an append entries it'll 56:38 reset its selection timer and sort of 56:42 suppress anybody from being a new 56:45 candidate so as long as everything's 56:47 functioning the repeated heartbeats will 56:49 prevent any further elections of course 56:52 it the network fails or packets are 56:53 dropped there may nevertheless be an 56:55 election but if all goes well we're sort 56:57 of unlikely to get an election this 57:03 scheme could fail in the sense that it 57:05 can't fail in the sense of electing to 57:07 leaders fair term but it can fail in the 57:09 sense of electing zero leaders for a 57:11 term that's sort of morningway it may 57:14 fail is that if too many servers are 57:16 dead or unavailable or a bad network 57:18 connection so if you can't assemble a 57:19 majority you can't be elected nothing 57:21 happens the more interesting way in 57:24 which an election can fail is if 57:27 everybody's up you know there's no 57:30 failures no packets are dropped but two 57:33 leaders become candidate close together 57:35 enough in time that they split the vote 57:38 between them or say three leaders 57:45 so supposing we have three liters 57:46 supposing we have a three replica system 57:49 all their election timers go off at the 57:51 same time every server both for itself 57:54 and then when each of them receives a 57:57 request vote from another server well 57:59 it's already cast its vote for itself 58:00 and so it says no so that means that it 58:02 all three of the server's needs to get 58:04 one vote each nobody gets a majority and 58:05 nobody's elected so then their election 58:09 timers will go off again because the 58:11 election timers only be said if it gets 58:12 an append entries but there's no leader 58:14 so no append entries they'll all have 58:16 their election timers go off again and 58:17 if we're unlucky 58:19 they'll all go off at the same time 58:20 they'll all go for themselves nobody 58:22 will get a majority so so clearly I'm 58:27 sure you're all aware at this point 58:28 there's more to this story and the way 58:31 Raft makes this possibility of split 58:35 votes unlikely but not impossible 58:38 is by randomizing these election timers 58:41 so the way to think of it and the 58:44 randomization the way to think of it is 58:46 that supposing you have some time line 58:47 I'm gonna draw a vents on there's some 58:52 point at which everybody received the 58:54 last append entries right and then maybe 58:57 the server died let's just assume the 58:58 server send out a last heartbeat and 59:01 then died well all 
58:44 The way to think about the randomization is this. Suppose we have a timeline, and at some point everybody received the last AppendEntries from the leader — say the leader sent out one last heartbeat and then died. All of the followers reset their election timers when they received it, at about the same time, because they probably all received that AppendEntries at about the same time; but they each chose a different random time in the future at which their timer will go off.

59:27 So suppose the dead leader was server one. Servers two and three now set their election timers for random points in the future — say server two's timer is set to go off first, and server three's a bit later. The crucial point about this picture is that, assuming they picked different random numbers, one of them is first and the other is second; and assuming the gap between them is big enough — if we're not unlucky — the first one's election timer will go off before the other's, and it will have time to send out a full round of vote requests and get answers from everybody who's alive before the second server's election timer goes off. That's how the randomization de-synchronizes the candidates.

60:30 Unfortunately there's a bit of art in setting the constants for these election timers, because there are competing requirements you might want to fulfill. One obvious requirement is that the election timeout has to be at least as long as the expected interval between heartbeats: if the leader sends out heartbeats every hundred milliseconds, there's no point in having anybody's election timer ever go off in less than a hundred milliseconds, because then it would go off before the next heartbeat even arrives. So the lower limit is certainly one heartbeat interval; in fact, because the network may drop packets, you probably want the minimum election timeout to be a couple of times the heartbeat interval — for hundred-millisecond heartbeats, the very shortest election timeout should probably be, say, three hundred milliseconds, three times the heartbeat interval.

61:39 What about the maximum? You're presumably going to randomize uniformly over some range of times, so where should the top of that range be? There are a couple of considerations. In a real system, this maximum affects how quickly the system can recover from a failure, because from the time the leader fails until the first election timer goes off, the whole system is frozen: there's no leader, the clients' requests are being thrown away, and we're not choosing a new leader even though the other servers are presumably up. So the bigger we choose this maximum, the longer the delay we impose on clients before recovery occurs; whether that's important depends on how high-performance we need the system to be and how often we think there will be failures — if failures happen once a year, who cares, but if we're expecting failures frequently we may care very much how long it takes to recover.

62:53 The other consideration is that the gap — the expected gap in time between the first timer going off and the second timer going off — has to be longer than the time it takes the candidate to assemble votes from everybody; that is, longer than the expected round-trip time to send an RPC and get the response. If it takes ten milliseconds to send RequestVotes and get responses from all the other servers, we need the maximum to be at least large enough that there's pretty likely to be a ten-millisecond difference between the smallest random timeout and the next smallest.

63:40 For you, the lab test code will get upset if you don't recover from a leader failure within a couple of seconds, so pragmatically you need to tune this maximum down so that it's highly likely you'll be able to complete a leader election within a few seconds — which is not a very tight constraint.

64:13 One tiny point: you want to choose a new random timeout every time a node resets its election timer. Don't choose a random number once when the server is first created and then reuse it over and over again, because you might make an unlucky choice — one server might, by ill chance, choose the same random number as another server — and then you'd have split votes over and over again, forever. That's why you almost certainly want to choose a fresh random number for the election timeout value every time you reset the timer.
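A sketch of choosing timeouts under those constraints (the constants are examples, not requirements; uses the time and math/rand packages):

```go
const (
	heartbeatInterval  = 100 * time.Millisecond
	minElectionTimeout = 300 * time.Millisecond // a few heartbeat intervals
	maxElectionTimeout = 1 * time.Second        // bounds how long recovery can stall
)

// randomElectionTimeout picks a value uniformly in [min, max); the spread
// should comfortably exceed an RPC round trip so the first candidate usually
// finishes collecting votes before the next timer fires.
func randomElectionTimeout() time.Duration {
	spread := maxElectionTimeout - minElectionTimeout
	return minElectionTimeout + time.Duration(rand.Int63n(int64(spread)))
}

// Whenever an AppendEntries (heartbeat) arrives, or an election starts:
//     rf.timeout = randomElectionTimeout() // a fresh random value on every reset
//     rf.lastHeard = time.Now()
```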
Whether that delay matters depends on how high-performance we need the system to be and how often we think there will be failures. If failures happen once a year, then who cares; if we're expecting failures frequently, we may care very much how long it takes to recover. OK, so that's one consideration.

The other consideration is that this gap, the expected gap in time between the first timer going off and the second timer going off, has to be longer than the time it takes for the candidate to assemble votes from everybody in order to be useful; that is, longer than the expected round-trip time, the amount of time it takes to send an RPC and get the response. Maybe it takes 10 milliseconds to send an RPC and get a response from all the other servers, and if that's the case we need to make the maximum at least long enough that there's likely to be a 10-millisecond difference between the smallest random number and the next smallest random number.

And for you, the test code will get upset if you don't recover from a leader failure within a couple of seconds, so just pragmatically you need to tune this maximum down so that it's highly likely you'll be able to complete a leader election within a few seconds. But that's not a very tight constraint. Any questions about the election timeouts?

One tiny point is that you want to choose a new random timeout every time a node sets its election timer; that is, don't choose a random number when the server is first created and then reuse that same number over and over again. If you make an unlucky choice, that is, one server happens by ill chance to choose the same random number as another server, then you're going to have split votes over and over again, forever. That's why you almost certainly want to choose a fresh random number for the election timeout value every time you reset the timer.
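Here is one way that re-randomizing on every reset might look, as a rough sketch. The Raft struct fields, the ticker loop, and the 300 to 600 millisecond range below are assumptions for illustration, not anything prescribed by the assignment.

```go
package raft

import (
	"math/rand"
	"sync"
	"time"
)

// Raft here is a minimal stand-in with just the fields this sketch needs.
type Raft struct {
	mu               sync.Mutex
	electionDeadline time.Time
}

// electionTimeout picks a fresh random value in [300ms, 600ms).
func electionTimeout() time.Duration {
	return 300*time.Millisecond +
		time.Duration(rand.Int63n(int64(300*time.Millisecond)))
}

// resetElectionTimer is called whenever the peer hears from a valid leader
// (an AppendEntries) or grants a vote. It picks a *new* random timeout each
// time, so two peers that happened to collide once are unlikely to keep
// colliding and splitting votes.
func (rf *Raft) resetElectionTimer() {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	rf.electionDeadline = time.Now().Add(electionTimeout())
}

// ticker polls the deadline; if it passes without any word from a leader,
// this peer would start an election (the election itself is omitted here).
func (rf *Raft) ticker() {
	for {
		time.Sleep(10 * time.Millisecond)
		rf.mu.Lock()
		expired := time.Now().After(rf.electionDeadline)
		rf.mu.Unlock()
		if expired {
			// startElection() would go here; afterwards, reset with
			// another fresh random timeout.
			rf.resetElectionTimer()
		}
	}
}
```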
All right, so the final issue about leader election. Suppose we're in the situation where the old leader is partitioned: the network cable is broken, and the old leader is out there with a couple of clients and a minority of the servers, while the majority in the other half of the network elects a new leader. What about the old leader? Why won't the old leader cause incorrect execution?

[Inaudible student answer.]

Yes, there are two potential problems. One, really a non-problem, is that if there's a leader off in another partition and it doesn't have a majority, then the next time a client sends it a request, that leader in the minority partition will send out AppendEntries, but because it's in the minority partition it won't be able to get responses back from a majority of the servers (including itself), and so it will never commit the operation, it will never execute it, and it will never respond to the client saying that it executed it either. So with an old leader off in a different partition, clients may send it requests, but they'll never get responses, and no client will be fooled into thinking that that old server executed anything for it.
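The reason the stranded leader can never commit is the majority check it applies before advancing its commit index. A rough sketch of that check follows, with illustrative field names (peers, matchIndex, and so on) rather than any particular lab's API; it also ignores the additional current-term restriction the paper imposes, which comes up later.

```go
package raft

// A minimal stand-in for a leader's state; only the fields this sketch
// needs, with simplified types (in the labs, peers would be RPC endpoints).
type Raft struct {
	me          int
	peers       []int // one entry per server, including this one
	matchIndex  []int // highest log index known to be replicated on each peer
	commitIndex int
}

// maybeAdvanceCommit reports whether the entry at index is now stored on a
// majority of servers, counting the leader itself, and commits it if so.
func (rf *Raft) maybeAdvanceCommit(index int) bool {
	count := 1 // the leader already has the entry in its own log
	for i := range rf.peers {
		if i != rf.me && rf.matchIndex[i] >= index {
			count++
		}
	}
	if count > len(rf.peers)/2 {
		rf.commitIndex = index // a majority stores it, so it may commit
		return true
	}
	// In a minority partition this branch is taken forever: the entry is
	// never committed, never executed, and the client never hears back.
	return false
}
```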
The other, more tricky issue, which I'll actually talk about in a few minutes, is the possibility that before a leader fails it sends out AppendEntries to a subset of the servers and then crashes before committing anything. That's a very interesting question which I'll probably spend a good 45 minutes talking about. So actually, before I turn to that topic: in general, any more questions about leader election? OK.

OK, so how about the contents of the logs, and in particular how a newly elected leader, possibly picking up the pieces after an awkward crash of the previous leader, sorts out the possibly divergent logs on the different replicas in order to restore a consistent state in the system?

The first thing to notice is that this whole topic is really only interesting after a server crashes. If the servers stay up, relatively few things can go wrong: if we have a leader that's up and has a majority, then during that period it just tells the followers what the logs should look like, and the followers are not allowed to disagree; they're required to accept, by the rules of Figure 2. If they've been more or less keeping up, they just take whatever the leader sends them, append it to the log, obey commit messages, and execute; there's hardly anything to go wrong. The things that go wrong in Raft go wrong when the old leader crashes midway through sending out messages, or a new leader crashes just after it's been elected but before it's done anything very useful. So one thing we're very interested in is what the logs can look like after some sequence of crashes.

OK, so here's an example with three servers. The way I'm going to draw these diagrams, because we're going to be looking at a lot of situations where the logs look like this and wondering whether they're possible and what happens if they arise, is this: I'll write out log entries for each of the servers, aligned to indicate corresponding slots in the log, and the values I write are the term numbers rather than the client operations. So this is slot 1 and this is slot 2. Everybody saw a command from term 3 in slot 1; server 2 and server 3 also saw a command from term 3 in the second slot; and server 1 has nothing there at all. So the very first question for a picture like this is: could this setup arise, and how?

[Inaudible student answer.]

Yes. Just repeating what you said: maybe server 3 was the leader for term 3, it got a command, sent it out to everybody, and everybody received it and appended it to their log. Then server 3 got a second request from a client; maybe it sent it to all three servers but the message got lost on the way to server 1, or maybe server 1 was down at the time. The leader always appends new commands to its own log before it sends out AppendEntries, and maybe the AppendEntries RPC only got through to server 2. So this situation, about the simplest one since the logs differ only slightly, is one we know how could arise.

Now, if server 3, which is the leader, should crash, there will be an election and some new leader is chosen, and two things have to happen: the new leader has got to recognize that this command in slot 2 could have committed, so it's not allowed to throw it away, and it needs to make sure server 1 fills in that blank with the very same command that everybody else has in that slot. Another way this picture can come up is that server 3 might have sent the AppendEntries to server 2 but then crashed before sending the AppendEntries to server 1; so if we're electing a new leader, it could be because we got a crash before the message was sent.

Here's another scenario to think about: three servers again, and now I'm going to number the slots in the log so we can refer to them, say slots 10, 11, 12, and 13. It's the same setup as before, except that now in slot 12 server 2 has a command from term 4 and server 3 has a command from term 5. Before we analyze this to figure out what would happen, and what a server would do if it saw it, we need to ask whether this could even occur, because sometimes the answer to the question "oh jeez, what would happen if this configuration arose?" is that it cannot arise, so we don't have to worry about it. So: could this arise, and how?

[Inaudible student answer.]

In brief, we know this configuration can arise, and the way we can get the 4 and the 5 here is as follows. Suppose that in the next leader election server 2 is elected leader, now for term 4. It's elected leader, gets a request from a client, appends it to its own log, and crashes. So now we have this term-4 entry only on server 2, and we need a new election because the leader just crashed. In this election we have to ask who could be elected, and we're going to claim server 3 could be elected. The reason is that it only needs RequestVote responses from a majority, and that majority can be server 1 and server 3; there's no conflict between those two logs, so server 3 can be elected for term 5, get a request from a client, append it to its own log, and crash. That's how you get this configuration. So you need to be able to work through these scenarios in order to get to the stage of saying yes, this could happen, and therefore Raft must do something sensible, as opposed to it cannot happen, because some things can't happen.
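The mechanism that lets server 3 win the term-5 election is the "at least as up-to-date" restriction in the RequestVote handler from the Raft paper (Figure 2): compare the terms of the two logs' last entries first, then their lengths. A small sketch, with illustrative parameter names:

```go
package raft

// candidateUpToDate implements the voting restriction from the Raft paper:
// a voter grants its vote only if the candidate's log is at least as
// up-to-date as its own, comparing the term of the last entry first and
// the log length second.
func candidateUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}
```

In the example, when server 3 campaigns for term 5 its last entry is at index 11 with term 3, while server 1's last entry is at index 10 with term 3; the terms tie and server 3's log is at least as long, so server 1 grants the vote, which together with server 3's own vote is a majority (server 2 is down and never gets asked).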
All right, so what can happen now? We know this configuration can occur, so hopefully we can convince ourselves that Raft actually does something sensible with it. Before we talk about what Raft would actually do, we need some sense of what an acceptable outcome would be. Just eyeballing this: the command in slot 10, since it's present on all the replicas, could have been committed, so we cannot throw it away. Similarly, the command in slot 11, since it's on a majority of the replicas, could for all we know have been committed, so we can't throw it away either. The commands in slot 12, however: neither of them could possibly have been committed, so Raft is entitled to drop both of them (we don't yet know what it will actually do), even though it is not entitled to drop either of the commands in slots 10 or 11. It's entitled to drop the slot-12 entries; it's not required to drop both, but it certainly must drop at least one, because you have to end up with identical log contents.

Why do we say the slot-11 command could have been committed? We can't tell, by looking at the logs, exactly how far the leader got before crashing. One possibility, for this command or even the slot-10 one, is that the leader sent out the AppendEntries messages with the new command and then immediately crashed, so it never got any responses back. Because it crashed, the old leader did not know whether the entry was committed; and if it didn't get responses back, that means it didn't execute it and didn't send out the incremented commit index, so maybe the replicas didn't execute it either. So it's actually possible that this entry wasn't committed. If Raft knew more than it does, it might be legal to drop this log entry, because it might not have been committed; but on the evidence there's no way to disprove that it was committed. Based on what's in the logs, it could have been committed, and Raft can't prove it wasn't, so it must treat it as committed, because the leader might have crashed just after receiving the AppendEntries replies and replying to the client. So just looking at this, we can't rule out either possibility: the leader may have responded to the client, in which case we cannot throw away this entry because a client knows about it, or the leader may never have responded. Since we can't tell, we have to assume that it was committed.

[Inaudible student question.]

No, in that scenario the server crashed before getting the responses. All right, let's continue this on Thursday.
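As a recap of the slot-10 through slot-12 reasoning above, here is a small sketch that encodes the board example and applies the "present on a majority, so it could have been committed" test. The data layout is just an illustration of the diagram, not Raft code.

```go
package main

import "fmt"

func main() {
	// The board example: each server's log, as a map from slot (log index)
	// to the term of the entry stored there.
	logs := map[string]map[int]int{
		"S1": {10: 3},
		"S2": {10: 3, 11: 3, 12: 4}, // leader for term 4, appended, crashed
		"S3": {10: 3, 11: 3, 12: 5}, // leader for term 5, appended, crashed
	}
	majority := len(logs)/2 + 1

	for slot := 10; slot <= 12; slot++ {
		// Count how many servers hold each distinct entry in this slot.
		// (With differing terms, as in slot 12, no single entry reaches
		// a majority.)
		count := map[int]int{} // term -> number of servers holding it
		for _, log := range logs {
			if term, ok := log[slot]; ok {
				count[term]++
			}
		}
		mustKeep := false
		for term, n := range count {
			if n >= majority {
				fmt.Printf("slot %d (term %d): on a majority, could have been committed, must keep\n", slot, term)
				mustKeep = true
			}
		}
		if !mustKeep {
			fmt.Printf("slot %d: no entry on a majority, Raft may discard\n", slot)
		}
	}
}
```

Running this prints that slots 10 and 11 must be kept while slot 12 has no entry on a majority, matching the conclusion in the lecture.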