Transcript

All right, today I want to talk a bit more about fault tolerance and replication, and then look into the details of today's paper about VMware FT.

The topic is still fault tolerance to provide high availability: you want to build a service that keeps working even if some computer involved in the service crashes, and, to the extent we can, we'd like to provide the service even if there are network problems. The tool we're using, at least for this part of the course, is replication. So it's worth asking what kinds of failures replication can be expected to deal with, because it's not everything by any means.

Maybe the easiest way to characterize the kind of failures we're talking about is fail-stop failures of a single computer. What I mean by fail-stop — it's a sort of generic term in fault tolerance — is that if something goes wrong, the computer simply stops executing. It just stops if anything goes wrong, and in particular it doesn't compute incorrect results. So if somebody kicks the power cable out of your server, that's probably going to generate a fail-stop failure. Similarly if they unplug your server's network connection: even though the server is still running — this is a little bit funny — it's totally cut off from the network, so from the outside it looks like it just stopped. It's really these failures we can deal with using replication. This also covers some hardware problems: maybe the fan on your server breaks because it cost 50 cents, that causes the CPU to overheat, and the CPU shuts itself down cleanly and just stops executing.

What's not covered by the kind of replication systems we're talking about is things like bugs in software or design defects in hardware. Basically, not bugs: if we take some service — say you're a MapReduce master, for example — and we replicate it and run it on two computers, then if there's a bug in your MapReduce master, replication is not going to help us. We're going to compute the same incorrect result on both copies of the MapReduce master, and everything will look fine — they'll agree — it just happens to be the wrong answer. So we can't defend against bugs in the replicated software, and we can't defend against bugs in whatever scheme we're using to manage the replication. Similarly, as I mentioned before, we can't expect to deal with bugs in the hardware: if the hardware computes incorrectly, that's just the end for us, at least with this kind of technique.

That said, there are definitely hardware and software bugs that replication might, if you're lucky, be able to cope with. If there's some unrelated piece of software running on your server and it causes the server to crash — maybe it causes the kernel to panic and reboot, something that has nothing to do with the service you're replicating — then that kind of failure, for your service, may well look fail-stop: the kernel will panic, and the backup replica will take over. Similarly, some kinds of hardware errors can be turned into fail-stop errors. For example, if you send a packet over the network and the network corrupts it — just flips a bit in your packet — that will almost certainly be caught by the checksum on the packet. Same thing for a disk block: if you write some data to disk and read it back a month later, maybe the magnetic surface isn't perfect and a couple of bits in the block are wrong; disks actually have error-correcting codes that, up to a point, will fix errors in disk blocks. So you're turning random hardware errors into — either correcting them, if you're lucky, or at least detecting them — turning random corruption into a detected fault. The software then knows that something went wrong and can turn it into a fail-stop fault by stopping executing, or take some other remedial action. But in general, we can really only expect to handle fail-stop faults.
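As a concrete illustration of turning corruption into a detected fault, here's a minimal sketch using a CRC32 checksum over a message payload. The message layout and the choice of CRC32 are mine, just to show the pattern:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// A message as sent on the wire: payload plus a checksum computed by the sender.
type Message struct {
	Payload  []byte
	Checksum uint32
}

func send(payload []byte) Message {
	return Message{Payload: payload, Checksum: crc32.ChecksumIEEE(payload)}
}

// receive verifies the checksum, so corruption becomes a detected fault
// rather than silently wrong data.
func receive(m Message) ([]byte, error) {
	if crc32.ChecksumIEEE(m.Payload) != m.Checksum {
		return nil, fmt.Errorf("checksum mismatch: corrupt message")
	}
	return m.Payload, nil
}

func main() {
	m := send([]byte("hello"))
	m.Payload[0] ^= 1 // simulate the network flipping a bit
	if _, err := receive(m); err != nil {
		// The receiver can now treat this as fail-stop: drop the packet
		// (and let retransmission handle it) or stop and report the fault.
		fmt.Println(err)
	}
}
```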
There are other limits to replication, too. If we have a primary and a backup as our two replicas, or whatever the arrangement is, we're really assuming that failures in the two are independent. If they tend to have correlated failures, then replication is not going to help us. For example, if we're a big outfit and we buy computers in batches of thousands — identical computers from the same manufacturer — and we run our replicas on computers we bought at the same time from the same place, that's a bit of a risk, because presumably if one of them has a manufacturing defect, there's a good chance the others do too. If one of them is prone to overheating because the manufacturer didn't provide enough airflow, they probably all have that problem, and if one of them overheats and dies, there's a good chance the others will too. So that's one kind of correlated failure you just have to be careful of. Another one: if there's an earthquake in the city where our data center is, it's probably going to take out the whole data center. We can have all the replication we like inside that data center; it's not going to help us, because a failure caused by an earthquake, or a citywide power failure, or the building burning down is a correlated failure between our replicas if they're in that building. So if we care about dealing with earthquakes, then we need to put our replicas in different cities, or at least physically separate enough that they have separate power and are unlikely to be affected by the same natural disaster.

Okay, but that's all sort of hovering in the background for this discussion, where we're talking about the technology you might use. Another question about replication is whether it's worthwhile. You may ask yourself: gosh, these replication schemes literally use twice as much, or three times as much, computer resources.
GFS had three copies of everything, so we have to buy three times as much disk space. The paper for today replicates just once, but that means we have twice as many computers, CPUs, and RAM — it's all very expensive. Is it really worth that expense? That's not something we can answer technically; it's an economic question. It depends on the value of having an available service. If you're running a bank, and the consequence of the computer failing is that you can't serve your customers, you can't generate revenue, and your customers all hate you, then it may well be worth it to blow an extra ten or twenty thousand bucks on a second computer so you can have a replica. On the other hand, if you're me and you're running the 6.824 web server, I don't consider it worthwhile to have a hot backup of the 6.824 web server, because the consequences of failure are very low. So whether replication is worthwhile, how many replicas you ought to have, and how much you're willing to spend on it is all about how much cost and inconvenience failure would cause you.

All right. This paper, near the beginning, mentions that there are a couple of different approaches to replication — really it mentions two: one it calls state transfer, and the other it calls replicated state machine. Most of the schemes we're going to talk about in this class are replicated state machines, but I'll talk about both.

The idea behind state transfer is that if we have two replicas of a server, the way you cause them to stay in sync — that is, to be actual replicas, so that the backup has everything it needs to take over if the primary fails — is that the primary sends a copy of its entire state, for example the contents of its RAM, to the backup, and the backup just stores the latest state. It's all there, so if the primary fails, the backup can start executing with the last state it got. This is all about sending the state of the primary. If today's paper worked as a state transfer system — which it doesn't — then the state we'd be talking about would be the contents of the RAM, the memory, of the primary. Maybe every once in a while the primary would make a big copy of its memory and send it across the network to the backup. You can imagine that if you wanted to be efficient, maybe you would only send the parts of the memory that have changed since the last time you sent memory to the backup.

The replicated state machine approach observes that most services — most computer things we want to replicate — have some internal operation that's deterministic except when external input comes in. Ordinarily, if there are no external influences on a computer, it just executes one instruction after another, and what each instruction does is a deterministic function of what's in the memory and the registers of the computer.
It's only when external events intervene that something unexpected may happen — like a packet arriving at some random time, which causes the server to start doing something differently. So replicated state machine schemes don't send the state between the replicas; instead they just send those external events — from the primary to the backup, say — things like arriving input from the outside world that the backup needs to know about. The observation is that if you have two computers, and they start from the same state, and they see the same inputs in the same order and at the same times, then the two computers will continue to be replicas of each other and execute identically, as long as they both see the same inputs at the same time. So state transfer ships, probably, memory, and a replicated state machine ships, from primary to backup, just the operations from clients — the external inputs or external events.

The reason people tend to favor replicated state machines is that usually operations are smaller than the state. The state of a server — if it's a database server, the state might be the entire database, might be gigabytes — whereas the operations are just some client sending "please read or write key 27." Operations are usually small and the state is usually large, so a replicated state machine usually looks attractive. The slight downside is that these schemes tend to be quite a bit more complicated, and rely on more assumptions about how the computers operate, whereas state transfer is really heavy-handed — "I'm just going to send you my whole state" — and there's sort of nothing to worry about.

Any questions about these strategies? Okay, so the question is: suppose something went wrong with our scheme and the backup was not actually identical to the primary. Suppose we were running a GFS master, and the primary just handed out a lease for some chunk to chunk server 1, but because we've allowed the states of the primary and backup to drift out of sync, the backup did not issue a lease to anybody — it wasn't even aware anybody had asked for one. So now the primary thinks chunk server 1 has a lease for some chunk and the backup doesn't. The primary fails and the backup takes over. Now chunk server 1 thinks it has a lease for some chunk, but the current master doesn't think so, and is happy to hand out the lease to some other chunk server. Now we have two chunk servers that both think they hold the lease for the same chunk. That's just a close-to-home example, but really, I think you can construct almost any bad scenario by just imagining some service that computes the wrong answer because the states diverged.
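To make the contrast concrete, here's a minimal sketch of the two approaches for a trivial counter service — entirely my own illustration, not from the paper: state transfer ships the whole state, while a replicated state machine ships only the small, deterministic operations:

```go
package main

import "fmt"

// The service's state; imagine this being gigabytes in a real database.
type State struct{ Counter int }

// A deterministic operation: applying the same op to the same state
// always gives the same result.
type Op struct{ Delta int }

func (s *State) Apply(op Op) { s.Counter += op.Delta }

func main() {
	primary, backup := &State{}, &State{}

	// State transfer: periodically copy the entire state to the backup.
	primary.Apply(Op{Delta: 1})
	*backup = *primary // ships the whole state, however large it is

	// Replicated state machine: ship only the (usually small) operations;
	// the backup stays in sync by applying them in the same order.
	ops := []Op{{Delta: 2}, {Delta: 3}}
	for _, op := range ops {
		primary.Apply(op)
		backup.Apply(op) // in a real system this crosses the network
	}

	fmt.Println(primary.Counter, backup.Counter) // 6 6: still replicas
}
```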
A student asks about randomization. I'll talk about this a bit later on, but it's a good point: the replicated state machine scheme definitely makes the most sense when the instructions the primary and the backup are executing do the same thing, as long as there are no external events. And that's almost true. For an add instruction, if the registers and memory start out the same and they both execute the add, they get the same inputs and the same outputs. But there are some instructions, as you point out, that don't behave that way: maybe there's an instruction that gets the current time of day, and the two machines will probably execute it at slightly different times; or an instruction that gets the current processor's unique ID or serial number, which is going to yield different answers. The uniform answer to questions that sound like this is that the primary executes the instruction and sends the answer to the backup, and the backup does not execute that instruction; instead, at the point where it would execute it, it listens for the primary to tell it what the right answer was, and just fakes that answer to the software. I'll talk about how the VMware scheme does that.

Interestingly enough, though today's paper is all about a replicated state machine, you may have noticed that it only deals with uniprocessors, and it's not clear how it could be extended to a multi-core machine, where the interleavings of the instructions from the two cores are non-deterministic. On a multi-core machine we no longer have the situation where, if we just let the primary and backup execute, all else being equal, they're going to stay the same — because they won't, if they execute on multiple cores. VMware has since come out with a new, possibly completely different, replication system that does work on multi-core, and the new system appears to me to be using state transfer instead of a replicated state machine, because state transfer is more robust in the face of multi-core and parallelism: if you pause the machine and send the memory over, the memory image just is the state of the machine, and it doesn't matter that there was parallelism, whereas the replicated state machine scheme really has a problem with parallelism. On the other hand, I'm guessing that this new multi-core scheme is more expensive.

All right, so if we want to build a replicated state machine scheme, we have a number of questions to answer. We need to decide at what level we're going to replicate state — what do we mean by state? We have to worry about how closely synchronized the primary and backup have to be, because the primary is likely to execute a little bit ahead of the backup; after all, it's the primary that sees the inputs, so the backup almost necessarily must lag. That means there's an opportunity, if the primary fails, for the backup not to be fully caught up. Having the backup actually execute in lockstep with the primary is very expensive, because it requires a lot of chitchat, so a lot of what designers sweat about is how close the synchronization is.
If the primary fails — or actually if the backup fails too, but it's more exciting if the primary fails — there has to be some scheme for switching over, and the clients have to know: gosh, instead of talking to the old primary on server 1, I should now be talking to the backup on server 2. All the clients have to somehow figure this out. As for the switchover, it's almost certainly impossible — maybe truly impossible — to design a cutover system in which no anomalies are ever visible. In an ideal world, if the primary fails, we'd like nobody to ever notice — none of the clients to notice — and it turns out that's basically unattainable. So there are going to be anomalies during the cutover, and we've got to figure out a way to cope with them. And finally, if one of our two replicas fails, we really need to create a new replica: if we have two replicas and one fails, we're just living on borrowed time, because the second replica may fail at some point. So we absolutely need to get a new replica back online as fast as possible, and that can be very expensive. The state is big; the reason we liked the replicated state machine was that we thought state transfer would be expensive, but the two replicas in a replicated state machine still need to have the full state — we just had a cheap way of keeping them both in sync. If we need to create a new replica, we actually have no choice but state transfer: the new replica needs a complete copy of the state. So it's going to be expensive to create new replicas, and people spend a lot of time worrying about all these questions; we'll see them again as we look at other replicated state machine schemes.

On the topic of what state to replicate, today's paper has a very interesting answer: it replicates the full state of the machine — that is, all of memory and all the machine registers. It's a very, very detailed replication scheme: no difference, even at the lowest levels, between the primary and the backup. That's quite rare for replication schemes. Almost always you see something more like GFS. GFS absolutely did have replication, but it wasn't replicating every single bit of memory between the primaries and the backups; it was replicating a much more application-level table of chunks. It had this abstraction of chunks and chunk identifiers, and that's what it was replicating; it wasn't going to the expense of replicating every single other thing the machines were doing, as long as they had the same application-visible set of chunks. So most replication schemes out there go the GFS route. In fact, almost everything — pretty much everything except this paper and a handful of similar systems — uses application-level replication at some level, because it can be much more efficient.
For example, we don't have to go to the trouble of making sure that interrupts occur at exactly the same point in the execution of the primary and backup. GFS does not sweat that at all, but this paper has to, because it replicates at such a low level. So most people build efficient systems with application-specific replication. The consequence, though, is that the replication has to be built right into the application. If you're getting a feed of application-level operations, you really need the application to participate, because some generic replication layer like today's paper can't understand the semantics of what needs to be replicated. So most schemes are application-specific, like GFS and every other paper we're going to read on this topic. Today's paper is unique in that it replicates at the level of the machine, and therefore does not care what software you run on it. It replicates the low-level memory and machine registers; you can run any software you like on it, as long as it runs on the kind of microprocessor being represented. The software can be anything. The downside is that it's not necessarily that efficient. The upside is that you can take any existing piece of software — maybe you don't even have source code for it, or understand how it works — and, within some limits, you can just run it under this replication scheme, under VMware, and it'll just work. It's sort of a magic fault-tolerance wand for arbitrary software.

All right, now let me talk about how VMware FT works. First of all, VMware is a virtual machine company; a lot of their business is selling virtual machine technology. What virtual machines refer to is the idea that you buy a single computer and, instead of booting an operating system like Linux directly on the hardware, you boot what we'll call a virtual machine monitor, or hypervisor, on the hardware. The hypervisor's job is to simulate multiple virtual computers on this one piece of hardware. The virtual machine monitor may boot up one instance of Linux, maybe multiple instances of Linux, maybe a Windows machine — the virtual machine monitor on this one computer can run a bunch of different operating systems, each of them being itself some sort of operating system kernel plus applications. This is the technology they're starting with. The reason for this is that it turns out there are many, many reasons why it's very convenient to interpose this level of indirection between the hardware and the operating systems. It means we can buy one computer and run lots of different operating systems on it; if we run lots and lots of little services, instead of having to have lots and lots of computers, one per service, we can just buy one computer and run each service in the operating system that it needs, in its own virtual machine.
So this was their starting point: they already had this stuff, and a lot of sophisticated things built around it, at the start of designing VMware FT. That's just virtual machines. What the paper is doing requires two physical machines, because there's no point in running the primary and backup software in different virtual machines on the same physical machine — we're trying to guard against hardware failures. So you have two machines running their virtual machine monitors; the primary is going to run on one, and the backup on the other. On one of these machines we have a guest — it might be running a lot of virtual machines, but we only care about one of them — running some guest operating system and some sort of server application, maybe a database server, a MapReduce master, or something. I'll call that the primary. And there'll be a second machine that runs the same virtual machine monitor and an identical virtual machine holding the backup: the same operating system, exactly the same. The virtual machine monitors give these guest operating systems, the primary and the backup, each a range of memory, and the memory images will be identical — or the goal is to make them identical — on the primary and the backup. So we have two physical machines, each of them running a virtual machine guest with its own copy of the service we care about. We're assuming there's a network connecting these two machines, and in addition, on this local area network there's some set of clients. Really, they don't have to be clients; they're just other computers that our replicated service needs to talk with; some of them are clients sending requests. It turns out in this paper that the replicated service actually doesn't use a local disk; instead, it assumes there's some sort of disk server that it talks to, although that's a little bit hard to realize from the paper. The scheme doesn't really treat the disk server specially — it's just another external source of packets, and a place the replicated state machine may send packets to, not very much different from the clients.

Okay, so the basic scheme is this. We assume these two replicas, the two virtual machines, primary and backup, are exact replicas. Some client — a database client, whatever client our replicated server has — sends a request to the primary, and that really takes the form of a network packet. That packet generates an interrupt, and this interrupt actually goes to the virtual machine monitor, at least in the first instance. The virtual machine monitor sees: aha, here's an input for this replicated service. So the virtual machine monitor does two things. One is that it simulates a network-packet-arrival interrupt into the primary guest operating system, to deliver the packet to the primary copy of the application. The other is that the virtual machine monitor knows this is an input to a replicated virtual machine, so it sends a copy of that packet back out on the network to the backup's virtual machine monitor.
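Here's that input path as a sketch, with made-up types standing in for the VMM internals — the paper doesn't give an API; this just shows the deliver-locally-and-forward pattern:

```go
package main

import "fmt"

// Packet is a client request as seen by the primary's VMM.
type Packet struct{ Data string }

// deliverToGuest stands in for "deliver a simulated NIC interrupt into
// the guest OS".
func deliverToGuest(who string, p Packet) {
	fmt.Printf("%s guest: interrupt, packet %q\n", who, p.Data)
}

// primaryVMM handles an arriving client packet: deliver it to the local
// guest, and forward a copy over the logging channel to the backup.
func primaryVMM(p Packet, loggingChannel chan<- Packet) {
	deliverToGuest("primary", p)
	loggingChannel <- p // the backup needs to see the same input
}

// backupVMM consumes the logging channel and fakes the same packet
// arrival into its own guest.
func backupVMM(loggingChannel <-chan Packet, done chan<- bool) {
	for p := range loggingChannel {
		deliverToGuest("backup", p)
	}
	done <- true
}

func main() {
	log := make(chan Packet, 16)
	done := make(chan bool)
	go backupVMM(log, done)
	primaryVMM(Packet{Data: "increment"}, log)
	close(log)
	<-done
}
```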
The backup's virtual machine monitor knows: aha, this is a packet for this particular replicated state machine, and it also fakes a network-packet-arrival interrupt at the backup and delivers the packet. So now both the primary and the backup have a copy of the packet; they see the same input, and — with a lot of details to come — they're going to process it in the same way and stay synchronized. Of course, the service is probably going to reply to the client. On the primary, the service will generate a reply packet and send it on the NIC that the virtual machine monitor is emulating; the virtual machine monitor will see that output packet on the primary and will actually send the reply back out on the network to the client. Because the backup is running exactly the same sequence of instructions, it also generates a reply packet to the client and sends it on its emulated NIC. It's the virtual machine monitor that's emulating that network interface card, and it says: aha, I know this is the backup, and only the primary is allowed to generate output — and the backup's virtual machine monitor drops the reply packet. So both of them see inputs, and only the primary generates outputs.

As far as terminology goes, the paper calls this stream of input events — and the other events we'll talk about — the logging channel. It presumably all goes over the same network, but these events the primary sends the backup are called log events, on the logging channel.

Where the fault tolerance comes in is this: if the primary crashes, what the backup is going to see is that it stops getting log entries on the logging channel. It turns out the backup can expect many log entries per second, because one of the things that generates log entries is the periodic timer interrupts in the primary — every timer interrupt generates a log entry sent to the backup, and these timer interrupts happen something like 100 times a second. So the backup can certainly expect to see a lot of chitchat on the logging channel if the primary is up. If the primary crashes, the virtual machine monitor on the backup will say: gosh, I haven't received anything on the logging channel for a second, or however long — the primary must be dead, or something. In that case, when the backup stops seeing log entries from the primary, the way the paper phrases it is that the backup "goes live." What that means is that it stops waiting for input events on the logging channel from the primary, and instead the virtual machine monitor just lets the backup execute freely, without being driven by the input events from the primary.
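The detection here amounts to a timeout on the logging channel. A sketch — the one-second threshold is my own illustrative number:

```go
package main

import (
	"fmt"
	"time"
)

// LogEvent is one entry on the logging channel (input packet,
// interrupt, etc.).
type LogEvent struct{}

// backupMonitor waits for log events; if none arrive for too long, it
// declares the primary dead and goes live.
func backupMonitor(loggingChannel <-chan LogEvent) {
	const deadline = 1 * time.Second // illustrative threshold
	for {
		select {
		case ev := <-loggingChannel:
			_ = ev // replay the event into the backup guest
		case <-time.After(deadline):
			fmt.Println("no log entries from primary; going live")
			return // stop replaying; let the backup execute freely
		}
	}
}

func main() {
	log := make(chan LogEvent)
	go func() {
		// Normally timer interrupts keep this channel busy ~100x/second,
		// so silence is strong evidence the primary is dead.
		log <- LogEvent{}
	}()
	backupMonitor(log)
}
```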
The VMM also does something to the network to cause future client requests to go to the backup instead of the primary, and the backup's VMM stops discarding the output from its virtual machine — it's the primary now, not the backup. So now this virtual machine directly gets the inputs and is allowed to produce output, and our backup has taken over. Similarly — this is less interesting, but it has to work correctly — if the backup fails, the primary has to use a similar process to abandon the backup: stop sending it events, and act much more like a single non-replicated server. So either one of them can go live if the other one appears to be dead — that is, stops generating network traffic.

A student asks how clients end up talking to the backup — is it magic? It depends on what the networking technology is. I think, with this paper, one possibility is that this is sitting on Ethernet: every physical computer — really, every NIC — has a 48-bit unique ID. Now, I'm making this up, but it could be that, in fact, instead of each physical computer having a unique ID, each virtual machine does, and when the backup takes over, it essentially claims the primary's Ethernet ID as its own. It starts saying "I'm the owner of that ID," and then other machines on the Ethernet will start sending it the packets. That's my interpretation.

On the question of non-deterministic operations: the designers believed they had identified all such sources, and for each one of them, the primary does whatever it is — executes the random number generator instruction, or takes an interrupt at some time — and the backup does not. The backup's virtual machine monitor detects any such instruction, intercepts it, and doesn't execute it; instead, the backup waits for an event on the logging channel saying: at this instruction number, the random number (or whatever it was) was this value on the primary.

And yes, in answer to a question: the paper hints that they got Intel to add features to the microprocessor to support exactly this, but they don't say what it was.
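Here's that interception pattern as a sketch, with a hypothetical get-time-of-day standing in for whatever non-deterministic instruction the guest executes — the names and plumbing are mine, not the paper's:

```go
package main

import (
	"fmt"
	"time"
)

// InstrResult is the outcome of one non-deterministic instruction, as
// logged by the primary.
type InstrResult struct{ Value int64 }

// On the primary: actually execute the instruction, then send the
// result to the backup over the logging channel.
func primaryGetTimeOfDay(loggingChannel chan<- InstrResult) int64 {
	v := time.Now().UnixNano() // genuinely non-deterministic
	loggingChannel <- InstrResult{Value: v}
	return v
}

// On the backup: don't execute the instruction at all; block until the
// primary's result arrives, and fake that answer to the guest software.
func backupGetTimeOfDay(loggingChannel <-chan InstrResult) int64 {
	r := <-loggingChannel
	return r.Value
}

func main() {
	log := make(chan InstrResult, 1)
	p := primaryGetTimeOfDay(log)
	b := backupGetTimeOfDay(log)
	fmt.Println(p == b) // true: both guests observed the same "time"
}
```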
Okay, so on that topic: so far the story has sort of assumed that as long as the backup sees the packets from the clients, it'll execute identically to the primary, and that's actually glossing over some huge and important details. One problem, as a couple of people have mentioned, is that some things are non-deterministic. It's not the case that every single thing that happens in the computer is a deterministic function of the contents of the computer's memory. It is for straight-line code execution, often, but certainly not always. What we're worried about is things that may happen that are not a strict function of the current state — things that might be different, if we're not careful, on the primary and the backup. These are the non-deterministic events. The designers had to sit down and figure out what they all were, and here's the kind of thing they talk about.

One is inputs from external sources, like clients, which arrive just whenever they arrive. They're not predictable; there's no sense in which the time at which a client request arrives, or its content, is a deterministic function of the service's state, because it isn't. This system is really dedicated to a world in which services only talk over the network, so basically the only form of input or output supported by this system seems to be network packets coming and going. So "an input arrives" really means a packet arrives, and what a packet really consists of, for us, is the data in the packet plus the interrupt that signals that the packet has arrived. That's quite important. When a packet arrives, ordinarily the NIC DMAs the packet contents into memory and then raises an interrupt, which the operating system fields; the interrupt happens at some point in the instruction stream. Both of those have to look identical on the primary and the backup, or else their executions are going to diverge. So the real issue is when the interrupt occurs — exactly at which instruction the interrupt happens — and it had better be the same on the primary and the backup; otherwise their execution is different and their states are going to diverge. So we care about the content of the packet and the timing of the interrupt.

Then, as a couple of people have mentioned, there are a few instructions that behave differently on different computers, or differently depending on circumstances: maybe a random number generator instruction, get-time-of-day instructions that yield different answers if called at different times, and unique-ID instructions.

Another huge source of non-determinism, which the paper basically rules out, is multi-core parallelism. This is a uniprocessor-only system; there's no multi-core in this world. The reason is that if it allowed multi-core, the service would be running on multiple cores, and the instructions of the service on the different cores are interleaved in some way that's not predictable. If we run the same parallel code on the backup, the hardware will interleave the instructions on the two cores in different ways, and that can just cause different results. Suppose the code on the two cores both ask for a lock on some data: on the primary, core 1 may get the lock before core 2, while on the backup, just because of a tiny timing difference, core 2 may get the lock first — and the execution results are likely to be totally different if different threads get the lock. So multi-core is a grim source of non-determinism and is just totally outlawed in this paper's world; indeed, as far as I can tell, the techniques are not really applicable to it. The service can't use multi-core parallelism.
The hardware is almost certainly multi-core parallel, but that's the hardware sitting underneath the virtual machine monitor. The machine that the virtual machine monitor exposes to the guest operating systems running the primary and backup — that emulated virtual machine — is a uniprocessor machine in this paper, and I'm guessing there's not an easy way for them to adapt this design to multi-core virtual machines.

Okay, so it's these events that go over the logging channel. As for the format of a log entry, they don't quite say, but I'm guessing there are really three things in a log entry. First, there's the instruction number at which the event occurred, because if you're delivering an interrupt or an input or whatever, it had better be delivered at exactly the same place on the primary and the backup, so we need to know the instruction number. By instruction number I mean the number of instructions executed since the machine booted — not the instruction address, but something like "we're executing the four-billion-and-79th instruction since boot." So a log entry is going to have an instruction number: for an interrupt or an input, it's the instruction at which the interrupt was delivered on the primary; for a weird instruction like get-time-of-day, it's the instruction number of the get-time-of-day (or whatever) instruction that was executed on the primary, so the backup knows where to cause this event to occur. Second, there's going to be a type — network input, or a weird instruction, or whatever. And third, there's going to be data: for a packet arrival, it's the packet data; for one of these weird instructions, it's the result of the instruction when it was executed on the primary, so that the backup virtual machine can fake the instruction and supply that same result.
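They don't spell this out, so here's the guessed three-field log entry as a struct — my reconstruction; the field names are mine, and the numeric example echoes the four-billion-and-79th instruction above:

```go
package main

import "fmt"

type EventType int

const (
	NetworkInput     EventType = iota // an arriving packet
	WeirdInstruction                  // e.g. get-time-of-day, random number
)

// LogEntry is one event on the logging channel. This layout is a guess
// based on the lecture's description, not something the paper spells out.
type LogEntry struct {
	// How many instructions the guest has executed since boot, so the
	// backup can deliver the event at exactly the same point in its own
	// instruction stream.
	InstructionNumber uint64
	Type              EventType
	// Packet contents for an input, or the primary's result for a
	// non-deterministic instruction.
	Data []byte
}

func main() {
	e := LogEntry{
		InstructionNumber: 4000000079,
		Type:              NetworkInput,
		Data:              []byte("client request"),
	}
	fmt.Printf("%+v\n", e)
}
```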
As an example: both of these guest operating systems require that the hardware — in this case the emulated hardware, the virtual machine — has a timer that ticks, say, a hundred times a second and causes interrupts to the operating system. That's how the operating system keeps track of time: by counting these timer interrupts. Notice why they have to happen at exactly the same place on the primary and the backup: otherwise the two don't execute the same, and they diverge. What really happens is that there's a timer on the physical machine that's running the FT virtual machine monitor. The physical machine's timer ticks and delivers a timer interrupt to the virtual machine monitor on the primary. The virtual machine monitor, at the appropriate moment, stops the execution of the primary, writes down the instruction number it was at — instructions since boot — and then delivers a fake, simulated interrupt into the guest operating system on the primary at that instruction number, saying: the timer hardware you're emulating just ticked; here's the interrupt. Then the primary's virtual machine monitor sends the instruction number at which the interrupt happened to the backup. The backup's virtual machine monitor is, of course, also taking timer interrupts from its own physical timer, but it's not giving its real physical timer interrupts to the backup operating system; it just ignores them. When the log entry for the primary's timer interrupt arrives, the backup's virtual machine monitor arranges with the CPU — and this requires special CPU support — to cause the physical machine to interrupt at the same instruction number at which the timer interrupt happened on the primary. At that point the virtual machine monitor gets control back from the guest and fakes the timer interrupt into the backup operating system at exactly the same instruction number as it occurred on the primary.

So the observation is that this relies on the CPU having some special hardware in it, where the VMM can tell the CPU "please interrupt a thousand instructions from now," so that it will interrupt at the right instruction number — the same instruction as the primary did. The VMM just tells the CPU to resume executing the backup, and exactly a thousand instructions later the CPU will force an interrupt into the virtual machine monitor. That's special hardware, but it turns out it's on all Intel chips, so it's not that special anymore — fifteen years ago it was exotic, now it's totally normal. And there are a lot of other uses for it: if you want to do CPU time profiling, one way is to have the microprocessor interrupt every thousand instructions, and this is the same hardware that would cause the microprocessor to generate an interrupt every thousand instructions. So it's a very natural sort of gadget to want in your CPU.

A student asks: what if the backup gets ahead of the primary? So, suppose we, looking down from above, know that the primary is about to take an interrupt at the millionth instruction, but the backup has already executed the million-and-first instruction. If we let that happen, it's going to be too late to deliver the interrupt at the same point in the primary's instruction stream and the backup's instruction stream. So we cannot let the backup get ahead of the primary in execution. The way VMware FT does that is that the backup's virtual machine monitor keeps a buffer of waiting events that have arrived from the primary, and it will not let the backup execute unless there's at least one event in that buffer. If there's an event in the buffer, then it knows, from the instruction number, the place at which it's got to force the backup to stop executing. So the backup is always executing with the CPU being told exactly where the next stopping point — the next instruction number at which to stop — is, because the backup only executes if it has an event that tells it where to stop next. That means it starts up after the primary: the backup can't even start executing until the primary has generated the first event and that event has arrived at the backup. So the backup is basically always at least one event behind the primary, and if it's slower for some other reason — maybe there's other stuff running on that physical machine — then the backup might get multiple events behind the primary.
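Here's a sketch of that never-get-ahead rule. The runUntil function is a hypothetical stand-in for the special CPU support that traps after a given instruction count:

```go
package main

import "fmt"

// LogEntry is an event from the primary, tagged with the instruction
// number at which it must be delivered.
type LogEntry struct {
	InstructionNumber uint64
	Data              []byte
}

// runUntil stands in for the special CPU support: execute the backup
// guest, but force a trap back into the VMM at the given instruction
// number. Hypothetical, for illustration only.
func runUntil(instr uint64) {
	fmt.Printf("backup executed up to instruction %d\n", instr)
}

// backupExecute never lets the backup run ahead of the primary: it
// executes only when a buffered event says where the next stop is.
func backupExecute(events <-chan LogEntry) {
	for ev := range events { // block until at least one event is buffered
		runUntil(ev.InstructionNumber) // stop exactly where the primary saw it
		fmt.Printf("deliver event at %d: %q\n", ev.InstructionNumber, ev.Data)
	}
}

func main() {
	events := make(chan LogEntry, 16)
	events <- LogEntry{InstructionNumber: 1000, Data: []byte("timer tick")}
	events <- LogEntry{InstructionNumber: 2000, Data: []byte("packet")}
	close(events)
	backupExecute(events)
}
```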
All right, there's one little piece of mess about the specific case of arriving packets. Ordinarily, when a packet arrives from the network — if we weren't running a virtual machine — the network interface card would DMA the packet contents into the memory of the computer it's attached to, as the data arrives off the wire. And that means — you should never write software like this, but — it could be that the operating system running on the computer actually sees the data of a packet as it's DMA'd, copied from the network interface card into memory. We don't know what operating system will run here; the system is designed so that it can support any operating system, and maybe there's an operating system out there that watches arriving packets in memory as they're copied in. We can't let that happen, because if the primary happens to be playing that trick — if we allowed the network interface card to DMA incoming packets directly into the memory of the primary — we don't have any control over the exact timing of when the network interface card copies data into memory, and so we're not going to know at what times the primary did or didn't observe data from the arriving packet. So what that means is that, in fact, the NIC copies incoming packets into private memory of the virtual machine monitor, and then the network interface card interrupts the virtual machine monitor and says: a packet has arrived. At that point the virtual machine monitor suspends the primary, remembers what instruction number it suspended at, copies the entire packet into the primary's memory while the primary is suspended and not looking at the copy, and then emulates a network interface card interrupt into the primary. Then it sends the packet and the instruction number to the backup. The backup's virtual machine monitor will likewise suspend the backup at that instruction number and copy the entire packet — again, the backup is guaranteed not to be watching the data arrive — and then fake an interrupt at the same instruction number as on the primary. This is the bounce buffer mechanism explained in the paper.
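And a miniature of the bounce-buffer idea — all the types here are mine; the point is just that the copy into guest memory happens while the guest is suspended, so the guest can never observe a half-arrived packet:

```go
package main

import "fmt"

// Guest models a suspended/resumed virtual machine with its own memory.
type Guest struct {
	Memory    []byte
	suspended bool
}

func (g *Guest) Suspend() { g.suspended = true }
func (g *Guest) Resume()  { g.suspended = false }

// deliverPacket is the VMM's bounce-buffer path: the NIC has DMA'd the
// packet into the VMM's private buffer (not guest memory); now copy it
// into the guest atomically, while the guest cannot be watching.
func deliverPacket(g *Guest, bounceBuffer []byte) {
	g.Suspend() // guest is not executing, so it can't observe the copy
	copy(g.Memory, bounceBuffer)
	g.Resume()
	fmt.Println("fake NIC interrupt into guest") // at a recorded instruction number
}

func main() {
	g := &Guest{Memory: make([]byte, 64)}
	deliverPacket(g, []byte("client request"))
	fmt.Printf("guest memory: %q\n", g.Memory[:14])
}
```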
In answer to a question: the only instructions that result in logging channel traffic are the weird instructions, which are rare — instructions that might yield a different result if executed on the primary and the backup, like an instruction to get the current time of day, or the current processor number, or to ask how many instructions have been executed. Those turn out to be relatively rare. There's also an instruction on some machines to ask for a hardware-generated random number, for cryptography or something. But those are not everyday instructions; most instructions, like add instructions, are going to get the same result on primary and backup.

And yes — the way client packets get replicated to the backup is just by forwarding. That's exactly right: each network packet is packaged up and forwarded as it is, as a network packet, and is interpreted by the TCP/IP stack on both sides. So I'm expecting 99.99% of the logging channel traffic to be incoming packets, and only a tiny fraction to be results from special non-deterministic instructions. And so we can kind of guess what the traffic load is likely to be: for a server that serves clients, it's basically a copy of every client packet, and then we'll know how fast the logging channel has to be.

All right, so it's worth talking a little bit about how output works. In this system, output basically means only one thing: sending packets. Client requests come in as network packets, the response goes back out as network packets, and there's really no other form of output. As I mentioned, both primary and backup compute the output packet they want to send and ask their emulated NIC to send it; it's really sent on the primary, and the output packet is simply discarded on the backup.

Okay, but it turns out to be a little more complicated than that. Suppose what we're running is some sort of simple database server, and the client operation our database server supports is increment: the client sends an increment request, the database server increments the value and sends back the new value. Say everything's fine so far, and the primary and backup both have the value 10 in memory as the current value of the counter, and some client on the local area network sends an increment request to the primary. That packet is delivered to the primary, it's executed by the primary's server software, and the primary says: current value is 10, I'm going to change it to 11 — and sends a response packet back to the client saying 11. The same request, as I mentioned, is also sent to the backup and processed there: it changes its 10 to 11 as well, generates a reply, and throws the reply away. That's what's supposed to happen with the output.

However, you also need to ask yourself what happens if there's a failure at an awkward time. In this class, you should always ask yourself: what's the most awkward time to have a failure, and what would happen if a failure occurred then?
So suppose the primary does indeed generate the reply back to the client, but the primary crashes just after sending its reply. And furthermore — much worse — this is just a network, and it doesn't guarantee to deliver packets, so let's suppose the log entry on the logging channel got dropped when the primary died. So now the state of play is: the client received a reply saying 11, but the backup did not get the client request, so its state is still 10. Now the backup takes over, because it sees the primary is dead, and this client — or maybe some other client — sends an increment request to the new primary. The backup is now really processing these requests, so when it gets the next increment request, it's going to change its state to 11 and generate a second "11" response — maybe to the same client, maybe to a different client. If the clients compare notes, or if it's the same client, that's just obviously something that could not have happened with a single server. And because we have to support unmodified software that does not know there's any funny business of replication going on, we do not have the opportunity to change the client — you can imagine the client could be changed to realize something funny happened with the fault tolerance and do, I don't know what, but we don't have that option here, because this whole system only makes sense if we're running unmodified software. So this would be a disaster; we can't let this happen.

Does anybody remember from the paper how they prevent this from happening? Yeah — the output rule. The output rule is their solution to this problem, and the idea is that the primary is not allowed to generate any output — and what we're talking about now is this reply — until the backup acknowledges that it has received all log records up to this point.

So here's the real sequence at the primary. Let's un-crash the primary and go back to the beginning, with both replicas at 10. With the output rule, at the time the input arrives, the virtual machine monitor sends a copy of the input to the backup — so the time at which this log message with the input is sent is strictly before the primary generates the output, which is sort of obvious. After firing this log entry off across the network — it's now heading towards the backup, though it might be lost — the virtual machine monitor delivers the request to the primary server software, and it generates the output. So now the primary has actually changed its state to 11 and generated an output packet that says 11, but the virtual machine monitor says: wait a minute, we're not allowed to release that output until all previous log records have been acknowledged by the backup — and this log entry carrying the input is the most recent
previous log message. So the output is held by the virtual machine monitor until the log entry containing the input packet from the client is delivered to the backup's virtual machine monitor and buffered by it — not necessarily executed; it may just be waiting for the backup to get to that point in the instruction stream. Then the virtual machine monitor on the backup sends an acknowledgment packet back saying "yes, I did get that input," and only when the acknowledgment comes back will the virtual machine monitor on the primary release the packet out onto the network. The idea is that if the client could have seen the reply, then necessarily the backup must have seen the request, and at least buffered it — so we no longer get this weird situation in which a client can see a reply, but then there's a failure and a cutover, and the replica didn't know anything about the request behind that reply. There's also the situation where this log entry was lost and then the primary crashes: since the entry hadn't been delivered, the backup hadn't sent the acknowledgment, which means that if the primary crashed, it must have crashed before its virtual machine monitor released the output packet — and therefore the client couldn't have gotten the reply, so it's not in a position to spot any irregularities. Everybody happy with the output rule?
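Here's the output rule as a small sketch, assuming the simplest possible bookkeeping — sequence-numbered log entries and a cumulative acknowledgment; the names and structure are mine, not the paper's:

```go
package main

import "fmt"

// primaryVMM applies the output rule: an output packet generated by the
// guest may not be released until the backup has acknowledged every log
// entry sent so far.
type primaryVMM struct {
	lastSent  uint64   // highest log-entry sequence number sent to the backup
	lastAcked uint64   // highest sequence number the backup has acknowledged
	held      [][]byte // output packets waiting for the backup's ack
}

func (p *primaryVMM) SendLogEntry() { p.lastSent++ }

func (p *primaryVMM) GuestOutput(pkt []byte) {
	if p.lastAcked < p.lastSent {
		p.held = append(p.held, pkt) // hold: backup might not have the input yet
		return
	}
	fmt.Printf("release %q\n", pkt)
}

func (p *primaryVMM) AckArrived(seq uint64) {
	p.lastAcked = seq
	if p.lastAcked >= p.lastSent {
		for _, pkt := range p.held {
			fmt.Printf("release %q\n", pkt)
		}
		p.held = nil
	}
}

func main() {
	vmm := &primaryVMM{}
	vmm.SendLogEntry()                  // the input is forwarded to the backup...
	vmm.GuestOutput([]byte("reply 11")) // ...and the guest replies before the ack
	vmm.AckArrived(1)                   // only now does the reply go out
}
```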
okay, so the 61:59 primary has to delay at this 62:02 point, waiting for the backup to say that 62:07 it's up to date. this is a real 62:09 performance thorn in the side of just 62:12 about every replication scheme, this sort 62:15 of synchronous wait, where we can't 62:18 let the primary get too far ahead of the 62:19 backup, because if the primary failed 62:22 while it was ahead, that would leave the 62:24 backup lagging behind what clients saw, 62:27 right? so just about every replication 62:30 system has this problem, that at some 62:31 point the primary has to stall waiting 62:34 for the backup, and it's a real limit on 62:36 performance. even if the machines are 62:38 side by side in adjacent racks, 62:40 it's still, you know, we're talking about 62:41 half a millisecond or something to 62:44 send messages back and forth with the 62:45 primary stalled, and if we want to 62:49 withstand earthquakes or citywide power 62:51 failures, you know, the primary and the 62:53 backup have to be in different cities, 62:54 that's probably five milliseconds apart. 62:56 if we put 62:59 the two replicas in 63:01 different cities, every packet of output 63:03 the primary produces has to first wait 63:05 the five milliseconds or whatever to 63:08 have the last log entry get to the 63:09 backup and have the acknowledgment come 63:11 back, and then we can release the 63:12 packet. and, you know, for sort of low- 63:15 intensity services that's not a problem, 63:18 but if we're building a, you know, 63:19 database server that, 63:21 if it weren't for this, 63:22 could process millions of requests per 63:25 second, then 63:25 that's just unbelievably damaging for 63:28 performance, and this is a big reason why 63:31 people, if they 63:34 possibly can, use a replication scheme 63:38 that's operating at a higher level and 63:39 kind of understands the semantics of 63:41 operations, so it doesn't have to 63:42 stall on every packet. you know, it could 63:45 stall on every high-level operation, or 63:47 even notice that, well, read-only 63:49 operations don't have to stall at all, 63:51 it's only writes that have to stall, or 63:52 something, but there has to 63:54 be an application-level replication 63:55 scheme to realize that. you're 64:04 absolutely right, so the observation is 64:06 that you don't have to stall the 64:07 execution of the primary, you only have 64:08 to hold the output, and so maybe that's 64:11 not as bad as it could be, but 64:13 nevertheless it means that, you 64:16 know, in a service that could otherwise 64:17 have responded in a couple of 64:19 microseconds to the client, you know, if 64:22 we have to first update the replicas in 64:24 the next city, we turn a 64:27 ten-microsecond interaction into a ten- 64:29 millisecond interaction. possibly, if you 64:36 have vast numbers of clients submitting 64:39 concurrent requests, then you may be 64:41 able to maintain high throughput even 64:43 with high latency, but you have to be 64:46 lucky, or a very clever designer, to get 64:49 that. 65:01 that's a great idea, but if you log in 65:04 the memory of the primary, that log will 65:06 disappear when the primary crashes; 65:08 the usual semantics of a server 65:10 failing is that you lose everything 65:13 inside the box, like the contents of 65:16 memory. and even if you didn't, 65:19 if the failure is that somebody 65:21 unplugged the power cable accidentally 65:23 from the primary, even if the primary 65:25 has battery-backed RAM or I 65:27 don't know what, you can't get at it, 65:30 the backup can't get at it. so 65:32 in fact this system does log the output, 65:36 and the place it logs it is in the 65:37 memory of the backup, and in order to 65:39 reliably log it there, you have to 65:42 observe the output rule and wait for the 65:43 acknowledgment. so it's an entirely correct 65:46 idea, you just can't use the primary's memory 65:48 for it. yes? 65:58 say it again? that's a clever idea. 66:06 so the question is, maybe input 66:08 should go to the primary, but output 66:11 should come from the backup. 66:12 I completely haven't thought this 66:14 through, that might work, 66:17 I don't know, that's interesting. 66:29 yeah, maybe I will.
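[An aside putting rough numbers on that throughput point, using Little's law: throughput is roughly the number of concurrent in-flight requests divided by per-request latency. With a 10-millisecond output-rule delay, one client doing one request at a time gets only about 100 requests per second; but if the service can keep, say, 1,000 independent client requests in flight at once, it can still sustain roughly 1000 / 0.010 = 100,000 requests per second despite the latency. These figures are invented for illustration, not taken from the paper.]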
okay, one possibility this does expose, 66:42 though, is the situation where 66:56 the primary crashes after its 66:58 output is released, so the client does 67:00 receive the reply, and then the primary 67:02 crashes. the backup's input is still in 67:07 this event buffer in the virtual machine 67:09 monitor of the backup; it hasn't been 67:13 delivered to the actual replicated 67:15 service. when the backup goes live after 67:18 the crash of the primary, the backup 67:22 first has to consume all of the 67:25 log records that are lying around that 67:27 it hasn't consumed yet; it has to catch up 67:29 to the primary, otherwise it won't take 67:30 over with the same state. so before the 67:33 backup can go live, it actually has to 67:34 consume all these entries, and the last entry 67:37 presumably is the request from the 67:41 client, so the backup will be live 67:45 after the interrupt that 67:49 delivers the request from the client, and 67:51 that means that the backup will, you know, 67:54 increment its counter to eleven and then 67:56 generate an output packet, and since it's 67:58 live at this point, it will send the 68:01 output packet, and the client will get 68:04 two 11 replies, which, if 68:10 that really happened, would be anomalous, 68:15 possibly not something that could 68:18 happen if there was only one server. the 68:22 good news is that almost certainly 68:25 the client is 68:27 talking to this service using TCP, 68:29 and the request and the 68:30 response go back and forth on a TCP 68:32 channel. when the backup takes over, 68:35 since its state is identical 68:37 to the primary's, it knows all about that 68:39 TCP connection and what all the 68:40 sequence numbers are and whatnot, and 68:43 when it generates this packet, it will 68:46 generate it with the same TCP sequence 68:49 number as the original packet, and the TCP 68:52 stack on the client will say, oh wait a 68:53 minute, that's a duplicate packet, 68:55 we'll discard the duplicate packet at 68:57 the TCP level, and the user-level 68:59 software will just never see this 69:00 duplicate. and so, you 69:04 know, you can view this as a kind of 69:09 accidental or clever trick, but the fact 69:11 is, for any replication system where 69:14 cutover can happen, which is to say 69:16 pretty much any replication system, it's 69:20 essentially impossible to design them in 69:22 a way that they are guaranteed not to 69:24 generate duplicate output. basically, you 69:29 can err on either side: you can 69:31 either not generate the 69:33 output at all, which would 69:36 be terrible, or you can generate 69:37 the output twice on a cutover. there's 69:41 basically no way to 69:42 guarantee it's generated only once, so everybody 69:44 errs on the side of possibly 69:46 generating duplicate output, and that 69:49 means that, at some level, you know, the 69:51 client side of all replication schemes 69:53 needs some sort of duplicate detection 69:55 scheme. here we get to use TCP's; if we 69:57 didn't have TCP, it would have to be 69:59 something else, maybe application-level 70:01 sequence numbers, or I don't know what. 70:03 and you'll see 70:06 versions of essentially 70:09 everything I've talked about, like the 70:10 output rule, for example, in labs 2 & 3; 70:14 you'll design your own replicated state 70:17 machine.
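[As a concrete illustration of that application-level fallback: here is a minimal sketch in Go of client-side duplicate suppression using per-request sequence numbers. This is not from the paper or the labs; every name is invented, and a real scheme would also need to garbage-collect old entries.]

```go
package dedup

// Reply is what the service sends back; it echoes the sequence number
// of the request it answers. Invented for this sketch.
type Reply struct {
	Seq   int
	Value int
}

// Client tracks which replies it has already seen.
type Client struct {
	nextSeq  int
	received map[int]bool // request seqs we've already seen replies for
}

func NewClient() *Client {
	return &Client{received: make(map[int]bool)}
}

// NextRequestSeq tags each outgoing request with a fresh number, so a
// replica that took over can be caught re-sending an old reply.
func (c *Client) NextRequestSeq() int {
	c.nextSeq++
	return c.nextSeq
}

// Deliver returns true the first time a reply for a given request
// arrives, and false for duplicates (e.g. the backup re-sending a reply
// the primary already sent before it crashed) -- the same suppression
// TCP's sequence numbers give us for free.
func (c *Client) Deliver(r Reply) bool {
	if c.received[r.Seq] {
		return false // duplicate: discard, just like TCP would
	}
	c.received[r.Seq] = true
	return true
}
```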
yes? 70:45 yes to the first part. so the scenario is: 70:48 the primary sends the reply, and then 70:51 either the primary sends the close 70:53 packet, or the client closes the 70:55 TCP connection after it receives the 70:57 primary's reply. so now there's no 70:58 connection on the client side, but there 71:00 is a connection on the backup side, and 71:02 so now the backup consumes 71:06 the very last log entry, the one that has the 71:07 input, and is now live. we're not 71:10 responsible for replicating anything at 71:12 this point, right, because the backup is now 71:14 live; there's no other replica, since the 71:16 primary died. so if the backup fails to execute in 71:23 lockstep with the primary, that's fine, 71:24 actually, because the primary is dead, 71:26 and we do not want to execute in 71:28 lockstep with it. okay, so the backup is 71:30 now live; it generates an output 71:33 on this TCP connection that isn't closed 71:37 yet from the backup's point of view. this 71:39 packet arrives at the client on a TCP 71:41 connection that doesn't exist anymore. 71:43 from the client's point of view, 71:45 no big whoop: the client is 71:46 just going to throw away the packet as 71:48 if nothing happened; the application 71:50 won't know. the client may send a reset, 71:52 some kind of TCP error packet or whatever, 71:54 back to the backup, and the backup 71:57 does something or other with it, but it 71:58 doesn't matter, because we're not 72:00 diverging from anything; there's 72:02 no primary to diverge from. it can just 72:04 handle a stray reset however it likes, 72:08 and what it'll in fact do is basically 72:10 ignore it. now that the backup has 72:14 gone live, we just don't owe 72:17 anybody anything as far as replication goes. 72:19 yeah? 72:36 well, you can bet, since the backup's 72:39 memory image is identical to the 72:40 primary's image, that they're sending 72:42 packets with the very same source, TCP 72:45 sequence numbers, and very same everything; 72:48 they're sending bit-for-bit identical 72:51 packets. you know, at this level the 73:00 servers don't have IP addresses; for 73:03 our purposes, the virtual machines, you 73:06 know, the primary and the backup virtual 73:08 machines, have IP addresses, but the 73:12 physical computer and the VMM are 73:15 transparent to the network. that's not 73:17 entirely true, but it's basically the 73:19 case that the virtual machine monitor and 73:21 the physical machine don't really have 73:23 an identity of their own on the network, 73:26 because you can configure them that 73:29 way. instead, the 73:31 virtual machine, with its own 73:33 operating system and its own TCP stack, 73:35 has an IP address, an Ethernet 73:36 address, and all this other stuff, which 73:37 is identical between the primary and the 73:39 backup, and when it sends a packet, it 73:41 sends it with the virtual machine's IP 73:42 address and Ethernet address, and those 73:44 bits, at least in my mental model, are 73:49 simply passed through onto the local 73:51 area network. it's exactly what we want, 73:54 and so I think it generates 73:55 exactly the same packets that the 73:57 primary would have generated. there's 73:59 maybe a little bit of trickery: 74:00 if these 74:03 are actually plugged into an Ethernet 74:04 switch, the two physical machines maybe 74:06 in two different ports of an 74:08 Ethernet switch, we'd like the 74:09 Ethernet switch to change its mind about 74:12 which of these two machines it 74:14 delivers packets to for the replicated 74:18 service's Ethernet address, and so there's 74:20 a little bit of funny business there, but for 74:23 the most part they're just generating 74:24 identical packets, so they can just send 74:26 them out. 74:29 okay, so another little detail I've been 74:33 glossing over is that I've been assuming 74:36 that the primary just fails or the 74:38 backup just fails, that is, fail-stop, 74:41 right? but that's not the only option. 74:43 another very common situation that has 74:46 to be dealt with is if the two machines 74:49 are still
up and running and executing, 74:51 but something funny happens on 74:53 the network that causes them not to be 74:56 able to talk to each other, but to still 74:58 be able to talk to some clients. if 75:01 that happened, if the primary and backup 75:03 couldn't talk to each other but they 75:05 could still talk to the clients, they 75:06 would both think, oh, the other replica's 75:07 dead, I better take over and go live, and 75:10 so now we have two machines going live 75:12 with this service, and now, you know, 75:14 they're no longer sending each other log 75:16 entries or anything, they're just 75:17 diverging: maybe they're accepting 75:19 different client inputs and changing their 75:21 state in different ways. so now we have 75:22 a split-brain disaster if we let the 75:24 primary and the backup go live when it 75:28 was the network that had some kind of 75:30 failure instead of these machines. 75:34 and the way that this paper solves it 75:36 is by appealing to an outside authority 75:41 to make the decision about which of the 75:44 primary and the backup is allowed to be 75:46 live. and so, it turns 75:53 out that their storage is actually not 75:54 on local disk, this almost doesn't matter, 75:56 but their storage is on some external 75:58 disk server, and, as well as being a disk 76:01 server, as a totally separate 76:03 service that has nothing to do with disks, 76:05 this server happens to export this 76:07 test-and-set service over 76:15 the network, where you can send a 76:17 test-and-set request to it, and there's 76:19 some flag it's keeping in memory, and 76:21 it'll set the flag and return what the 76:23 old value was. so both primary and backup 76:25 have to sort of acquire this test-and-set 76:28 flag, it's a little bit like a lock: 76:30 in order to go live, they both maybe 76:32 send test-and-set requests at the same 76:34 time to this test-and-set server, the 76:37 first one gets back a reply that says, oh, 76:39 the flag used to be zero, now it's one; for the 76:42 second request to arrive, the response 76:44 from the test-and-set server is, oh, 76:46 actually the flag was already one when 76:47 your request arrived, so basically 76:50 you're not allowed to be primary. and so 76:52 this test-and-set server, and we can 76:55 think of it as a single machine, is the 76:58 arbitrator that decides which of the two 77:00 should go live if they both think the 77:02 other one's dead due to a network 77:04 partition. any questions about this 77:08 mechanism? you're busted, yeah, if the test- 77:17 and-set server should be dead at the 77:19 critical moment. and so, actually, 77:22 even if there's not a network partition, 77:24 under all circumstances in which one or 77:27 the other of these wants to go live 77:28 because it thinks the other's dead, even 77:30 when the other one really is dead, the 77:32 one that wants to go live still has to 77:33 acquire the test-and-set lock, because 77:35 one of the deep rules of the 6.824 77:39 game is that you cannot tell whether 77:43 another computer is dead or not; all you 77:45 know is that you stopped receiving 77:47 packets from it, and you don't know 77:49 whether it's because the other computer 77:50 is dead, or because something has gone 77:53 wrong with the network between you and 77:55 the other computer. so all the backup 77:57 sees is, well, I've stopped seeing packets; maybe 77:59 the primary is dead, maybe it's alive; the 78:00 primary probably sees the same thing. so 78:03 if there's a network partition, they 78:04
certainly have to ask the test-and-set 78:05 server, but since they don't know if it's 78:07 a network partition, they have to ask the 78:08 test-and-set server regardless of whether 78:11 it's a partition or not. so anytime 78:13 either wants to go live, the test-and-set 78:15 server also has to be alive, because they 78:17 always have to acquire this test-and-set 78:19 lock. so the test-and-set server 78:22 sounds like a single point of failure: 78:24 we were trying to build a replicated, 78:26 fault-tolerant whatever thing, but in the 78:29 end, you know, we can't fail over 78:30 unless this is alive. so that's a bit of 78:35 a bummer. 78:36 I'm guessing, though, I'm making a strong 78:39 guess, that the test-and-set server is 78:41 actually itself a replicated service and 78:44 is fault tolerant. it's almost 78:46 certainly so; I mean, these people are 78:49 VMware, they're happy to sell you a 78:50 million-dollar highly available storage 78:53 system that 78:54 uses enormous amounts of replication 78:56 internally, and since the test-and-set 78:59 thing is on their disk server, I'm 79:00 guessing it's replicated too. and the 79:03 stuff you'll be doing in lab 2 and lab 3 79:05 is more than powerful enough for you to 79:07 build your own fault-tolerant test-and- 79:11 set server, so this problem can easily be 79:13 eliminated.
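[For reference, the core of such an arbiter is tiny. Here is a minimal single-machine sketch in Go of the test-and-set service described above. The paper doesn't show its implementation, so the names here are invented, and, as just discussed, a production version would itself be replicated, e.g. with the Raft-style machinery from the labs.]

```go
package arbiter

import "sync"

// TestAndSetServer is an invented illustration of the arbiter that the
// paper's shared disk server exports; it is not VMware's code. Note it
// is a single point of failure unless replicated.
type TestAndSetServer struct {
	mu   sync.Mutex
	flag bool
}

// TestAndSet atomically sets the flag and returns its previous value.
func (s *TestAndSetServer) TestAndSet() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.flag
	s.flag = true
	return old
}

// tryGoLive is what a replica calls when it suspects its peer is dead.
// Exactly one caller sees a previous value of false and wins the right
// to go live; the loser must not serve clients, which prevents split brain.
func tryGoLive(s *TestAndSetServer) bool {
	return !s.TestAndSet() // false previous value => we acquired the lock
}
```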