Today I want to do two things: finish the discussion of ZooKeeper, and then talk about CRAQ. The particular things I'm most interested in talking about with ZooKeeper are the design of its API, which allows ZooKeeper to be a general-purpose service that really bites off significant tasks that distributed systems need — so why is that a good API design — and then the more specific topic of mini-transactions, which turns out to be a worthwhile idea to know.

So, the API. Just to recall: ZooKeeper is built on a Raft-like replication layer, so we can think of it as being — and indeed it is — fault tolerant, and it does the right thing with respect to partitions. It has a performance enhancement by which reads can be processed at any replica, and therefore reads can be stale; we have to keep that in mind as we analyze various uses of the ZooKeeper interface. On the other hand, ZooKeeper does guarantee that every replica processes the stream of writes in order, one at a time, with all replicas executing the writes in the same order, so that the replicas' states advance in exactly the same way. It also guarantees that all of the operations, reads and writes, generated by a single client are processed by the system in order — the order the client issued them in — and that successive operations from a given client always see the same point or a later point in the write stream than the previous operation from that client.

Before I dive into what the API looks like and why it's useful, it's worth thinking about what kinds of problems ZooKeeper is aiming to solve, or could be expected to solve. For me, a totally central motivating example of why you would want to use ZooKeeper is as an implementation of the test-and-set service that VMware FT required in order for either server to take over when the other one failed. It was a bit of a mystery in the VMware paper what that test-and-set service is: how is it built, is it fault tolerant, does it itself tolerate partitions? ZooKeeper actually gives us the tools to write a fault-tolerant test-and-set service of exactly the kind VMware FT needed — one that is fault tolerant and does do the right thing under partitions. That's a central kind of thing ZooKeeper is doing.

There are also a bunch of other ways people turn out to use it; ZooKeeper was very successful, and people use it for a lot of stuff. One thing people use it for is just to publish configuration information for other servers to use — for example, the IP address of the current master for some set of workers. Another classic use of ZooKeeper is to elect a master: when the old master fails, we need everyone to agree on who the new master is, and to elect only one master even if there are partitions. You can elect a master using ZooKeeper primitives.
And — for small amounts of state anyway — if whatever master you elect needs to keep some state and keep it up to date, like information about who the primary is for a given chunk of data, as you'd want in GFS, the master can store that state in ZooKeeper. It knows ZooKeeper is not going to lose it: if the master crashes and we elect a new master to replace it, the new master can read the old master's state right out of ZooKeeper and rely on it actually being there.

Other things you might imagine: in MapReduce-like systems, workers could register themselves by creating little files in ZooKeeper. And again with systems like MapReduce, you can imagine the master telling the workers what to do by writing things in ZooKeeper — writing lists of work items — and then workers take those work items one by one out of ZooKeeper and delete them as they complete them. People use ZooKeeper for all of these things.

(Student question.) So the question is how people use ZooKeeper, and in general, yes: if you're running some big data center and you run all kinds of stuff in it — web servers, storage systems, MapReduce, who knows what — you might fire up one ZooKeeper cluster, because this general-purpose service can be used for lots of things. So you run five or seven ZooKeeper replicas, and then as you deploy various services you design them to store some of their critical state in your one ZooKeeper cluster.

All right, the API. ZooKeeper looks like a file system at some level. It's got a directory hierarchy: there's a root directory, and then maybe each application has its own subdirectory — application one keeps its files in this directory, app two keeps its files in that directory — and those directories have files and directories underneath them. One reason for this is just that ZooKeeper, as I just mentioned, is designed to be shared between many possibly unrelated activities, so we need a naming system to keep the information from those activities distinct, so they don't get confused and read each other's data by mistake. Within each application, it turns out that a lot of convenient ways of using ZooKeeper involve creating multiple files; we'll see a couple of examples of this in a few minutes.

So it looks like a file system, but that's not very deep: you can't really use it like a file system in the sense of mounting it and running ls and cat and all those things. It's just that internally it names objects with path names — here are a few different files, x, y, and z — and when you talk to ZooKeeper you send it an RPC saying "please read this data", naming the data you want, maybe /app2/x. It's just a hierarchical naming scheme.

These files and directories are called znodes, and it turns out there are three kinds you have to know about, which help ZooKeeper solve various problems for us.
There are regular znodes: if you create one, it's permanent until you delete it. There are ephemeral znodes: if a client creates an ephemeral znode, ZooKeeper will delete that znode if it believes the client has died. Ephemeral znodes are actually tied to client sessions, so clients have to send a little heartbeat into ZooKeeper every once in a while saying "I'm still alive, I'm still alive" so that ZooKeeper won't delete their ephemeral files. And the last characteristic a file may have is sequential: when you ask to create a file with a given name, what you actually end up creating is a file with that name but with a number appended to it. ZooKeeper guarantees never to repeat a number if multiple clients try to create sequential files at the same time, and always to use monotonically increasing sequence numbers in the names it appends them to. We'll see all of these things come up in examples.

At one level, the RPC interface that ZooKeeper exposes is sort of what you might expect for files. There's a create RPC, where you give it a name — really a full path name — some initial data, and some combination of these flags. An interesting part of create's semantics is that it's exclusive: when I send a create into ZooKeeper asking it to create a file, ZooKeeper responds with a yes or no. If that file didn't exist and I'm the first client who wants to create it, ZooKeeper says yes and creates the file; if the file already exists, ZooKeeper says no and returns an error. So creates are exclusive, and if multiple clients are trying to create the same file — which we'll see in the locking examples — the clients know which one of them actually managed to create it.

There's also delete. One thing I didn't mention is that every znode has a version number that advances as it's modified, and with delete, along with some other update operations, you can send in a version number saying "only do this operation if the file's current version number is the version I specified". That turns out to be helpful in situations where multiple clients might be trying to do the same operation at the same time — you can pass a version saying "only delete if the version still matches".

There's an exists call: does this path-named znode exist? An interesting extra argument is that you can ask to watch for changes to whatever path name you specified. You ask "does this path name exist?", and whether or not it exists now, if you pass in true for the watch flag, ZooKeeper guarantees to notify the client if anything changes about that path name — it's created, or deleted, or modified. Furthermore, the check for whether the file exists and the setting of the watch information inside ZooKeeper are atomic: nothing can happen between the point in the write stream at which ZooKeeper looks to see whether the path exists and the point in the write stream at which ZooKeeper inserts the watch into its table. That turns out to be very important for correctness.
There's also getData, which takes a path and again a watch flag, and now the watch applies to the contents of that file. And there's setData: again a path, the new data, and the conditional version — if you pass in a version, ZooKeeper only actually does the write if the current version number of the file is equal to the number you passed in.
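Before the examples, here's a minimal sketch of what these operations might look like as a Go interface. This is hypothetical shorthand for the calls just described — the names, types, and flag values are mine, not the real ZooKeeper client library's — but it captures the pieces the examples below lean on: exclusive create, ephemeral and sequential flags, versions, and watches.

```go
package zksketch

// Flags a client can pass to Create, mirroring the znode kinds described
// above (hypothetical names and values).
type Flag int32

const (
	FlagEphemeral  Flag = 1 // znode is deleted when the creating client's session ends
	FlagSequential Flag = 2 // ZooKeeper appends a monotonically increasing number to the name
)

// Client is a hypothetical, simplified view of the RPC interface described
// in the lecture; watches are surfaced as channels that fire once.
type Client interface {
	// Create is exclusive: it fails if the path already exists.
	Create(path string, data []byte, flags Flag) (createdPath string, err error)

	// Delete succeeds only if version matches the znode's current version
	// (a sentinel such as -1 could mean "unconditional").
	Delete(path string, version int32) error

	// Exists reports whether path exists; with watch=true, the returned
	// channel receives one event when the path is created, deleted, or modified.
	Exists(path string, watch bool) (exists bool, events <-chan struct{}, err error)

	// GetData returns the contents and current version; the optional watch
	// fires on changes to the contents.
	GetData(path string, watch bool) (data []byte, version int32, events <-chan struct{}, err error)

	// SetData writes only if version matches the znode's current version.
	SetData(path string, data []byte, version int32) error

	// List returns the names of the children of a directory-like znode
	// (this call shows up later, in the scalable lock example).
	List(path string) ([]string, error)
}
```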
OK, so let's see how we use this. The first, very simple example: suppose we have a file in ZooKeeper, we want to store a number in that file, and we want to be able to increment that number. Maybe we're keeping a statistics count, and whenever a client gets a request from a web user or something, it increments that count in ZooKeeper. More than one client may want to increment the count — that's the critical thing.

One thing to get out of the way first is whether we actually need some specialized interface in order to support client coordination, as opposed to just data. This looks like a file system — could we just provide the ordinary read/write kind of interface that typical storage systems provide? For example, some of you have started (and you'll all start soon) Lab 3, in which you build a key/value store where the only operations are Put(key, value) and Get(key). So one question is: can we do all the things we might want to do with ZooKeeper just with Lab 3's put/get interface? Suppose I want to implement this counter with Lab 3's key/value interface. I might increment the count by saying x = Get(k), for whatever key we're using, and then Put(k, x+1).

Why is this a bad answer? Yes — it's not atomic. That is absolutely the root of the problem; that's the abstract way of putting it. One way of looking at it is that if two clients both want to increment the counter at the same time, they're both going to use Get to read the old value — say 10 — they're both going to add one and get 11, and they're both going to call Put with 11. Now we've increased the counter by one, but two clients were incrementing it, so surely we should have ended up increasing it by two. That's why Lab 3 can't be used for even this simple example. Furthermore, in the ZooKeeper world, gets can return stale data. That's not true of Lab 3 — Lab 3's Gets are not allowed to return stale data — but in ZooKeeper reads can be stale, so if you read a stale version of the counter and add one to it, you're now writing the wrong value: if the real value is 11 but your Get returns a stale value of 10, you add 1 and Put 11, and that's a mistake, because we really should have been putting 12. So ZooKeeper has this additional problem we have to worry about: gets don't return the latest data.

OK, so how would you do this in ZooKeeper? Here's how I would do it. It turns out you need to wrap this code sequence in a loop, because it's not guaranteed to succeed the first time. So we say: while true, call getData to get the current value of the counter and the current version — x, v = getData(filename); I don't care what the file name is. Now we have a value and a version number, possibly not fresh, possibly stale, but maybe fresh. Then we use a conditional put — a conditional setData(filename, x+1, v) — and if setData returns true, meaning it actually did set the value, we break; otherwise we go back to the top of the loop.

What's going on here is that we read some value and some version number, maybe stale, maybe fresh, out of a replica. The setData we send actually goes to the ZooKeeper leader, because all writes go to the leader, and what it means is: only set the value to x+1 if the real version — the latest version — is still v. So if we read fresh data, and nothing else is going on in the system — no other clients are trying to increment this — then we'll read the latest value and version, add one to the latest value, specify the latest version, our setData will be accepted by the leader, we'll get back a positive reply after it's committed, and we'll break because we're done. If we got stale data here, or this was fresh data but by the time our setData got to the leader some other client trying to increment got its setData there before us, our version number will no longer be current. In either of those cases the setData will fail, we'll get an error response back, we won't break out of the loop, and we'll go back and try again, and hopefully we'll succeed this time.
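Here's that loop as a minimal Go sketch, written against a small slice of the hypothetical interface above (redeclared so this block stands alone); the error value and method shapes are assumptions, not the real client library's.

```go
package zksketch

import (
	"errors"
	"strconv"
)

// ErrBadVersion is the hypothetical error a conditional SetData returns
// when the znode's version no longer matches the one we supplied.
var ErrBadVersion = errors.New("zk: version mismatch")

// counterClient is the small slice of the hypothetical client this sketch needs.
type counterClient interface {
	GetData(path string, watch bool) (data []byte, version int32, events <-chan struct{}, err error)
	SetData(path string, data []byte, version int32) error
}

// IncrementCounter is the mini-transaction from the lecture: read a possibly
// stale value and version from whatever replica we're talking to, then ask
// the leader to write value+1 only if the version is still current, retrying
// until the conditional write succeeds.
func IncrementCounter(zk counterClient, path string) error {
	for {
		data, version, _, err := zk.GetData(path, false)
		if err != nil {
			return err
		}
		n, err := strconv.Atoi(string(data))
		if err != nil {
			return err
		}
		// Conditional write: only succeeds if nobody modified the znode
		// since the version we read.
		err = zk.SetData(path, []byte(strconv.Itoa(n+1)), version)
		if err == nil {
			return nil // our read-increment-write took effect atomically
		}
		if !errors.Is(err, ErrBadVersion) {
			return err // some unrelated failure; give up
		}
		// Stale read, or a concurrent increment beat us to the leader: retry.
	}
}
```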
(Student question.) So the question is: this is a while loop — are we guaranteed it will ever finish? No, we're not really guaranteed to finish. In practice, for example, if the replica we're reading from is cut off from the leader and permanently gives us stale data, then maybe this isn't going to work out. But in real life, the leader is pushing all the replicas toward having data identical to its own, so if we just got stale data here, probably when we go back — and maybe we should sleep for ten milliseconds or something at this point — eventually we're going to see the latest data.

The situation in which this might genuinely be pretty bad news is if there's a very high continuous load of increments from clients. If we have a thousand clients all trying to do increments, the risk is that maybe none of them succeeds — well, I think one of them will succeed, because the first one that gets its setData into the leader will succeed and the rest will all fail because their version numbers are all too low; then the next 999 will send their gets and setDatas in and one of them will succeed; and so on. So it has a sort of N-squared complexity to get through all of the clients, which is very damaging, but it will finish eventually. If you thought you were going to have a lot of clients you would use a different strategy here; this is good for low-load situations.

(Student question about storing a lot of data.) If it fits in memory, it's no problem; if it doesn't fit in memory, it's a disaster. When you're using ZooKeeper you have to keep in mind that it's great for 100 megabytes of stuff and probably terrible for 100 gigabytes of stuff. That's why people think of it as storing configuration information rather than the real data of your big website.

(Student question: could you use a watch in this sequence?) Yes, that could be. If we wanted to fix this to work under high load, you would certainly want to sleep at this point. The way I would fix this — my instinct — would be to insert a sleep here, and furthermore make it a randomized sleep whose span of randomness doubles each time we fail. That's a tried-and-true strategy: exponential backoff. It's actually similar to Raft leader election, and it's a reasonable strategy for adapting to an unknown number of concurrent clients.
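A tiny sketch of that randomized, doubling backoff, as one might bolt it onto the bottom of the retry loop above; the constants are arbitrary choices of mine, not something the lecture or the paper prescribes.

```go
package zksketch

import (
	"math/rand"
	"time"
)

// backoff sleeps for a random duration whose upper bound roughly doubles
// with each failed attempt, capped so retries never wait too long.
// One might call backoff(attempt) each time the conditional SetData fails.
func backoff(attempt int) {
	const base = 10 * time.Millisecond
	const limit = 2 * time.Second

	upper := base << attempt // 10ms, 20ms, 40ms, ...
	if upper <= 0 || upper > limit {
		upper = limit // also guards against shift overflow
	}
	time.Sleep(time.Duration(rand.Int63n(int64(upper))))
}
```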
OK, so tell me what's right — we do the getData with watch set to true? Yes: if somebody else modifies the data before you call setData, maybe you'll get a watch notification. The problem is that the timing doesn't work in your favor. The amount of time between when I receive the data and when I send off the setData to the leader is essentially zero — roughly no time passes there — while if some other client sent in an increment at about this time, it's actually quite a long time before that increment works its way through the leader, is sent out to the followers, is executed by the followers, and the followers look it up in their watch tables and send me a notification. And if you're going to read at a point that's after where the modification occurred — a modification that should fire the watch — you'll get the watch notification before you get the read response. But in any case, I think nothing like this could save us, because what's going to happen is that all thousand clients do the same thing, whatever it is: they all do a get and set a watch, and they don't get a notification, because none of them has done the write yet. So the worst case is that all the clients start at the same point: they all do a get, they all get version one, they all set a watch, they don't get a notification because no change has occurred, and they all send a setData RPC to the leader — all thousand of them. The first one changes the data, and now the other 999 get a notification when it's too late, because they've already sent their setData. So it's possible that a watch could help us here, but not the straightforward version of watch. We'll talk about this in a few minutes, but the second locking example — the herd-free one — absolutely solves this kind of problem, so we could adapt the second locking example from the paper to cause the increments to happen one at a time if there's a huge number of clients that want to do it. Other questions about this example?

OK, this is an example of what many people call a mini-transaction. It's transactional in the sense that — well, there's a lot of funny stuff happening here — the effect is that once it all succeeds, we have achieved an atomic read-modify-write of the counter. The difficulty is that the read and the modify and the write are not themselves atomic. The thing we have pulled off is that this sequence, once it finishes, is atomic: on the pass through the loop that succeeded, we managed to read, increment, and write without anything else intervening — we did those steps atomically. This isn't a full database transaction: real databases allow fully general transactions, where you can say "begin transaction", then read or write anything you like — maybe thousands of different data items, who knows what — then say "end transaction", and the database will cleverly commit the whole thing as one atomic transaction. Real transactions can be very complicated. ZooKeeper supports this extremely simplified version — atomic operations on one piece of data — but it's enough to get increment and some other things. For that reason — they're not general, but they do provide atomicity — these are often called mini-transactions.

It turns out this pattern can be made to work for various other things too. If we wanted the test-and-set that VMware FT requires, it can be implemented with very much this setup: read the old value; if it's zero, try to set it to one, but pass the version number. If nobody else intervened — if the version number hadn't changed when the leader got our request — then we were the one who actually managed to set it to one, and we win. If somebody else changed it to one after we read it, the leader will tell us that we lost. So you can do test-and-set with this pattern also, and you should remember this strategy.
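Here's a sketch of that test-and-set in Go, using the same hypothetical interface and the same mini-transaction pattern; the znode contents ("0"/"1") and the error value are my assumptions, not VMware FT's or ZooKeeper's actual format.

```go
package zksketch

import "errors"

// Hypothetical sentinel for a conditional write that lost the race.
var errVersionMismatch = errors.New("zk: version mismatch")

// tasClient is the slice of the hypothetical client used by this sketch.
type tasClient interface {
	GetData(path string, watch bool) (data []byte, version int32, events <-chan struct{}, err error)
	SetData(path string, data []byte, version int32) error
}

// TestAndSet tries to flip the flag stored at path from "0" to "1", the way
// a VMware-FT-style backup might claim the right to go live. It returns true
// if this caller is the one who flipped the flag, false if it was already
// set or another client flipped it first.
func TestAndSet(zk tasClient, path string) (bool, error) {
	for {
		data, version, _, err := zk.GetData(path, false)
		if err != nil {
			return false, err
		}
		if string(data) != "0" {
			return false, nil // already set: we lose
		}
		// Conditional write: only succeeds if nobody wrote since our read.
		err = zk.SetData(path, []byte("1"), version)
		if err == nil {
			return true, nil // the version hadn't changed: we win
		}
		if !errors.Is(err, errVersionMismatch) {
			return false, err
		}
		// Someone intervened between our read and our write: re-check.
	}
}
```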
OK, the next example I want to talk about is these locks. I'm talking about this because it's in the paper, not because I strongly believe this kind of lock is useful. They have an example in which acquire has a couple of steps. Step one: there's a lock file, and we try to create the lock file — some agreed-on name — with ephemeral set to true. If the create succeeds, we've acquired the lock. If it doesn't succeed, that means the lock file already exists and somebody else has acquired the lock, so we want to wait for them to release it, and they're going to release the lock by deleting this file. So step two: we call exists with watch set to true. If the file still exists — which we expect, because if it didn't exist, presumably the create would have succeeded and we'd have returned already — then step three: we wait for the watch notification. And step four: go to step one. So the usual deal is: we call create, and maybe we win; if it fails, we wait for whoever owns the lock to release it; we get the watch notification when the file is deleted; at that point the wait finishes, we go back to step one and try to create the file again, and hopefully we get it this time.
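A Go sketch of that acquire loop, against another hypothetical slice of a client; the lock is just an agreed-on ephemeral znode, and release is deleting it (or crashing, since the znode is ephemeral).

```go
package zksketch

// simpleLockClient is a slice of a hypothetical client; CreateEphemeral is
// exclusive, and the znode vanishes if our session dies.
type simpleLockClient interface {
	CreateEphemeral(path string, data []byte) error
	Exists(path string, watch bool) (exists bool, events <-chan struct{}, err error)
	Delete(path string) error
}

// AcquireSimpleLock is the paper's first (herd-prone) lock:
//  1. try to create the ephemeral lock file; success means we hold the lock
//  2. otherwise call exists with a watch
//  3. if the file still exists, wait for the watch notification
//  4. go back to step 1
func AcquireSimpleLock(zk simpleLockClient, path string) error {
	for {
		// Step 1. (A real client would distinguish "already exists" from
		// other errors; this sketch treats any error as "lock is held".)
		if err := zk.CreateEphemeral(path, nil); err == nil {
			return nil
		}
		// Step 2: someone else holds the lock; watch for the file to change.
		exists, events, err := zk.Exists(path, true)
		if err != nil {
			return err
		}
		if !exists {
			continue // released between our create and exists: retry now
		}
		<-events // step 3: wait for the delete, then step 4: retry the create
	}
}

// ReleaseSimpleLock deletes the lock file; if the holder crashes instead,
// ZooKeeper deletes the ephemeral znode on its behalf.
func ReleaseSimpleLock(zk simpleLockClient, path string) error {
	return zk.Delete(path)
}
```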
Now we should ask ourselves questions about possible interleavings of other clients' activities with our four steps. One thing we know for sure already: if another client calls create at the same time, the ZooKeeper leader is going to process those two create RPCs one at a time, in some order. Either my create is executed first or the other client's create is. If mine executes first, I get true back and acquire the lock, and the other client is guaranteed to get false; if theirs is processed first, they get true and I'm guaranteed to get false. In either case the file gets created, so we're okay with simultaneous executions of step one.

Another question: if create doesn't succeed for me and I'm about to call exists, what happens if the lock is released between the create and the exists? That's the reason I have an "if" around the exists: the lock might actually be released before I call exists, because it could have been acquired quite a long time ago by some other client. If the file doesn't exist at that point, the exists check fails and I just go directly back to step one and try again.

Similarly — and actually more interesting — what happens if whoever holds the lock releases it just as I call exists, or while the replica I'm talking to is in the middle of processing my exists request? The answer is that whatever replica I'm talking to, its log guarantees that writes occur in some order; the replica proceeds through that log, and my exists call — a read-only request — is guaranteed to be executed at a definite point between two entries in the write stream. The issue is that somebody's delete request is being processed at about this time, so somewhere in the log of the replica I'm talking to is going to be the delete request from the other client. My exists RPC is either processed completely before that delete — in which case the replica sees that the file still exists, inserts the watch information into its watch table at that point, and only then executes the delete, so when the delete comes in we're guaranteed my watch is in the replica's watch table and it will send me a notification — or my exists request is executed at a point after the delete happened, the file doesn't exist, and the call returns false; a watch table entry is still entered, but we don't care. So it's quite important that the writes are sequenced and that reads happen at definite points between writes.

(Student question.) Yes — that's the case where the exists is executed after the delete: the file doesn't exist at that point, exists returns false, we don't wait, we go to step one, we create the file, and we return. We did install a watch, and that watch will be triggered by our own create — it doesn't really matter, because we're not waiting for it. In the other version of that case: the file doesn't exist, we go to step one, but somebody else has created the file in the meantime; we try to create the file, that fails, and we install another watch, while the earlier watch is one we're no longer waiting for. It doesn't really matter in the moment — it's not harmful to come out of a wait early, it's just wasteful. Anyway, with all this history, this code leaves watches sitting around in the system, and I don't actually know whether my new watch on the same file overrides my old watch — I'm not actually sure.

OK, finally: this example and the previous example suffer from the herd effect. The herd effect is what we were talking about when we were worrying that if a thousand clients all try to increment at the same time, it's going to have N-squared complexity in how long it takes to get through all thousand clients. This lock scheme also suffers from the herd effect: if there are a thousand clients trying to get the lock, the amount of time required to grant the lock to each of the thousand clients is proportional to a thousand squared, because after every release, all of the remaining clients get triggered by the watch, all of them go back up and send in a create, and so the total number of create RPCs generated is basically a thousand squared. So this suffers from the herd: the whole herd of waiting clients is beating on ZooKeeper. Another name for this is that it's a non-scalable lock. This is a real problem, and we'll see it in other systems soon enough — a serious kind of problem. The paper actually talks about how to solve it using ZooKeeper, and the interesting thing is that ZooKeeper is actually expressive enough to build a more complex lock scheme that doesn't suffer from this herd effect.
Even if a thousand clients are waiting, the cost of one client giving up the lock and another acquiring it is order 1 instead of order N. Because it's a little bit complex, this is the pseudocode in the paper, in section 2.4 — it's on page 6 if you want to follow along.

This time there is not a single lock file. (Student question.) Yes — the lock name is just a name that allows us all to talk about the same lock. Once I've acquired the lock I can do whatever the lock was protecting: maybe only one of us at a time should be allowed to give a lecture in this lecture hall, and if you want to give a lecture here you first have to acquire the lock called 34-100. That name — yes, it's a znode in ZooKeeper, but nobody cares about its contents; we just need to be able to agree on a name for the lock. That's the sense in which this API looks like a file system but is really a naming system.

All right. Step one: we create a sequential file. We give it a prefix name, but what actually gets created — if this is the 27th sequential file created with prefix "f" — is maybe "f27" or something. In the sequence of writes that ZooKeeper is working through, successive creates get ascending — guaranteed ascending, never descending — sequence numbers when you create a sequential file.

There was an operation I left off the earlier list: you can get a list of files. You give the name of a znode that's actually a directory with files in it, and you get a list of all the files currently in that directory. So step two: we list the files, say list f*, and get some list back. We created a file, and the system allocated us a number; we can look at that number, and — step three — if there's no lower-numbered file in the list, we win, we've acquired the lock, and we can return. If there is a lower-numbered file, then what's going on is that these sequentially numbered files are setting up the order in which the lock is going to be granted to the different clients. If we're not the winner, what we need to do is wait for the client who created the previous-numbered file to acquire and then release the lock. The convention for releasing the lock in this system is to remove your sequential file, so we want to wait for the previous-numbered sequential file to be deleted, and then it's our turn and we get the lock. So step four: we call exists on the next-lower-numbered file, with watch set to true — mostly to set a watch point. Step five: if that file still exists, we wait for the watch notification. And step six: we go back — not to creating the file, because our file already exists — we go back to step two, listing the files. Release is just delete: if I acquired the lock, when I'm done I delete the file I created, complete with my number.
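Here's a Go sketch of that acquire/release protocol, again against a hypothetical client (the "f27"-style name format and the method shapes are my assumptions); the key property is that each waiter watches only its immediate predecessor.

```go
package zksketch

import (
	"strconv"
	"strings"
)

// seqLockClient is a slice of a hypothetical client. CreateSeqEphemeral
// creates an ephemeral znode named prefix plus a monotonically increasing
// sequence number and returns the child name it chose (e.g. "f27").
type seqLockClient interface {
	CreateSeqEphemeral(dir, prefix string) (name string, err error)
	List(dir string) ([]string, error)
	Exists(path string, watch bool) (exists bool, events <-chan struct{}, err error)
	Delete(path string) error
}

// seqNum extracts the numeric suffix of a name like "f27" (assumed format).
func seqNum(name, prefix string) int {
	n, _ := strconv.Atoi(strings.TrimPrefix(name, prefix))
	return n
}

// AcquireSeqLock is the herd-free lock of section 2.4: create a sequential
// ephemeral file, then wait only on the next-lower-numbered file.
func AcquireSeqLock(zk seqLockClient, dir, prefix string) (myFile string, err error) {
	myFile, err = zk.CreateSeqEphemeral(dir, prefix) // step 1
	if err != nil {
		return "", err
	}
	mine := seqNum(myFile, prefix)
	for {
		names, err := zk.List(dir) // step 2 (and step 6 after waking up)
		if err != nil {
			return "", err
		}
		// Find our immediate predecessor: the largest number below ours.
		prev, prevNum := "", -1
		for _, n := range names {
			if !strings.HasPrefix(n, prefix) {
				continue
			}
			if num := seqNum(n, prefix); num < mine && num > prevNum {
				prev, prevNum = n, num
			}
		}
		if prev == "" {
			return myFile, nil // step 3: no lower-numbered file, lock acquired
		}
		// Steps 4-5: watch only our predecessor; its file disappears either
		// when that client releases the lock or when it dies (ephemeral).
		exists, events, err := zk.Exists(dir+"/"+prev, true)
		if err != nil {
			return "", err
		}
		if exists {
			<-events
		}
		// Back to step 2: our predecessor may have died without ever holding
		// the lock, so we must re-list rather than assume it's our turn.
	}
}

// ReleaseSeqLock releases the lock by deleting our own sequential file.
func ReleaseSeqLock(zk seqLockClient, dir, myFile string) error {
	return zk.Delete(dir + "/" + myFile)
}
```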
(Student question.) Why do you need to list the files again? That's a good question. We got the list of files, we know the next-lower-numbered file, and one guarantee of sequential file creation is that once file 27 is created, no file with a lower number will ever subsequently be created, so nothing else can sneak in below us. So why list again — why not just go back to waiting for that same lower-numbered file? Can anyone guess the answer? The way this code works, the answer is that whoever was next-lower might have either acquired and released the lock before we noticed, or have died. These are ephemeral files: even if we're 27th in line, number 26 may have died before getting the lock, and if number 26 dies, the system automatically deletes its ephemeral file. If that happened, we now need to wait for number 25 — the next one down — and not for 26, which has gone away. That's why we have to go back and re-list the files: in case our predecessor in the list of waiting clients turned out to die. (Student question.) Yes — if there's no lower-numbered file, then you have acquired the lock, absolutely.

(Student question.) How does this not suffer from the herd effect? Suppose we have a thousand clients waiting, the lock has made its way through the first five hundred of them, and client 500 currently holds the lock. Every waiting client is sitting there waiting for an event, but only the client that created file 501 is waiting for the deletion of file 500. Everybody is waiting for the next lower number — 500 was waiting for 499, 499 for 498 — so everybody is waiting for just one file. When I release the lock, there's only one other client — the next-higher-numbered one — waiting for my file, so one client gets a notification, one client goes back and lists the files, and that one client now has the lock. So no matter how many clients there are, the expense of each release-and-acquire is a constant number of RPCs, whereas the expense of a release-and-acquire in the earlier scheme is that every single waiting client is notified and every single one of them sends a write request — the create — into ZooKeeper.

(Student question: what does the client actually do while it waits?) Oh, you're free to go get a cup of coffee.
What the programming interface looks like is not really our business here, but there are two options for what this actually means in the program. One is that there's some thread in a synchronous wait: it made a function call saying "please acquire this lock", and the call doesn't return until the lock is finally acquired or the notification comes back. A much more sophisticated interface would be one in which you fire off requests to ZooKeeper and don't wait, and then separately there's some way of seeing whether ZooKeeper has said anything recently — maybe a goroutine whose whole job is to wait for the next thing from ZooKeeper, in the same sense that you might read the apply channel and all kinds of interesting stuff comes up on the apply channel. That's the more likely way to structure this. But either way — through threading or some sort of event-driven thing — you can do something else while you're waiting.

(Student question.) Yes — or if the client before me has neither died nor released: if the file before me exists, that means either that client is still alive and still waiting for the lock, or still alive and holds the lock; we don't really know which. (Follow-up.) It does, as long as that client — client 500 — is still alive. If the exists call fails, that means one of two things: either my predecessor held the lock and has released it and deleted its file, or my predecessor never held the lock, it exited, and ZooKeeper deleted its file because it was ephemeral. So there are two reasons to come out of this wait, or for the exists to return false, and that's why we check everything — you really don't know what the situation is after the exists completes. (Another comment.) That might — yeah, maybe that could be made to work, that sounds reasonable, and it preserves the scalable nature of this, in that each acquire and release only involves a few clients — two clients.

All right. I actually first saw this pattern in a totally different context: scalable locks for threading systems. For most of the world this is called a scalable lock. I find it one of the more interesting constructions I've ever seen, and I'm impressed that ZooKeeper is able to express it; it's a valuable construct. Having said that, I'm a little bit at sea about why the paper talks about locks at all, because these locks are not like threading locks in Go. In threading there's no notion of threads failing — at least if you don't want there to be, there's no notion of threads just randomly dying in Go — and so really the only thing you're getting out of a mutex in Go, if everybody uses mutexes correctly, is atomicity for the sequence of operations inside the mutex. If you take out a lock in Go, do 47 different reads and writes of a lot of variables, and then release the lock, and everybody follows that locking strategy, nobody is ever going to see some weird intermediate version of the data as of halfway through your update — it just makes things atomic, no argument. These locks aren't really like that, because if the client that holds the lock fails, the lock just gets released and somebody else can pick it up.
So this does not guarantee atomicity, because you can get partial failures in distributed systems in a way you don't really get partial failures in ordinary threaded code. If the current lock holder had the lock, needed to update a whole bunch of things protected by that lock before releasing it, got halfway through updating them, and then crashed, the lock will get released, you'll get the lock, and yet when you go to look at the data it's garbage — it's just whatever random state it was in the middle of updating. So these locks don't by themselves provide the same atomicity guarantee that threading locks do, and we're sort of left to imagine for ourselves why you would want to use them, and why this is one of the main examples in the paper.

I think if you use locks like this in a distributed system, you have two general options. One is that everybody who acquires the lock has to be prepared to clean up from some previous disaster. You acquire the lock, you look at the data, and you try to figure out: if the previous owner of the lock crashed, how can I tell while I'm looking at the data, and what do I do to fix the data up? You can play that game — especially if the convention is that you always update in a particular sequence, you may be able to detect where in that sequence the previous holder crashed, assuming it crashed — but it's a tricky game that requires thought of a kind you don't need for thread locking.

The other way these locks might make sense is as soft locks protecting something that doesn't really matter. For example, if you're running MapReduce jobs, with map tasks and reduce tasks, you could use this kind of lock to make sure only one worker executes each task: a worker that's going to run task 37 gets the lock for task 37, executes it, marks it as executed, and releases the lock. The way MapReduce works, it's actually proof against crashed workers anyway, so if you grab the lock and crash halfway through your MapReduce task — so what? Your lock is released when you crash, the next worker that gets the lock sees you didn't finish the task and just re-executes it, and that's just not a problem because of the way MapReduce is defined. So you could use these locks for that kind of soft-lock thing.

And maybe the other thing we should be thinking about is that some version of this could be used to do things like elect a master. If what we're really doing here is electing a master, we could use code much like this, and that would probably be a reasonable approach.
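As a sketch of that last idea, here's one way an election built from the same pieces might look in Go: whoever manages to create an agreed-on ephemeral znode is the master, and everyone else reads and watches it, retrying when the current master's session dies. This is my own illustration on the hypothetical client, not code from the paper, and the staleness caveats from the read discussion earlier still apply.

```go
package zksketch

// electClient is a slice of a hypothetical client used by this sketch.
type electClient interface {
	CreateEphemeral(path string, data []byte) error // exclusive; znode dies with our session
	GetData(path string, watch bool) (data []byte, events <-chan struct{}, err error)
}

// RunElection loops forever: whoever creates the agreed-on ephemeral znode
// is the master; everyone else reads the winner's address from the znode and
// watches it, re-running the election when the znode changes or disappears
// (for example because the master's session died).
func RunElection(zk electClient, path, myAddr string, becameMaster func(), newMaster func(addr string)) {
	for {
		if err := zk.CreateEphemeral(path, []byte(myAddr)); err == nil {
			becameMaster() // we won; act as master until we crash or resign
			return
		}
		// Someone else is master: learn its address and watch for a change.
		data, events, err := zk.GetData(path, true)
		if err != nil {
			continue // e.g. the master died between our create and our read
		}
		newMaster(string(data))
		<-events // master znode changed or was deleted: run the election again
	}
}
```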
(Student question about the "ready" file in the paper.) Yes — remember the text in the paper that says it's going to delete the ready file, then do a bunch of updates to files, and then re-create the ready file. That is a fantastic way of detecting and coping with the possibility that the previous lock holder, or the previous master or whoever it was, crashed halfway through — because, gosh, the ready file never got re-created.

(Student question.) In a Go program — yeah, sadly that is possible. So the question is nothing about ZooKeeper: if you're writing threaded code in Go and a thread acquires a lock, could it crash while holding the lock, halfway through whatever it's supposed to be doing? The answer is yes, there actually are ways for an individual thread to crash in Go — I forget exactly where they are, maybe divide by zero, certain panics — anyway, it can happen. My advice about how to think about that is that the program is now broken and you've got to kill it, because in threaded code the deal with locks is that while the lock is held, the invariants on the data don't hold. There's no safe way to proceed if the lock holder crashes, because all you know is that whatever invariants the lock was protecting no longer hold. If you do want to proceed, you have to leave the lock marked as held so that no one else will ever be able to acquire it, and unless you have some clever idea, that's pretty much the way you have to think about it in a threaded program, because that's the style in which people write threaded locking code. If you're super clever you could play the same kinds of tricks, like this ready-flag trick, but it's super hard in Go, because the memory model says there is nothing you can count on unless there's a happens-before relationship. If you play the game of changing some variables and then setting a done flag, that doesn't mean anything unless you release a lock and somebody else acquires the lock; only then can anything be said about the order in which — or even whether — the updates happened. So it's very, very hard in Go to recover from a crash of a thread that holds a lock; here, in ZooKeeper, it's maybe a little more possible.

OK, that's all I want to talk about with ZooKeeper. There are two high-level take-aways. One is the clever idea for high performance — reading from any replica — though it sacrifices a bit of consistency. The other interesting take-home is that they worked out an API that really does let ZooKeeper be a general-purpose coordination service, in a way that simpler schemes like put/get interfaces just can't: a set of functions that allows you to do things like write mini-transactions and build your own locks. It all works out, although it requires care.

OK, now I want to turn to today's paper, which is CRAQ. There are a couple of reasons we're reading the CRAQ paper.
for 57:41 fault tolerance and as we'll see the 57:43 properties you get out of crack or its 57:46 predecessor chain replication are very 57:49 different in interesting ways from the 57:52 properties you get out of a system like 57:54 raft and so I'm actually going to talk 57:58 about so crack is sort of an 58:00 optimization to an older scheme called 58:01 chain replication chain replications 58:08 actually fairly frequently used in the 58:11 real world there's a bunch of systems 58:12 that use it 58:14 crack is an optimization to it that 58:16 actually does a similar trick - 58:18 zookeeper where it's trying to increase 58:20 weed throughput by allowing reads to two 58:24 replicas to any replicas so that you get 58:26 you know number of replicas factor of 58:29 increase in the read performance the 58:32 interesting thing about crack is that it 58:34 does that while preserving 58:39 linearise ability 58:41 unlike zookeeper which you know it 58:43 seemed like in order to be able to read 58:44 from any replica they had to sacrifice 58:46 freshness and therefore snot 58:47 linearizable crack actually manages to 58:50 do these reads from any replica while 58:53 preserving strong consistency I'm just 58:56 pretty interesting okay so first I want 59:00 to talk about the older system chain 59:01 replication teen replication is a it's 59:10 just a scheme for you have multiple 59:11 copies you want to make sure they all 59:13 seen the same sequence of right so it's 59:14 like a very familiar basic idea but it's 59:17 a different topology then raft so the 59:21 idea is that there's a chain of servers 59:25 and chain replication and the first one 59:29 is called the head last one's called the 59:32 tail when a right comes in when a client 59:36 wants to write something say some client 59:39 it sends always Albright's get sent to 59:42 the head the head updates its or 59:46 replaces its current copy of the data 59:48 that the clients writing so you can 59:49 imagine be go put key value store so you 59:54 know if everybody started out with you 59:56 know version a of the data and under 59:58 chain replication when the head process 60:01 is the right and maybe we're writing 60:02 value B you know the head just replaces 60:04 its a with a B and passes the right down 60:07 the chain as each node sees the right it 60:11 replaces over writes its copy the data 60:13 the new data when the right gets the 60:17 tail the tail sends the reply back to 60:21 the client saying we completed your 60:23 right 60:25 that's how rights work reads if a client 60:30 wants to do a read it sends the read to 60:33 the tail the read request of the tail 60:35 and the tail just answers out of its 60:38 current state so if we ask for this 60:40 whatever this object was the tail which 60:42 is I hope current values be weeds are a 60:45 good deal simpler 60:52 okay so it should think for a moment 60:55 like why to chain chain replication so 60:59 this is not crack just to be clear this 61:01 is chain replication chain replication 61:03 is linearizable you know in the absence 61:08 of failures what's going on is that we 61:10 can essentially view it as really than 61:12 the purposes of thinking about 61:14 consistency it's just this one server 61:16 the server sees all the rights and it 61:19 sees all the reads and process them one 61:21 at a time and you know a read will just 61:24 see the latest value that's written and 61:25 that's pretty much all there is to it 61:27 from the point of view look if there's 61:29 no crashes what 
We should think for a moment about why chain replication — and to be clear, this is chain replication, not CRAQ. Chain replication is linearizable in the absence of failures. What's going on is that, for the purposes of thinking about consistency, we can essentially view the system as just this one server, the tail: the tail sees all the writes and all the reads, it processes them one at a time, a read just sees the latest value written, and that's pretty much all there is to it. From the point of view of consistency, if there are no crashes, it's pretty simple.

For failure recovery, a lot of the rationale behind chain replication is that the set of states you can see after a failure is relatively constrained, because of this very regular pattern in how the writes get propagated. At a high level, any committed write — any write that could have been acknowledged to the writing client, or exposed in a read — can only have been acknowledged or exposed if it reached the tail, and in order to reach the tail it had to pass through, and be processed by, every single node in the chain. So if we ever exposed a write, acknowledged it, or used it in a read, every single node in the chain must know about that write. We don't get situations like Figures 7 and 8 in the Raft paper, where there's hair-raising complexity in how the different replicas can differ after a crash. Either a write is committed, in which case it's known everywhere, or — if a write isn't committed — then before whatever crash disturbed the system, the write got to a certain point in the chain: it's everywhere before that point and nowhere after it, because writes always propagate down the chain in order. Those are really the only two situations.

At a high level, failure recovery is relatively simple too. If the head fails, then to a first approximation the next node can simply take over as head and nothing else needs to be done: any write that made it as far as the second node while the failed head was still head will keep on going and will commit. If a write made it to the head before the crash but the head didn't forward it, that write is definitely not committed, nobody knows about it, and we definitely didn't send an acknowledgment to the writing client, because the write never got down to the tail — so we're not obliged to do anything about a write that only reached a head that then failed. Maybe the client will re-send it, but that's not our problem here. If the tail fails, it's very similar: the next-to-last node can directly take over, because everything the tail knew, the node just before it also knows — the tail only hears things from the node just before it. It's a little more complex if an intermediate node fails, but basically what needs to be done is to drop it from the chain; there may be writes it had received that the next node hasn't received yet, so if we drop a node out of the chain, its predecessor may need to re-send recent writes to its new successor. That's the recovery in a nutshell.

Why this construction — why this instead of something else, like Raft, for example?
The performance reason is that in Raft, if you recall, we have a leader and some number of replicas, and the leader is not arranged in a chain: the replicas are all directly fed by the leader. If a client write comes in — or a client read, for that matter — the leader has to send it itself to each of the replicas, whereas in chain replication the head only has to send it once, and these sends on the network are actually reasonably expensive. That means the load on a Raft leader is going to be higher than the load on a chain-replication head, so as the number of client requests per second goes up, a Raft leader will hit a limit and stop being able to go faster sooner than a chain-replication head, because it's doing more work. Another interesting difference between chain replication and Raft is that in Raft the reads are also required to be processed by the leader — the leader sees every single request from clients — whereas here the head sees all the writes but only the tail sees the read requests, so there may be an extent to which the load is split between the head and the tail rather than concentrated in the leader. And, as I mentioned before, the analysis required to think about the different failure scenarios is a good deal simpler in chain replication than in Raft, and that's a big motivation, because it's hard to get this stuff correct.

(Student question.) Yeah — if the tail fails, but its predecessor had seen a write that the tail hadn't seen, then the failure of the tail basically commits that write: it's now committed because it has reached the new tail, so the new tail could respond to the client. It probably won't, because it wasn't the tail when it received the write, so the client may re-send the write, and that's too bad — we need duplicate suppression, probably at the head. Basically all the systems we're talking about require, in addition to everything else, suppression of duplicate client requests.

(Student question.) You want to know who makes the decisions about reconfiguration — that's an outstanding question. Let me rephrase it a bit: if there's a failure — suppose the second node stops being able to talk to the head — can the second node just take over? Can it decide for itself, "gosh, the head seems to have gone away, I'm going to take over as head and tell clients to talk to me instead of the old head"? What do you think? That's not a good plan: with the usual assumptions we make about how the network behaves, it's a recipe for split brain. If you do exactly what I said, what may really have happened is that the network failed: the head is totally alive and thinks its successor has died, while the successor — also actually alive — thinks the head has died. They both say "well, that other server seems to have died, I'm going to take over": the head says "I'll just be a sole replica and act as both the head and the tail, because the rest of the chain seems to have gone away", and the second node does the same thing.
Yes, another question: you want to know who makes the decisions about reconfiguration. That's an outstanding question. Let me rephrase it a bit: if there's a failure, say the second node stops being able to talk to the head, can that second node just take over? Can it decide for itself, "gosh, the head seems to have died, I'm going to take over as head and tell clients to talk to me instead of the old head"? What do you think? That's not a great plan. With the usual assumptions we make about how the network behaves, that's a recipe for split brain. If you do exactly what I said, then of course what really happened is that the network failed: the head is totally alive and thinks its successor has died, the successor is actually alive and thinks the head has died, and they both say, "well, gosh, that other server seems to have died, I'm going to take over." The head says, "I'll just be the sole replica and act as both head and tail, because the rest of the chain seems to have gone away," the second node does the same thing, and now we have two independent, split-brain versions of the data which will gradually get out of sync.

So this construction is not proof against network partitions and has no defense against split brain, and what that means in practice is that it cannot be used by itself. It's a helpful thing to have in our back pocket, but it's not a complete replication story. It is very commonly used, but in a stylized way in which there is always an external authority, not the chain itself, that makes the call on who's alive and who's dead and makes sure everybody agrees on a single story about what constitutes the chain. There's never any disagreement where some parties think the chain is one set of nodes and others think it's a different set.

That external authority is usually called a configuration manager. Its job is just to monitor the aliveness of all the servers, and every time the configuration manager thinks a server is dead, it sends out a new configuration in which the chain has a new definition: a new head, a new tail, whatever. The server that the configuration manager thinks is dead may or may not actually be dead, but we don't care, because everybody is required to follow the new configuration, so there can't be any disagreement; there's only one party making these decisions, and it's not going to disagree with itself. Of course, how do you make a service that's fault tolerant, doesn't disagree with itself, and doesn't suffer from split brain under network partitions? The answer is that the configuration manager usually uses Raft or Paxos, or, in the case of CRAQ, ZooKeeper, which itself is of course built on a Raft-like scheme.

So the usual complete setup in your data center is that you have a configuration manager, based on Raft or Paxos or whatever, so it's fault tolerant and does not suffer from split brain, and then you split your data up over a bunch of chains. If you have a room with a thousand servers in it, the configuration manager decides what the chains should look like: chain A is made of server 1, server 2, server 3; chain B is server 4, server 5, server 6; and so on. It tells everybody this whole list, so all the clients know it and all the servers know it, and the individual servers' opinions about whether other servers are alive or dead are totally neither here nor there. If a server in a chain really does die, the head is required to keep trying indefinitely until it gets a new configuration from the configuration manager; it's not allowed to make its own decisions about who's alive and who's dead.

What's that, what if the configuration manager itself fails? Oh boy, then you've got a serious problem. That's why you replicate it using Raft, make sure the different replicas are on different power supplies, the whole works. But the construction I've set up here is extremely common: it's how chain replication is intended to be used, and how CRAQ is intended to be used.
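A minimal sketch of that discipline, under the assumption that the manager stamps each configuration with an increasing epoch number (the names ChainConfig, Install, and Role are hypothetical): a server acts only on the newest configuration it has received, never on its own suspicions.

```go
package main

import "fmt"

// ChainConfig is a hypothetical configuration record a manager might hand
// out: an epoch number plus the chain members in head-to-tail order.
type ChainConfig struct {
	Epoch int64
	Chain []string // Chain[0] is the head, Chain[len-1] is the tail
}

// Server is a chain node that only ever follows the manager's configuration.
type Server struct {
	name string
	cfg  ChainConfig
}

// Install replaces the server's configuration if the epoch is newer; stale
// configurations are ignored.
func (s *Server) Install(c ChainConfig) {
	if c.Epoch > s.cfg.Epoch {
		s.cfg = c
	}
}

// Role reports what this server should be doing under its current config.
func (s *Server) Role() string {
	for i, n := range s.cfg.Chain {
		switch {
		case n != s.name:
			continue
		case i == 0:
			return "head"
		case i == len(s.cfg.Chain)-1:
			return "tail"
		default:
			return "middle"
		}
	}
	return "not in chain (stop serving)"
}

func main() {
	s := &Server{name: "S2"}
	s.Install(ChainConfig{Epoch: 1, Chain: []string{"S1", "S2", "S3"}})
	fmt.Println(s.Role()) // middle
	// The manager decides S1 is dead and pushes epoch 2; S2 must follow it
	// even if S2 itself still believes S1 is alive.
	s.Install(ChainConfig{Epoch: 2, Chain: []string{"S2", "S3"}})
	fmt.Println(s.Role()) // head
}
```

The epoch check is what lets every server safely ignore stale configurations, so even a server that was cut off for a while ends up following the same single story as everyone else.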
The logic of it is that chain replication, if you don't have to worry about partitions and split brain, lets you build very high-speed, efficient replication systems. We're sharding the data over many chains, and each individual chain can be built to be just the most efficient scheme for the particular kind of data you're replicating, read-heavy or write-heavy or whatever, without worrying too much about partitions, because all of that worry is concentrated in the reliable, non-split-brain configuration manager.

Okay, so the question is: why are we using chain replication here instead of Raft? It's a totally reasonable question. It doesn't really matter for this overall construction, because even if we used Raft inside each group, we would still need one party to make a decision, with which there can be no disagreement, about how the data is divided over our hundred different replication groups. In any kind of big system you're sharding, splitting up the data, and somebody needs to decide how the data is assigned to the different replication groups; this has to change over time as you get more or less hardware, more data, or whatever. So if nothing else, the configuration manager is saying, "keys starting with A or B go here, keys starting with C or D go there," even if you use Paxos within each group.

Then there's the smaller question of what to use for replication within each group: chain replication, Paxos, Raft, or whatever. People do different things. Some people do actually use Paxos-based replication; Spanner, which I think we're going to look at later in the semester, has this structure but uses Paxos to replicate writes to the data. The reason you might not want to use Paxos or Raft is that it's arguably more efficient to use the chain construction, because it reduces the load on the leader, and that may or may not be a critical issue for you.

A reason to favor Raft or Paxos is that they do not have to wait for a lagging replica. Chain replication has a performance problem here: if one of the replicas is slow, even for a moment, then, because every write has to go through every replica, even a single slow replica slows down all write operations, and that can be very damaging. If you have thousands of servers, probably at any given time seven of them are out to lunch, unreliable or slow because somebody is installing new software, who knows what, and it hurts to have every request limited by the slowest server. Whereas with Raft or Paxos, if one of the followers is slow it doesn't matter, because the leader only has to wait for a majority, not for all of them. Ultimately they all have to catch up, but Raft is much better at riding out transient slowdowns.
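As a toy illustration of that last point, with made-up numbers and a deliberately simplified model rather than anything from the paper: a chain write accumulates every node's delay on its way to the tail, while a majority scheme waits only for the fastest half of the followers.

```go
package main

import (
	"fmt"
	"sort"
)

// chainLatency: a write travels head -> ... -> tail one hop at a time, so its
// commit latency is roughly the sum of the per-node delays; one slow node
// stalls every write that has to pass through it.
func chainLatency(delays []float64) float64 {
	total := 0.0
	for _, d := range delays {
		total += d
	}
	return total
}

// majorityLatency: a Raft-style leader sends to all followers in parallel and
// commits once a majority of the cluster (leader included) has the entry, so
// it waits only for the fastest n/2 followers.
func majorityLatency(leaderDelay float64, followerDelays []float64) float64 {
	n := len(followerDelays) + 1 // cluster size
	need := n / 2                // follower acks needed beyond the leader itself
	if need == 0 {
		return leaderDelay
	}
	sorted := append([]float64{}, followerDelays...)
	sort.Float64s(sorted)
	return leaderDelay + sorted[need-1]
}

func main() {
	// Three replicas; the last one is briefly slow (say, a software install).
	fmt.Println(chainLatency([]float64{1, 1, 50}))    // 52: every write eats the 50ms
	fmt.Println(majorityLatency(1, []float64{1, 50})) // 2: the majority skips the slow one
}
```

The exact numbers don't matter; the point is that the chain's write latency tracks its slowest member, while the majority rule lets Raft ignore a transient straggler, at the cost of having to catch it up later.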
Some Paxos-based systems, although not really Raft, are also good at dealing with the possibility that the replicas are in different data centers, maybe far from each other: because you only need a majority, you don't necessarily have to wait for acknowledgments from a distant data center, and that can also lead people to use Paxos- or Raft-like majority schemes rather than chain replication. It depends very much on your workload and what you're trying to achieve, but this overall architecture, a configuration manager plus many replication groups, is, if not universal, extremely common.

Question from the audience about what failure patterns the network can actually produce. For a network that's not broken, the usual assumption is that all the computers can talk to each other through the network. For networks that are broken, because somebody stepped on a cable or some router is misconfigured, any crazy thing can happen. So yes, absolutely, due to misconfiguration you can get a situation where two nodes can both talk to the configuration manager, so the configuration manager thinks they're up, but they can't talk to each other. And that's a killer for this design: the configuration manager thinks they're up, they can't talk to each other, and it's just a disaster. If you need your system to be resistant to that, you need a more careful configuration manager, with logic that says, "I'm only going to form a chain out of servers that not only I can talk to, but that can also talk to each other," and explicitly checks that. I don't know if that's common, and I'm going to guess not, but if you were super careful you'd want to, because even though we talk about "network partition," that's an abstraction: in reality you can get any combination of who can talk to whom, and some combinations may be very damaging.

Okay, I'm going to wrap up. See you next week.