Transcript

Good evening, good night, wherever you are. Let's get started. Today I want to talk a little bit about chain replication; the paper assigned for today is from 2004.

Before diving into the paper, a couple of quick logistical things. One: I want to remind you that we have a quiz on Thursday. The instructions and the topics that are covered are on the schedule page, and we'll send out an announcement on Piazza with more details about exactly how we'll do the quiz. It's going to be on Gradescope, basically during class hours, 80 minutes, but more details to follow.

The second thing I want to remind people of is projects. If you would like to do a project instead of lab 4, you can do so, but you should submit a proposal for the project — just a couple of paragraphs — through the submission website, so that we can give you feedback and tell you whether the project is actually appropriate for a final project in 6.824. If you're just planning to do lab 4, there's absolutely nothing you have to do.

Any questions about these two logistical points?

Okay, then let me move on to one other point I wanted to bring up, which is a correction from a lecture a little while ago. We walked through the Go code for my Raft implementation of 2A and 2B, and we talked a little bit about the Go defer statement. I mentioned that you can have a defer statement inside of a block, and that is correct. I think it was Philippe who asked exactly when it executes, and I answered that question incorrectly. The deferred statement gets executed at the point where you return out of the function, not when you exit the enclosing block. I apologize if that caused any confusion.
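To make the corrected defer behavior concrete, here is a small self-contained Go example (my own illustration, not from the lab code): the deferred calls registered inside the loop do not run when each iteration's block ends; they all run, in last-in-first-out order, when the surrounding function returns.

```go
package main

import "fmt"

func example() {
	for i := 0; i < 3; i++ {
		// Each defer is registered here, but none of them runs when this
		// loop-body block ends -- they all run when example() returns.
		defer fmt.Println("deferred:", i)
	}
	fmt.Println("end of loop")
	// Output:
	//   end of loop
	//   deferred: 2
	//   deferred: 1
	//   deferred: 0
}

func main() {
	example()
}
```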
Any questions about that clarification?

Okay, good. Then the two technical topics I want to talk about today are the ZooKeeper locks, which I didn't get to finish last time, and then chain replication.

Both for chain replication and for ZooKeeper, we're still in the same context as before, namely replicated state machines. The usual diagram: we have a service that runs on top of some replication library, like ZAB or Raft. We have clients talking to the service; in the case of ZooKeeper, a client might send a create call. ZooKeeper internally holds some state — znodes hanging off one another, forming a tree — and when an operation comes in, ZooKeeper forwards that operation to the Raft/ZAB library. The library does some chatting back and forth to get a majority of the servers to accept the command, and at some point, once it's accepted, it comes back out; the server applies the operation and sends a response to the client. That's the standard replicated-state-machine story: if all the replicas start in the same state and apply the same operations in the same order, they end up in the same state, and so any of the machines can take over if necessary.

One of the things that was interesting about ZooKeeper is that read operations can be served from any one of the servers. This lets ZooKeeper get extremely high read performance, because you can scale the number of read operations with the number of servers. The flip side is that, in that scenario, ZooKeeper gave up on linearizability. We know from Raft, for example, that you can't arbitrarily serve a read from any server, because that server may not have seen the latest updates yet, and the same is true for ZooKeeper. So the operations that ZooKeeper defines don't provide a linearizable interface. Nevertheless, we saw that it provides a slightly different correctness guarantee than linearizability, and that guarantee is useful — useful enough to write real programs. The particular class of programs ZooKeeper focuses on is what they call configuration, or coordination, programs.

The way to think about it is that a lot of the systems we've looked at typically have some replication story, and then they have a coordinator or a master that coordinates the group. ZooKeeper is really intended as a service for that kind of master or coordinator role, and it provides a bunch of primitives to make that doable. We talked a little bit about atomic increment last week, along with some of the other services, and I want to finish off by talking about locks — one, because there were a lot of questions about them, and two, because they're actually quite interesting. There are two different lock implementations; let's start with the simple one. Let me write down the pseudocode and then we can talk about it in a little more detail.

The pseudocode for the lock is something like this. Acquire: in an infinite loop, try to create the lock file — call it lf — with the ephemeral flag set to true; we'll talk about why in a second.
If the create succeeds, then that client was the first one to create the file: it successfully acquired the lock, so it breaks out of the loop and returns. If the client was not able to create the file, it calls exists. The call to exists is not really to see whether the file exists — we already know it does — but to set a watch. The idea is that the watch will fire if the file disappears; when it disappears the client gets a notification, so all we do here is wait for that notification and then go around the loop again.

That's the acquire operation. The release is very simple: it does nothing more than send a delete operation to the ZooKeeper service for the lock file, lf. What does that do? When the ZooKeeper servers perform the delete, the file goes away, which fires the watch, and every client that is waiting gets a notification. They all retry; one of them succeeds in creating the lock file and proceeds, and the other ones go back into their call to exists and wait for the next notification. ZooKeeper's semantics — linearizability for write operations, plus the rules for when notifications fire — are strong enough that this implements a faithful lock: if many clients try to get the lock at the same time, only one gets it, and when the release happens, or the file is deleted, only one client in the next round gets it.

So that's cool, and it's interesting that you can build this kind of foundational primitive out of the primitives ZooKeeper offers. You can see the role of the watch above, and then there's the role of the ephemeral flag. The ephemeral flag is there because of what happens if a client fails or crashes before it calls release. The semantics of ephemeral files are that if the ZooKeeper service decides the client has crashed, it removes the file on behalf of the client. So even if the client fails or crashes, the server at some point decides the client is gone and removes the lock file lf, which causes notifications to be sent to the other clients that are waiting for it. It's a cool set of primitives with which you can build a powerful abstraction that's useful in applications.
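As a rough sketch of that acquire/release logic — written against a hypothetical ZooKeeper-like client interface, not the real client library API — it might look something like this in Go:

```go
package zlock

// ZK is a hypothetical ZooKeeper-like client interface used only for this
// sketch; the method names and signatures are assumptions, not the real API.
type ZK interface {
	// Create makes the znode and returns a non-nil error if it already exists.
	Create(path string, data []byte, ephemeral bool) error
	// ExistsWatch reports whether the znode exists and returns a channel
	// that fires when the znode is deleted.
	ExistsWatch(path string) (exists bool, watch <-chan struct{}, err error)
	Delete(path string) error
}

// Acquire blocks until this client has created the lock file lf.
func Acquire(zk ZK, lf string) error {
	for {
		err := zk.Create(lf, nil, true /* ephemeral */)
		if err == nil {
			return nil // we created the lock file: we hold the lock
		}
		// Someone else holds the lock: set a watch and wait for the file
		// to disappear, then retry.
		exists, watch, err := zk.ExistsWatch(lf)
		if err != nil {
			return err
		}
		if exists {
			<-watch // every waiter wakes up here (the herd effect below)
		}
	}
}

// Release deletes the lock file, which fires the watches of all waiters.
func Release(zk ZK, lf string) error {
	return zk.Delete(lf)
}
```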
One downside of this particular implementation is that it has what's called the herd effect. Say you have a thousand clients that all want to create the lock file and acquire the lock. One of them succeeds, and 999 call exists and wait for a notification. Then, when the first client deletes the file — releases the lock — all 999 try to acquire the lock again; of course only one succeeds, and 998 go back to waiting for a notification. Every round of this generates a huge amount of traffic, basically bombarding the ZooKeeper servers, because all but one of the attempts are going to fail. That's an undesirable property, this herd effect, and it's a real problem in practice, both on small-scale multi-core machines and certainly in a setting like this where network messages are not free.

So it's interesting that ZooKeeper provides enough primitives that you can do quite a bit better: you can build a lock that doesn't suffer from the herd effect — a better lock. Let me pull up the pseudocode for this, which is in the paper, so we can look at it and discuss why this lock is better. In particular, what we'll see is that this lock is better because there's no retry where all the clients that failed to get the lock retry at once; instead the clients form a line and get the lock one by one. The way you program that with ZooKeeper's primitives is this pseudocode, and there are a couple of differences compared to the previous one.

First, there's an additional flag passed to create, namely sequential, which means that each lock file that gets created is numbered: the first one is lock-0, the next one is lock-1, and so on. So if a thousand clients rush to the servers to acquire the lock, a thousand files get created, numbered from 0 to 999. Every client succeeds in creating a file, and the create returns the sequence number it got: the first client creates lock-0 and gets 0 back, the second gets 1 back, et cetera. Then the client asks for all the children of the directory under which these files are created — in this case maybe a thousand znodes — and checks whether its own n is the lowest znode in C. If it is, it holds the lock. That makes sense: the first client got 0 back, and all the other clients have higher numbers because they're sequentially numbered, so the first client gets the lock.

All the other clients find p, the number right in front of their own — for example, the client that got lock-10 back looks for lock-9 — and put a watch on that file. So every client has a watch on its predecessor, and the clients form a line. Each client then waits for its notification to fire. When client 0 releases the lock, it deletes its file, which fires the watch for the client right behind it; that client runs — and it's the only one that runs — and it succeeds.
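A rough sketch of this herd-free scheme, again against a hypothetical sequential-znode client (names and signatures are assumptions, not the real API), might look like this; it assumes the server zero-pads the sequence numbers so that lexicographic order matches numeric order:

```go
package zlock

import "sort"

// SeqZK is a hypothetical client interface with sequential znodes.
type SeqZK interface {
	// CreateSequential creates dir + "/lock-<n>" with a server-assigned,
	// monotonically increasing n, and returns the full path created.
	CreateSequential(dir string, data []byte, ephemeral bool) (path string, err error)
	Children(dir string) ([]string, error)
	// ExistsWatch reports whether the znode exists and returns a channel
	// that fires when it is deleted.
	ExistsWatch(path string) (exists bool, watch <-chan struct{}, err error)
	Delete(path string) error
}

// AcquireTicket is the herd-free lock: each client watches only the znode
// immediately before its own, so one release wakes exactly one waiter.
func AcquireTicket(zk SeqZK, dir string) (myPath string, err error) {
	myPath, err = zk.CreateSequential(dir, nil, true)
	if err != nil {
		return "", err
	}
	for {
		children, err := zk.Children(dir)
		if err != nil {
			return "", err
		}
		sort.Strings(children) // zero-padded sequence numbers sort in order
		if dir+"/"+children[0] == myPath {
			return myPath, nil // lowest znode: we hold the lock
		}
		// Find p, the znode just before ours, and wait for it to go away.
		var prev string
		for _, c := range children {
			full := dir + "/" + c
			if full >= myPath {
				break
			}
			prev = full
		}
		exists, watch, err := zk.ExistsWatch(prev)
		if err != nil {
			return "", err
		}
		if exists {
			<-watch
		}
		// Re-check the children rather than assuming we now hold the lock:
		// our predecessor may have crashed rather than released.
	}
}

func ReleaseTicket(zk SeqZK, myPath string) error {
	return zk.Delete(myPath)
}
```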
These are sometimes called ticket locks in multi-core programming, if you're familiar with them; it's the same idea as a ticket lock, except here it's built out of ZooKeeper's primitives. And again, what's interesting is that these primitives are powerful enough that you can build these kinds of locks. Any questions about this?

Okay, I want to make one more comment about these locks before moving on to chain replication.

TA: We have a question in the chat.

Instructor: Okay, what's the question in the chat?

TA: What is the watch on line 4 for?

Instructor: Good — going back to this: my comment about the watch actually belongs with line 5; there's no watch on line 4. Line 4 just finds p, the number right before your n. If that doesn't answer the question, please come back to it later, that's fine.

Student: I actually have another question. This is going back a few slides, but how does ZooKeeper determine that the client has failed, and thus release the ephemeral lock? What if it's just partitioned for a moment?

Instructor: Yes, that could be happening. The client has a session with the ZooKeeper service, and the client and ZooKeeper basically send heartbeats to each other. If the ZooKeeper service doesn't hear from the client for a little while, it just decides the client is down and closes the session. The client can still try to send messages on the session, but the session is closed — it's gone — and any ephemeral files that were created during that session are deleted.
So if the network partition heals, the client will try to send messages over that session, and the ZooKeeper servers will say: that session doesn't exist anymore; you have to start a new session.

Student: Got it, thank you.

Instructor: Okay, good. There's one important point about these — what I'll call z-locks, ZooKeeper locks — which is that they do not have the same semantics as the locks you've been using, the Go locks or mutexes. It's an important point to realize: even though they're different, as we'll see in a second they're still useful, but they're not as strong as the Go locks. The interesting case is when the lock holder fails. If the lock holder fails — ZooKeeper decides the lock holder has failed, as we just discussed — then it is possible that we're going to see some intermediate state. Remember, the whole role of a lock is to protect a critical section: some invariant is true when you enter, the invariant may not hold while you're inside the critical section, and at the end you re-establish the invariant. Here, a client acquires the lock, does some steps, and then maybe ZooKeeper decides the client has crashed and revokes the lock — but the system might be left in some intermediate state where the invariant doesn't hold. So it's not the case that these locks guarantee atomicity of a critical section.

So what are they useful for? There are two primary use cases. One is leader election: if we have a set of clients that need to select a leader among themselves, they can all try to create the lock file; the one that succeeds becomes the leader. That leader can clean up any intermediate state if necessary, or do its updates atomically using the ready trick, where you do a bunch of writes to some file but expose the file only at the very end, making a set of writes more transactional. That's one use case for these kinds of locks.

The second use case is what I'll call soft locks. The way to think about it: say we have a worker, MapReduce style, and we want to arrange that each map task gets executed by only one worker. One way to do that is to take out a lock for that particular input file, run the computation, and release the lock once the mapper is done. In the common case, this causes only one mapper to execute a particular task, which is exactly what we want.
But of course, if the mapper fails, the lock gets released, and the task might execute a second time, because somebody else will acquire the lock. For MapReduce, that's perfectly fine: it's okay if the task gets executed twice. In some ways it's really more of a performance optimization: in the usual case the task executes only once, but if there's a failure a map task might execute twice, and MapReduce is usually set up in such a way that that's okay. In those cases these sorts of locks are really useful too.

Any questions about this perspective on locks — that the ZooKeeper locks are not exactly like the Go locks? It's an important thing to keep in mind. All right, go ahead, Alexander.

Student: Yeah, I had a question. You said that one of the differences is that with z-locks, if the server holding the lock dies, the lock can be revoked. But that only happens with the ephemeral flag, right? So can we just emulate the Go locks by not passing ephemeral?

Instructor: Okay, good — what would happen then? You've created a persistent file, the client dies, and you've got a deadlock: the lock will keep on existing and nobody will release it, because the one client that could release it is dead or crashed. In fact, that's exactly why the ephemeral flag is there.

Student: Is it actually the only one that could release it? Anyone can delete that file, right — you could have some background process do it.

Instructor: But that would break things: maybe the other client is still running and still thinks it holds the lock.

Student: That's true.

Instructor: Now you're basically rediscovering the consensus problem all over again. This is a clean way to get most of it, but not all of it, if you will. And if you want to make a set of writes atomic, you use the ready trick, where you do a bunch of writes and then expose them all at the same time.

Student: Could you explain the soft locks again?

Instructor: Okay. Soft locks means that an operation can happen twice. In the common case, if there are no crashes, it happens once, because the client takes the lock, does the operation, and releases. But if the client failed halfway through, for example, then the lock would be automatically released by ZooKeeper, and maybe a second client will execute the same map task.
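A sketch of that soft-lock pattern, reusing the hypothetical ZK interface and Acquire/Release from the earlier lock sketch (the path layout here is made up, and the task itself must tolerate being run more than once):

```go
package zlock

import "fmt"

// runMapTask illustrates the "soft lock" usage pattern: the lock only makes
// duplicate execution unlikely in the common case. If this worker crashes
// mid-task, the ephemeral lock file disappears and another worker may re-run
// the task, so correctness must not depend on exactly-once execution.
func runMapTask(zk ZK, task int, doMap func(task int) error) error {
	lf := fmt.Sprintf("/mr/task-%d-lock", task) // hypothetical path layout
	if err := Acquire(zk, lf); err != nil {
		return err
	}
	defer Release(zk, lf)
	// doMap must be safe to execute more than once, e.g. by writing its
	// output to a temporary file and renaming it atomically at the end.
	return doMap(task)
}
```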
execute the same 24:36 map task 24:41 so in the case of later election i uh 24:44 what's the intermediate state that could 24:45 get exposed here 24:47 it seems that uh the first yeah okay in 24:49 the pure leader election there would be 24:51 no intermediate state but typically the 24:52 leader will create a configuration file 24:54 right as we saw in zookeeper where you 24:56 know uh 24:57 using the ready trick i see and so you 25:00 just write the whole file and then 25:02 convert it atomically as we 25:03 name it okay thank you um sorry could 25:07 you 25:08 explain what the ready trick is 25:12 i was hoping not to because i think we 25:14 talked about it last time sorry 25:16 all right so uh maybe we 25:19 you can hold that question and well i'm 25:20 happy to do it at the end of the lecture 25:21 again 25:26 because otherwise i have little time to 25:27 actually talk about uh chain replication 25:35 any other last-minute questions 25:40 okay let me set the chain uh the states 25:42 were chain replication a little bit 25:44 and that will also come back to 25:45 zookeeper in some sense 25:47 uh and basically it turns out there's 25:50 sort of two 25:50 common approaches to build replicated 25:53 state machines 25:54 and we really haven't called out these 25:55 two approaches you know we've seen them 25:57 but i've really talked 25:58 explicitly about them and i want to do 26:00 this at this time explicitly 26:01 so 26:05 because there are some interesting 26:06 observations to be made 26:08 approaches to building replicated state 26:10 machines 26:14 and the first one is 26:18 the one we basically have seen in the 26:19 labs which is you run 26:21 all operations 26:25 you know through raft 26:30 graft which you keep a raft or you know 26:32 access whatever 26:33 you know consensus you know distributed 26:36 consensus algorithm that you're using 26:38 and so this is sort of like the key 26:39 value store right in 26:41 lab 3 where you know you do put a get 26:43 operation you run all the put in get 26:45 operations through raft 26:47 and you know the surfaces basically 26:49 update you know the key 26:51 to our state as the operations are 26:53 coming in 26:54 on the applied channel and uh 26:57 you know and basically we have our 26:59 replicated state machine 27:00 so this is sort of like how lab3 works 27:06 it turns out that style where basically 27:09 raft is used to also 27:10 uh run all the operations it's actually 27:12 not that common 27:13 uh we'll see some other designs later in 27:15 semester to do that too like 27:16 spanner does it but it's not actually 27:19 completely 27:20 the standard approach or the more common 27:22 approach actually is to 27:24 have a configuration server like 27:26 zookeeper 27:33 service and the configuration service 27:35 itself internally you know might use 27:37 paxos raft or uh 27:41 uh or zap or whatever and uh and 27:44 really the configuration services really 27:46 plays the role of the 27:48 coordinator or the master like the gfs 27:50 master 27:52 and in addition to basically having 27:54 configuration services actually 27:56 implemented 27:57 uh using you know one of these uh rav 27:59 texas algorithms 28:01 you actually run a primary backup 28:04 application 28:12 and so think about gfs you know or that 28:15 we saw early in the semester 28:17 has that sort of structure right in the 28:18 gfs that was a master 28:20 and that basically determined you know 28:22 which set of servers hold the particular 28:24 chunk 28:24 and so basically 
Then that replica group executed primary/backup replication: one of the chunk servers was the primary and the others were backups, and they had a protocol they used for primary/backup replication. You can think about VMware FT in a similar style: the configuration server is basically the test-and-set server, which records who the primary is, and then the primary and backup have their own protocol — sending operations down the logging channel — so that the primary and backup stay roughly in sync and implement a replicated state machine. This second approach tends to be the more common one, although approach one also happens.

One way to think about why: if the Raft state — for example, our key/value store in the lab — were gigantic, with a huge amount of state, terabytes of key/value data, would Raft really be a good match for that kind of application? What's the risk, or the potential problem?

Student: We flush the log very often, so maybe that could be problematic.

Instructor: That could be problematic, yes. And what's the size of a checkpoint if our key/value server is really big?

Student: It's linear in the size of the key/value data.

Instructor: Yes — so the checkpoints could also be gigantic. Any time a checkpoint has to be sent, it's going to be a big checkpoint, and Raft isn't really set up for that: the leader is going to communicate these snapshots, as you did in lab 2D, to the other servers, and they're going to be big. So you often want a somewhat more clever plan to re-synchronize new servers. That's one reason these systems are often split into two different pieces: a configuration service that is small in terms of state, and a primary/backup scheme that may replicate a huge amount of data. That's one reason you see both approaches.

Does that make sense? I'll come back to this at the end of the lecture, but it's important to keep in mind. So what benefit does approach one give over approach two? You don't have to have two components: in approach one you have Raft, and you run both the operations and the configuration through it, so everything is in a single component; in approach two we have two components — a configuration service that uses Raft, and a primary/backup scheme. Maybe this will become clearer as I talk about chain replication.

Student: Yeah, I had a really quick question. For approach two, what would the advantage be — that consensus is reached through the leader and the leader never fails?
Instructor: The advantage of two, as we'll see in a second with chain replication, is that there's a separate service that takes care of the configuration part, and you just don't have to worry about it in your primary/backup replication scheme. That service decides — like the master in GFS — which set of servers forms a particular replica group, and the primary/backup protocol doesn't have to think about it.

Student: Thanks.

Instructor: And this is a good introduction to chain replication, because chain replication is exactly a primary/backup replication scheme for approach two. That is to say, chain replication assumes there is a configuration service — I think it's called the master process in the paper. Chain replication itself has a couple of cool properties. One: read operations — they call them query operations — involve only one server, namely the tail, as we'll see in a second. Another nice property is that it has a very simple recovery plan — presumably something you've started to appreciate, given how complicated recovery can be in Raft — and we'll talk about all of this in more detail in a moment. It also provides strong guarantees, namely linearizability, for the put and get operations. And finally, since a lot of people ask this: it's a reasonably influential design, used by quite a number of systems; it has been used in practice. I'm going to talk about each of these pieces in a little more detail, and then we'll come back to the approach-one-versus-approach-two comparison.

So, as an overview of the lay of the land: there is a master process, or configuration service, and it keeps track of which servers belong to a particular chain — say S1, S2, S3 — basically a record of what the chain is, who the head is, and who the tail is. That's the configuration server, and then we have our servers, S1, S2, S3. One of them is the head, typically the one with the smallest number, and one is the tail.

If we have a client, the client may talk to the configuration server to learn who is part of the chain, and then it sends its write request to the head. That's the protocol in chain replication: a write request always goes to the head. The head applies the operation to its state — maybe it has a disk associated with it where it stores the key/value data — and then it sends the update, the result of the operation, down the chain, in FIFO order and reliably. So S1 sends the update to S2.
S2 applies the operation — the state change — to its own state (and its disk, if it has one), and once it has applied it, it forwards it to the next node in the chain, which here is the last one, because there are only three nodes in this particular chain; you could have longer chains if you want more availability. When the last node gets the message — the state change — it applies it to its state, and then it is the one in charge of sending the acknowledgement back to the client. So it's the tail that sends the acknowledgement back.

One way to think about this is that when the tail — in this case S3 — applies the state change, that is the commit point. The reason it's the commit point is that subsequent reads always come from the tail: if any other client does a read operation, it always goes to the tail, and the tail responds immediately. So reads go to the tail: here client 1 writes, and client 2 does a read; the read goes to the tail, the tail responds, and that's it.

There are a couple of things I want to point out. One interesting point is that read operations involve only one server. If you remember from lab 3 — or if you're starting lab 3 — read operations in our implementation go through the Raft log and all that. The Raft paper discusses an optimization, but even then the read operation always goes to the leader, and the leader first has to contact a majority of the servers before it can execute the operation locally. What you see here is that reads go to a completely different server from writes, so the read and write workload is spread over at least two servers. Furthermore, a read involves only one server: the tail never has to talk to any other server, it can just respond immediately — and we'll see a bit later what further optimizations that allows. And the commit point is really the point at which the write happens at the tail, because that is when the write becomes visible to readers, and not at any earlier point.

This also gives us linearizability. It's pretty easy to see that, in the case of no crashes, this scheme guarantees linearizability: the writes are all applied in some total order at the head, and when the tail receives an update — the commit point — it responds to the client; if that same client immediately does a read, the read goes to the tail and observes the last change.
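A minimal sketch of this write path — a single-process toy, assuming reliable in-order delivery and modeling the tail's acknowledgement as a channel, so not the paper's actual protocol code — might look like this in Go:

```go
package chain

import "sync"

// Server is a toy chain node. Real chain replication would use reliable
// FIFO network channels (e.g. TCP) between servers plus a configuration
// service; here "next" is just a pointer to the successor, nil at the tail.
type Server struct {
	mu   sync.Mutex
	kv   map[string]string
	next *Server
}

type Update struct {
	Key, Val string
	Ack      chan struct{} // closed by the tail: the commit point
}

// Write is called on the head by clients; it returns only once the tail
// has applied the update (i.e. once the write is committed).
func (s *Server) Write(key, val string) {
	u := &Update{Key: key, Val: val, Ack: make(chan struct{})}
	s.apply(u)
	<-u.Ack
}

// apply applies the update locally, then forwards it down the chain in
// order; the tail acknowledges instead of forwarding.
func (s *Server) apply(u *Update) {
	s.mu.Lock()
	if s.kv == nil {
		s.kv = make(map[string]string)
	}
	s.kv[u.Key] = u.Val
	s.mu.Unlock()
	if s.next != nil {
		s.next.apply(u)
	} else {
		close(u.Ack) // commit point: the write is now visible to readers
	}
}

// Read is served only by the tail, from its local state, with no
// communication with the rest of the chain.
func (tail *Server) Read(key string) string {
	tail.mu.Lock()
	defer tail.mu.Unlock()
	return tail.kv[key]
}
```

The key design point the sketch tries to show is that the acknowledgement originates at the tail, which is exactly what makes the tail's apply the commit point.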
So certainly within a single client, all operations are totally ordered, and it's pretty easy to see across clients too: if client 2 starts a read after client 1's write has finished — and it's finished when the tail has responded — then any read that starts after a write will observe the result of the most recent write. So it's easy to get the intuition that this provides linearizability.

Okay. What I'd like to do now is take a quick breakout-room section, where I'd like you to discuss the question that was posted for this lecture: what could go wrong — would it break linearizability — if, instead of having the tail respond to the client, the head responded to the client immediately after it received the write request? Maybe that's a good topic to debate a little bit, and if you want to go in any other direction and talk about chain replication more broadly, you're welcome to, but that's something to start with. Let's take a five-minute breakout room. Jose, are you going to set it up? Do I have to do anything to enable that?

TA: I don't think it's necessary; I think Zoom changed, so it should be possible now.

Instructor: Yep, that's right.

[Breakout rooms, about five minutes.]

Instructor: Okay, are we coming back?

TA: Yeah, whenever you're ready.

Instructor: Okay, I think I can close the rooms. Good. So, just very quickly, to summarize why that would break linearizability: the protocol change that was contemplated was to keep propagating to S1, S2, and S3, but as soon as S1 is done with its part, it responds back to the client. Clearly that would break linearizability: say the client does a write and gets the acknowledgement back from S1 while the write is still in progress at S2 and S3. Maybe before S2 even contacts S3, the client sends a read operation to S3, and of course it will return the value from before the write. So the client doesn't even observe its own write, and that clearly breaks linearizability. So it's very important, as I said earlier, that the tail sends the acknowledgement back to the client, because the point where the tail has processed the write is really what the commit point is.

Any questions about that?

Okay. So that's normal operation, and now I want to talk a little bit about crashes — since this is 6.824, distributed systems, all the action is when failures happen. One of the things that's cool about chain replication is that the number of failure scenarios is actually quite limited.
There are basically three cases: the head fails, one of the intermediate servers fails, or the tail fails. Let's look at each of those cases. Here's the setup: we have a head, S1; say it has applied three updates, U1, U2, and U3. It talks to S2, and maybe S2 has applied U1 and U2. And we have S3, which is the tail, and it has only applied U1 so far. The client was talking to S1, and we now want to think about what needs to happen if one of these servers crashes.

Let's start with the case where the head crashes. What needs to be done? Is this an easy case or a hard case?

Student: I hope it's an easy case: you can just cut off the head and make S2 the head.

Instructor: Yes, we can just promote S2. What happens is that the configuration server discovers — decides — that S1 is gone, and then it promotes S2 to be the head for subsequent operations, and clients in the future talk to S2. And why is this correct? What operation have we lost?

Student: We lost U3.

Instructor: Is that a problem? Is it valid to lose operations?

Student: Yes, it's fair game to lose U3.

Instructor: Correct — U3 has not been committed, because only operations that have reached the tail are committed, so it's just as if the operation never happened; the client could not even have observed that U2 or U3 happened. So it's perfectly fine to do this.

Why is it important that the configuration server is involved here? Could S2 decide on its own to become the head — say S2 can't talk to S1 anymore and decides, whatever, I'll become the head. Would that be valid?

Student: Wouldn't that maybe create a split brain?

Instructor: Yes, that would create split brain, because S2 might just be partitioned from S1, and now both are heads, and maybe both are processing commands, so we violate this whole property of having a total order.

Student: Does S2 even know that S1 is the head?

Instructor: It got that from the configuration information the previous time around: when the configuration service decides on a new configuration, it tells all the servers, and the clients that care, here's the new configuration.

Student: Wait, does this only happen when the S1-to-S2 connection is severed? What causes the split brain again?
Instructor: Split brain would happen if S2, on its own, decided that S1 had failed and became the head. We're not allowed to have that happen, and the way it works out in practice is that there is a configuration server that decides what the current configuration actually is. If it decides that S1 is dead, it informs S2 and S3: you two are now the new chain, and S2 is the head. And when that change happens — S1 is dropped — nothing else has to be done, because the only update we lost is the one that was not committed anyway, so there's nothing further to repair. Going from three replicas to dropping the head is a pretty straightforward operation.

Student: I have a question. There's an assumption here that the commands that leave S1 arrive in order at S2. Is that a reasonable assumption?

Instructor: The paper basically says we need a reliable FIFO channel between S1 and S2, and from S2 to S3, and I think the way they implement that is probably with a TCP connection.

Student: Okay, thanks.

Instructor: Okay, let's look at the second case. We have S1, S2, S3 — there could be more servers in the chain, but three is enough to consider all the cases — and now we take the case where a middle server crashes. S2 crashes, and the configuration server at some point decides that S2 has crashed and informs S1 and S3 that they form the new chain. What else needs to happen? In the first case, when the head dropped, nothing had to be done other than updating the chain. Now we're updating the chain again, and the question is whether anything else needs to happen.

Student: S1 needs to send S3 the requests that it sent to S2 but that didn't make it to S3.

Instructor: Exactly right. S1 has U1, U2, and U3; S2 had seen U1 and U2; S3 has seen only U1, with U2 still in progress. So S1 has to bring S3 up to date with U2 and U3. So there's actually a little bit of work involved.

Let's consider the final case. Again we have the three servers S1, S2, S3, and now the tail crashes. At some point the configuration server notices and decides that the new chain is S1 and S2, and tells S1 and S2 that they're part of the new chain. What else needs to happen? Let's write it down: S1 has seen U1, U2, and U3; S2 has seen U1 and U2. Who becomes the new tail in this scenario?

Student: S2.

Instructor: Yes, S2 becomes the new tail. Anything else that needs to happen?

Student: I guess the client needs to be informed that S2 is the tail.

Instructor: Yes, and the client learns that from the configuration server. But nothing else has to happen, because no committed operations are lost — and U3 still needs to be propagated from S1 to S2, which will just happen through the normal protocol.
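Putting the three failure cases together, here is a sketch of what the configuration service's side might look like; the Config type, its hooks, and the helper are hypothetical names for illustration, not the paper's master protocol.

```go
package chain

// Config is a hypothetical stand-in for the configuration service; a real
// implementation would RPC the servers instead of calling function fields.
type Config struct {
	tellToResend func(pred, succ string) // pred re-sends missing updates to succ
	announce     func(chain []string)    // tell servers/clients the new chain
}

func index(chain []string, id string) int {
	for i, s := range chain {
		if s == id {
			return i
		}
	}
	return -1
}

// handleFailure removes the dead server from the chain and does whatever
// repair that position requires, then announces the new chain.
func (cfg *Config) handleFailure(chain []string, dead string) []string {
	i := index(chain, dead)
	newChain := append(append([]string{}, chain[:i]...), chain[i+1:]...)

	switch {
	case i == 0:
		// Head failed: just promote the next server. Any update the old
		// head had accepted but not yet propagated was never committed
		// (it never reached the tail), so losing it is fine.
	case i == len(chain)-1:
		// Tail failed: its predecessor becomes the tail. Nothing committed
		// is lost, because the predecessor has seen at least every update
		// the old tail had applied.
	default:
		// Middle server failed: its predecessor must re-send to its
		// successor the updates (U2 and U3 in the example above) that the
		// successor has not seen yet, then resume normal forwarding.
		cfg.tellToResend(chain[i-1], chain[i+1])
	}
	cfg.announce(newChain)
	return newChain
}
```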
Okay, so dropping the tail is also reasonably straightforward. Dropping the tail and the head are reasonably straightforward, and dropping the middle one is a little more complicated, but not much more. The key thing I want to emphasize here is: how does this compare to figures 7 and 8 in the Raft paper?

Student: ...entries that have been automatically committed?

Instructor: Sorry, I didn't hear you — it was a pretty noisy connection.

Student: I was just saying that if S2 becomes the new tail, don't we have to send acknowledgements back to the client, since some entries have been automatically committed?

Instructor: Yes, that might be the case. What will happen is that the client is probably going to retry, and we have to have a separate deduplication scheme anyway, like in lab 3. There are probably a couple of different ways to handle it; the paper is actually not particularly clear about which one it takes.

Student: Thank you. So in that case the paper just says that even if it doesn't respond, the operation could or could not have succeeded?

Instructor: Right. Okay — back to my original question: how does this picture on the whiteboard contrast with figures 7 and 8?

Student: It's simpler.

Instructor: Yes, and that's the key point I wanted to get across: there aren't that many cases to consider here, basically three, which is quite a bit simpler than the Raft paper, where there are many configurations to consider and the scenarios are quite complicated. Part of that is because it's a chain: things are pushed down the replication chain in a very straightforward manner. And part of that is, of course, that the configuration part is outsourced to the configuration manager. But for the primary/backup part, the recovery plan is reasonably straightforward: there are only three configurations to consider.

Now, one more point I want to make: how to add a replica. In any system that you're going to run for real, at some point you have to add new servers, because otherwise you keep losing them: you start with three, then you have two, then one, then zero, and then you're unavailable. So you have to be able to add new replicas. Let's consider the case where S1 is the head and S2 is the tail, and we want to bring up S3. It turns out, as the paper describes, that it's most convenient to do this at the tail end: make the new server the new tail.
The way that proceeds is: the client is talking to S2, because S2 is the current tail. S3 comes up, and the first thing that happens is that all the state is copied from S2 to S3. This may take tens of minutes, or maybe multiple hours, if we're copying gigabytes or terabytes of data from S2 to S3. While that's happening, S2 can keep serving requests; it does have to remember which updates came in after S3 started copying — it keeps a list of all the updates that have happened but have not been propagated to S3 yet. At some point S3 is done with the copying and tells S2: okay, I'm ready to become the tail, I have the whole state. It sends a message to S2 saying, I want to become the tail, and S2 responds: that's okay, but first apply all of these updates — and sends the pending updates along in response. S3 applies the updates and then becomes the tail, and clients that were talking to S2 can be told by S2: from now on I'm not the tail anymore, you should talk to S3. So they switch over. That's how you add a replica to a chain.

Student: A question on this: don't you run into an infinite-loop problem, where S2 sends updates to S3, and while S3 is applying them it's also serving more requests, and so it has more updates to send, and it goes back and forth?

Instructor: No. Once S2 has sent S3 the updates that S3 has not seen yet, from then on it's normal chain replication: whenever S2 gets an update from S1, it forwards it to S3.

Student: Right, but S3 can't become the tail until it has successfully processed all of those updates.

Instructor: Yes — once the TCP channel is set up, S2 can just say: once you have processed these, you can become the tail, because then you've seen everything; everything else can be pipelined after that on the same TCP channel.

Student: It could become the tail right away — even before it has processed the updates — as long as it doesn't serve requests?

Instructor: As long as it doesn't serve requests, exactly right. It just has to process all the updates that S2 received and S3 did not; once it has applied those, it becomes the tail and starts processing requests.

Student: I see, so it blocks requests for a moment while it processes the new updates. Got it.

Instructor: Exactly.
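A sketch of that catch-up step, with all the locking and networking omitted — the types and fields here are hypothetical, and Update is the type from the write-path sketch above:

```go
package chain

// currentTail stands in for the existing tail: it keeps serving while the
// bulk copy runs and remembers the updates it applies in the meantime.
type currentTail struct {
	snapshot func() map[string]string // bulk copy; may take a very long time
	pending  []Update                 // updates applied since the copy started
}

// joiningServer is the new server being added at the tail end of the chain.
type joiningServer struct {
	kv map[string]string
}

func (n *joiningServer) becomeTail(t *currentTail) {
	// 1. Long-running copy of the whole state; t keeps serving reads and
	//    keeps appending newly applied updates to t.pending.
	n.kv = t.snapshot()

	// 2. "I want to become the tail": t replies with the updates the new
	//    server has not seen, and the new server applies them before it
	//    serves anything.
	for _, u := range t.pending {
		n.kv[u.Key] = u.Val
	}

	// 3. Only now does n answer reads as the tail; t stops acting as the
	//    tail and clients are redirected (via the configuration service).
}
```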
Okay, so now I want to come back to a question a lot of people asked: how do the chain replication (CR) properties compare — what are the good properties — mostly in comparison to Raft? Of course, I have to say up front that chain replication implements only the primary/backup scheme, not the configuration service; we'll come back to that in a bit more detail. But a couple of things we can note if we just compare the way the Raft protocol works with the chain replication protocol.

First, a positive aspect of chain replication is that the client RPCs are split between the head and the tail. The load of serving client operations can be split between the two of them; they don't all have to run through the leader, as in Raft.

Furthermore, the head sends each update once. Unlike in Raft, where the leader sends the log entries to every peer, in this scheme the head sends basically one RPC, so there are fewer messages involved.

Reads — query operations — involve only the tail. In Raft, even if you implement the read-only optimization, which avoids having the read operation go through the log and be appended at all the peers, the leader still has to contact a majority of the peers to decide whether the read can be served.

And another positive aspect is the simple crash recovery we just talked about.

But a major downside, compared to the Raft scheme, is that one failure requires a reconfiguration. The reason a reconfiguration is required is that a write has to go through the whole chain, and the write cannot be acknowledged until every server in the chain has processed it. That's different from Raft, as you well know: as soon as a majority of the peers have accepted a particular write operation and appended it to their logs, the system can proceed, so there's no interruption at all if one server fails, as long as the remaining servers still form a majority. In chain replication, if one server fails, a reconfiguration has to happen, which means there's going to be a short period of, probably, downtime. Does that make sense?

Now I want to make one more point, again in contrast to the Raft replication scheme: because read operations involve only one server, there's a cool extension that gets really high read performance. The basic idea is as follows.
68:34 Now I want to make one more point, in contrast to the Raft replication scheme: because read operations involve only one server, there is a cool extension that gets really high read performance. The basic idea is as follows.

69:10 The basic idea is to split the objects, or volumes as they are called in the paper, across multiple chains. Instead of having one chain, as on the previous boards, we are going to have multiple chains. For example, we might have chain 1, where S1 is the head, S2 is the middle server, and S3 is the tail. In chain 2 we rotate things around: S2 is the head, S3 is the middle server, and S1 is the tail. And in chain 3, S3 is the head, S1 is the middle server, and S2 is the tail. Then we split the objects across these chains: the configuration server has a map saying that objects in shard 1 go to chain 1, objects in shard 2 go to chain 2, and objects in shard 3 go to chain 3.

70:27 The cool part is that we now have multiple tails: S3 is the tail for one chain, S1 is the tail for another, and S2 is the tail for a third. Read operations for these different chains can be executed completely in parallel, so if the reads hit the different shards roughly uniformly, read throughput increases linearly with the number of tails; in this case we have three tails, so we get three times the read performance.

71:01 So we get a bit of the same property that ZooKeeper had, where read performance can be excellent and scale with the number of servers. But we don't only get the scaling: we also maintain linearizability. In this scheme we don't have to give up on it. So we get both nice properties, namely good read performance that scales with the number of servers, at least for reads that go to different chains, and we keep linearizability.

71:43 Any questions about this?

71:49 Sorry, in this case, when the clients are deciding which chain to read from, can they decide that themselves, or do they need to contact the configuration server?

72:05 That is a great question. The paper isn't really explicit about it; they talk about maybe going through a proxy to the servers. What you will do in lab 4 is download the configuration, which includes the shard assignment, from the configuration server.
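As a rough sketch of the kind of configuration a client might download in a lab-4-style design: shards map to chains, the same three servers appear in rotated head/middle/tail roles, and a client sends writes to the head and reads to the tail of the chain that owns the shard. The type and field names (Config, Chain, ShardToChain, and so on) are invented here; neither the paper nor the lab prescribes this exact structure.

```go
// Sketch of a downloadable configuration: shards map to chains, and chains
// reuse the same three servers in rotated head/middle/tail roles.
// All names here are invented for illustration.
package main

import "fmt"

type Chain struct {
	Head, Middle, Tail string
}

type Config struct {
	Num          int           // configuration / view number
	Chains       map[int]Chain // chain id -> server roles
	ShardToChain map[int]int   // shard id -> chain id
}

func (c Config) HeadFor(shard int) string { return c.Chains[c.ShardToChain[shard]].Head }
func (c Config) TailFor(shard int) string { return c.Chains[c.ShardToChain[shard]].Tail }

func main() {
	cfg := Config{
		Num: 1,
		Chains: map[int]Chain{
			1: {Head: "S1", Middle: "S2", Tail: "S3"},
			2: {Head: "S2", Middle: "S3", Tail: "S1"},
			3: {Head: "S3", Middle: "S1", Tail: "S2"},
		},
		ShardToChain: map[int]int{1: 1, 2: 2, 3: 3},
	}
	// Reads for different shards land on different tails and can proceed in parallel.
	for shard := 1; shard <= 3; shard++ {
		fmt.Printf("shard %d: writes -> %s, reads -> %s\n",
			shard, cfg.HeadFor(shard), cfg.TailFor(shard))
	}
}
```

Because each chain has a different tail, reads for different shards land on different servers and can proceed in parallel, which is where the roughly linear read scaling comes from.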
72:28 You would need to be careful about how you order the servers in each of the chains, to prevent a particular chain, or a particular link between two servers, from being oversaturated.

72:38 Yeah, this scheme doesn't really take that into account. You can imagine that the configuration manager has a sophisticated model of how the network is actually laid out and is very careful about how the chains are arranged, maybe even assigning more shards to one chain and fewer shards to another. All of that is possible in principle, because the configuration manager can compute any assignment it likes and simply announce, here is the assignment. It can even rebalance if it wants to.

73:08 Thank you.

73:18 Could you explain again how linearizability is kept under this extension?

73:21 Well, nothing has really changed. We are still doing primary-backup using a chain, so we carry over the linearizability from the single chain, and that's it.

73:42 This might be speculative, but how does this compare to, I guess maybe it's equivalent to, having a group of servers for each step in the chain, instead of reusing the same servers and entering from different points? So S1 would be three servers, S2 would be three servers, and so on.

74:06 What would be the advantage of the scheme you are imagining?

74:10 Just scalability, while also maintaining linearizability.

74:17 Well, the reason this scheme is attractive is that the tail may carry quite a bit of load while the middle server doesn't, and by using this arrangement we spread the load across all the servers.

74:30 I see, okay.

74:40 Okay, good. Maybe I will summarize here. We saw approach one, which we do in lab 3: we run all the operations through Raft, and the configuration and the replication are all built using Raft, with nothing else involved. Then there is approach two, which is the topic of this particular paper: a configuration server, perhaps built using Raft or Paxos or what have you, plus a primary-backup replication scheme, here primary-backup using chain replication.

75:39 Hopefully this lecture makes it clear that there are some attractive properties to approach two, in the sense that you can get scalable read performance on the primary-backup groups. Of course not on the configuration server, because it runs Raft just as you do in approach one, but you get, at least potentially, scalable read performance for the operations on the replicas, that is, on the primary-backup scheme, like the put and get operations. The other nice thing is that if your data is very large, you can use more specialized synchronization schemes to copy the state from one machine to another.
76:26 Chain replication, or really any primary-backup scheme that is separated from the configuration server, lets you do that easily. So it is quite common in practice that people adopt approach two, although it is also not impossible to use approach one for your replicated state machine, including servicing operations like put and get; that is in fact what you are doing in lab 3. We will see a paper later in the semester, Spanner, that uses Paxos to perform the operations as well.

77:03 Any further questions? If not, then I wish you all good luck on the midterm on Thursday, and I will see you in person, well, virtually in person, next week. If you have any questions, please feel free to hang around and I will do my best to answer them.

77:37 I have a question about something you mentioned about Raft. You mentioned that all of the reads have to go through a majority of servers, but I am not quite sure I understand why, because the leader has all of the committed entries, right?

77:52 There are two schemes. Either you run in the situation where all reads are served by the leader, or, in principle, you can serve a read operation from another peer, but then you have to contact at least a majority of the servers to make absolutely sure that you have the latest operations.

78:14 Got it. So that requirement is only if we want to spread the reads across every peer; then we have to be more sophisticated, and we cannot just answer on our own, because that would directly violate linearizability. But if everything goes to the leader, we're fine?

78:31 Then you are golden, correct, except that you have to do the trick where the leader reaches an empty agreement, a no-op, at the beginning of every new term, just to make sure that it actually is up to date.

78:45 Okay, thank you.
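As a hedged illustration of that answer, here is a small sketch of the two options for serving reads linearizably from a Raft-based service: a leader may answer from its own state once it has committed an entry (for example the start-of-term no-op) in its current term, and otherwise it, or any other peer, has to confirm with a majority before answering. The struct and method names are made up and the majority check is only a stub; this is not the lab's API, and it glosses over details of the real read-only optimization.

```go
// Sketch of the two ways just described for serving reads from a Raft-based
// service. All names are invented, and the majority round trip is only a stub.
package main

import "fmt"

type raftPeer struct {
	isLeader          bool
	currentTerm       int
	lastCommittedTerm int // term of the most recently committed log entry
}

// Scheme 1: the leader answers reads from its own state, but only after it has
// committed an entry in its current term (the start-of-term no-op), so it knows
// its state reflects every write acknowledged by earlier leaders.
func (p *raftPeer) canAnswerLocally() bool {
	return p.isLeader && p.lastCommittedTerm == p.currentTerm
}

// Scheme 2: otherwise the peer first confirms with a majority that it is not
// behind before answering. Here this is just a stand-in for that round trip.
func (p *raftPeer) confirmWithMajority() bool {
	return true // stand-in for a heartbeat / quorum exchange
}

func (p *raftPeer) read(key string, state map[string]string) (string, bool) {
	if p.canAnswerLocally() || p.confirmWithMajority() {
		return state[key], true
	}
	return "", false
}

func main() {
	leader := &raftPeer{isLeader: true, currentTerm: 7, lastCommittedTerm: 7}
	v, ok := leader.read("x", map[string]string{"x": "1"})
	fmt.Println(v, ok)
}
```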
78:49 Could you quickly go over again what happens when you are adding a new server at the tail? Just to make sure I understand: essentially it starts a process of copying all the data from S2 to S3, and if S3 receives requests for any of that data while the copy is still happening, then S3 asks S2 directly for anything S2 still has, gets it, and responds; and it keeps doing that until it gets to data that S2 no longer has outstanding, and then it just goes live, essentially?

79:21 Yep, although you could do it slightly differently: S3 could tell S2 that S3 is becoming the tail, and then simply not process any operations from clients yet until it has received the remaining operations from S2.

79:46 Oh, so in that case S2 is still the tail?

79:51 Yeah, until, you know, S3 gets everything.

79:54 Okay, thank you.

79:57 The paper basically describes one particular way of doing it; there are a couple of ways of doing it.

80:04 But if you do it that way, how long do you wait to get everything? I think I have the same confusion as someone else.

80:16 Well, you know in what order the switch happens. For example, say S2 has update operations through 100. You start the copy operation, much like with snapshots in Raft, and when the copy operation is done, S3 is up to date through 100. By then S2 may already have done ten more operations, so it has 101, 102, and so on up to 110. S3 then contacts S2 and says, give me your remaining operations, and S2 says, my remaining operations are 101 through 110; as a side effect, S3 also tells S2 to stop being the tail. S2 responds with those operations, S3 applies 101 through 110, and then it answers clients. In the meantime it is the tail, but it does not process any commands or read operations from clients until it has actually processed 101 through 110.

81:21 Okay, I see.

81:33 My question is a little similar to the extension he talked about. Could you do a tree instead of a chain?

81:46 I think there are other data structures possible. For example, a number of people proposed in their emails that you could have S1, then intermediate servers, say S2 through S5, that S1 talks to in parallel, and then the intermediate servers all talk to the tail. Is that what you mean by a tree?

82:17 I meant more that there would be a number of leaves, all at roughly the same height, like a balanced tree, and then the leaves would have a chain going through them. I think linearizability can be broken here if you think harder about it, but it would have the nice property that the propagation delay would be logarithmic instead of linear, as it is here, and you could read from all the leaves.

82:54 Reading from all the leaves is dangerous, correct? Because one client might have talked to another leaf earlier, and those leaves might not be in sync, so that sounds dangerous to me; but maybe your scheme is a little more sophisticated than what I am thinking of. The depth of the tree, or the depth of the chain, is really governed by the mean time between failures: if you typically run with three, or three to five, servers because that is good enough for your availability, since you can recover from failed servers before the whole chain is down, then that really governs the depth of the chain. And yes, that will introduce some latency, but chains will generally be short.

83:49 Right, okay, that makes sense. Thanks.
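Returning to the tail-addition walkthrough from a couple of questions back (copy a snapshot through operation 100, then fetch 101 through 110 from the old tail before serving clients), here is a minimal sketch of that catch-up. Everything lives in one process, with slices standing in for the snapshot transfer and the RPCs, and all names are invented; the paper describes this at a higher level and, as noted above, there are several reasonable variants.

```go
// Minimal sketch of bringing a new tail (S3) up to date, along the lines of
// the walkthrough above. Names and structure are invented for illustration.
package main

import "fmt"

type op struct {
	seq int
	cmd string
}

type node struct {
	name string
	ops  []op // applied update operations, in order
}

// snapshotUpTo hands over a copy of everything applied so far (ops 1..n).
func (n *node) snapshotUpTo() []op { return append([]op(nil), n.ops...) }

// remainingAfter returns the operations the old tail applied after the
// snapshot was taken (e.g. 101..110); this is also the point at which the
// old tail would be told to stop acting as the tail.
func (n *node) remainingAfter(seq int) []op {
	var rest []op
	for _, o := range n.ops {
		if o.seq > seq {
			rest = append(rest, o)
		}
	}
	return rest
}

func main() {
	s2 := &node{name: "S2"}
	for i := 1; i <= 100; i++ {
		s2.ops = append(s2.ops, op{seq: i, cmd: fmt.Sprintf("put-%d", i)})
	}

	// 1. Copy the snapshot to the new tail; S2 keeps serving in the meantime.
	s3 := &node{name: "S3", ops: s2.snapshotUpTo()}
	for i := 101; i <= 110; i++ { // S2 applies ten more updates during the copy
		s2.ops = append(s2.ops, op{seq: i, cmd: fmt.Sprintf("put-%d", i)})
	}

	// 2. S3 asks S2 for whatever it is still missing, applies it, and only then
	//    starts answering client reads as the new tail.
	s3.ops = append(s3.ops, s2.remainingAfter(len(s3.ops))...)
	fmt.Printf("%s is caught up through op %d\n", s3.name, s3.ops[len(s3.ops)-1].seq)
}
```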
83:58 Is this the only case where the entire chain would go down, if all of the servers in the chain went down?

84:04 Yep.

84:07 Thank you.

84:10 I was also curious how you maintain strong consistency when S1, S2, and S3 can all serve reads, as on this slide.

84:24 You get strong consistency per shard, or per object that is assigned to a chain. If you write object 1 and then read object 1, all of those operations go through the same chain, so you get strong consistency for that particular object.

84:40 Oh, got it. But that may not mean that across all the objects we have strong consistency?

84:46 Let me hesitate on that; I think it may require more machinery. What does it mean, across all the objects? Say you read and write object 1, then you read and write object 2, and then some client reads both object 1 and object 2: is that client guaranteed to see a total order, that is, linearizability across them? Serializability is slightly different; let's not talk about serializability, we will get to that in a couple of weeks. I don't want to make a commitment right now; I need to think about it a little.

85:45 Okay, that's totally fair.

85:48 So the question is: you have linearizability for a single object, even with multiple clients doing operations on that same object; that is guaranteed in this scheme. The question is whether you also have linearizability across objects.

86:14 But why is that important? I don't see where that would matter, because you can't group operations, right?

86:28 Right, but, I mean, you read object 1, you write object 1, then you read object 2, you write object 2, and for linearizability those operations need to be in a total order that preserves the real-time property. Since different chains are involved here, that might actually not hold; but I don't want to commit to a statement about it across chains. Within a chain it is absolutely guaranteed to be linearizable, even if you have different objects on that chain.

87:08 There's something I don't understand in the paper, which is the update propagation invariant, where, going along the order of the chain, the committed operations at one server are a prefix of its successor's commits. Is that guaranteed only after a full pass has gone through the chain?

87:34 Well, it's always true. If you go back to this picture here, I think they are making a very simple observation.
87:44 Let me see if I can find a good picture; I have probably scribbled over everything, so it may not be as clean. Basically what they are saying, if you look at this figure, is that S3 always has a prefix of S2's history, and S2 always has a prefix of S1's history. That is the only thing the invariant says.

88:09 Oh, so the successor has a prefix of the predecessor.

88:17 Yeah. And this is slightly confusing; I only realized it later, after somebody else asked this question. In the paper, i and j show up in two places, once in a definition and once in the invariant itself, and you have to be a little careful, because the roles of i and j are the other way around between the two.

88:42 Yeah, exactly. Thank you.

88:45 You're welcome.

88:49 Sorry, go ahead.

88:54 I was just going to ask: what happens when you have a network partition instead of a crash? If you go to the crash slide, what happens to the chain if there is a network partition? Say S2 is actually still alive, but there is a partition between the configuration manager and S2, and so now both S1 and S2 are pointing to S3.

89:23 Presumably, and I think the paper doesn't talk about this, but I presume that all configurations are numbered, like a view number, and S3 will not accept any commands from S2 if the view numbers don't match.

89:39 Got it, thank you. Related to that, one thing I couldn't figure out is, even with configuration numbers or something, how do you make sure, when you get rid of the tail as in the third scenario you have drawn, that all the clients that might issue a read are aware that this old server is no longer the tail?

89:58 I think the way you would do it is that when the clients download the configuration from the configuration manager, it also includes the view number, and every operation includes the view number, so S3 will see, hey, that's an old view number, I won't talk to you.

90:13 I guess, when does the client then talk to the configuration server to get the new view number?

90:22 For example, the server could just reply, retry, and the client would then go back to the configuration server and re-read the state.

90:32 I guess what I am worried about is this: S3 has been partitioned away from the coordinator, so the coordinator removes S3 as the tail and increases the version number, but some client out there doesn't find out that the version number increased, still thinks S3 is the tail, talks to S3, and does a read, while meanwhile other clients are doing writes to S1 and S2 that S3 hasn't seen.

90:56 Yeah, that is probably the reason why, in the paper, requests go through the proxy.

91:02 I see, okay.
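Here is a small sketch of the view-number check being assumed in that exchange. The paper itself does not spell this mechanism out; the sketch just assumes, as in the discussion above, that every configuration carries a number and that every client request is tagged with the number the client last downloaded. All names are invented.

```go
// Sketch of a view-number check: a server rejects any request tagged with a
// configuration number different from its own, so a client holding a stale
// configuration is told to re-fetch it rather than getting a stale answer.
package main

import (
	"errors"
	"fmt"
)

var ErrStaleConfig = errors.New("stale configuration: re-fetch from the configuration manager")

type server struct {
	configNum int // the configuration number this server is currently acting under
	isTail    bool
}

func (s *server) handleRead(clientConfigNum int, key string, read func(string) string) (string, error) {
	if clientConfigNum != s.configNum || !s.isTail {
		return "", ErrStaleConfig
	}
	return read(key), nil
}

func main() {
	s3 := &server{configNum: 2, isTail: false} // S3 was removed as tail in configuration 2
	if _, err := s3.handleRead(1, "x", func(string) string { return "old" }); err != nil {
		fmt.Println("client must retry:", err)
	}
}
```

As the question points out, this check alone does not help if both the client and the removed tail are still on the old configuration, which is one reason to route requests through a proxy that tracks the current configuration.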
91:07 I have a question going back to the earlier question about cross-object linearizability. Is that a whole other can of worms that we haven't really talked about? If you want to do, I don't know what the right term is, transactions across multiple pieces of state, like an operation where you set a to 1 and b to 2 and you should only see those together or not at all, the atomicity of that, have we talked about that in any of the things we've seen before?

91:43 No, and we will talk about it within a couple of weeks. That is going to be a big topic: basically, how to do transactions.

91:50 Okay, that's good, thanks.

91:54 Do you mind going back to the third slide, or maybe the fourth? I was a little bit confused when you mentioned that if the lock holder fails, there is intermediate state, and so on. What exactly on this slide applies to z-locks and what applies to Go locks?

92:22 These are almost all statements about z-locks.

92:29 So is the first statement that if the lock holder fails, the intermediate state is not cleaned up, or is cleaned up?

92:37 The intermediate state is visible, but then, for example, if you have a leader election, you could clean up that intermediate state; that was the point.

92:46 Oh, so with Go locks, is that not also the case? If there is a machine that is holding a Go lock and doing stuff, and then it all of a sudden dies, isn't the intermediate state still visible?

92:59 What I am talking about with Go locks is something that concerns multiple threads running on the same machine. If the Go lock disappears because the machine crashes, all the threads on that machine crash too.

93:20 Right, but when you say that the intermediate state is visible to other people, isn't that still true for Go locks?

93:30 Well, only if they had written persistent state to disk or into some shared file system; otherwise no, the machine is gone, the disk is gone, everything is gone.

93:40 Oh, got it. Okay, so it's saying that the intermediate state is persistent?

93:45 The ZooKeeper intermediate state might be visible, right: the lock holder might not have gotten around to deleting things, it might have created some more files, and those are visible now.

93:54 Okay, I see, thank you. So just to follow up on that: is the implication that if a goroutine ever dies while holding a lock, the entire Go program must have died too, and you can never have a goroutine die holding a lock while other parts of the program continue?

94:13 Right, the goroutine crashes and the application crashes.
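As a tiny illustration of that last point (this example is mine, not from the lecture): an unrecovered panic in any goroutine terminates the whole Go process, so a goroutine cannot die while holding a sync.Mutex and leave the rest of the program running to observe its half-finished in-memory state. Only state that was already written somewhere persistent or shared, as with the ZooKeeper znodes above, remains visible.

```go
// An unrecovered panic in any goroutine takes down the entire Go process, so a
// goroutine cannot "die while holding a sync.Mutex" and leave the rest of the
// program running to see half-finished in-memory state.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex

	go func() {
		mu.Lock()
		// Simulate a crash while holding the lock. There is no recover(), so the
		// runtime terminates the whole process, not just this goroutine.
		panic("worker died while holding the lock")
	}()

	time.Sleep(100 * time.Millisecond)
	mu.Lock() // never reached: the process has already exited
	fmt.Println("unreachable")
}
```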