字幕記錄


00:01
alright last time I started talking
00:05
about linearize ability and I want to
00:06
finish up this time the reason why we're
00:09
talking about it again is that it's our
00:11
kind of standard definition for what
00:14
strong consistency means in storage
00:19
style systems so for example lab 3 is a
00:23
needs to obey your lab 3 needs to be
00:28
linearizable and sometimes this will
00:31
come up because we're talking about a
00:32
strongly consistent system and we're
00:34
wondering whether a particular behavior
00:35
is acceptable and other times linearize
00:39
ability will become come up because
00:40
we'll be talking about a system that
00:42
isn't linearizable and we'll be
00:44
wondering you know in what ways might it
00:47
fall short or deviate from linearize
00:49
ability so one thing you need to be able
00:51
to do is look at a particular sequence
00:55
of operations a particular execution of
00:59
some system that executes reads and
01:00
writes like your lab 3 and be able to
01:03
answer the question oh was that was that
01:04
Stevens of operations I just saw
01:06
linearizable or not I'm so we're going
01:10
to continue practicing that a little bit
01:11
now plus I'll try to actually establish
01:16
some interesting facts that will be
01:17
helpful for us about what it means about
01:19
the consequences for the systems we
01:21
build and look at linearize ability is
01:26
to find on particular operation history
01:28
so always the thing we're talking about
01:31
is oh we observed you know a sequence of
01:34
requests by clients and then they got
01:36
some responses at different times and
01:39
they asked for different different you
01:41
know to read different data and got
01:44
various answers back you know is that
01:46
history that we saw linearizable ok so
01:50
here's an example of a history that
01:53
might or might not be linearized able so
01:55
let's suppose at some point in time some
01:57
client groups of times gonna move to the
01:59
right this vertical bar marks the time
02:02
at which a client sent a request I'm
02:05
gonna use this notation to mean that the
02:08
request is a write and asks to set
02:12
variable or key or whatever x2 value 0
02:17
so sort of a key and a value this would
02:20
correspond to a put of key X and by zero
02:23
in lab 3 and then this is sort of we're
02:28
watching what the client send the client
02:29
sent this request to our service and at
02:32
some point the service responded and
02:34
said yes you're right is completed so
02:37
we're assuming the services of a nature
02:38
that actually tells you when the write
02:40
completes otherwise the definition isn't
02:43
very useful ok so we have this request
02:46
by somebody to write and then I'm
02:49
imagining in this example there's
02:51
another request that because I'm putting
02:54
this mark here this means the second
02:56
request started after the first request
02:59
finished and and you know reason why
03:01
that's important is because of this rule
03:04
that linearizable history must match
03:07
real time and what that really means is
03:10
that requests that are known in real
03:12
time to have started after some other
03:15
request finished the second request has
03:18
to occur after the first request in
03:20
whatever order we work out that's the
03:22
proof that the history was a linearized
03:24
linearize available ok so in this
03:27
example I'm imagining there's another
03:29
request that asks to write X to have
03:31
value 1 and then a concurrent request
03:36
may be started a little bit later as to
03:39
set X to 2 I said now we have two maybe
03:44
two different clients issued requests at
03:46
about the same time to set X to two
03:47
different values so of course we're
03:49
wondering which one is going to be the
03:51
real value and then we also have some
03:54
reads if all you have is writes well
04:01
well you have us right so it's it's hard
04:04
to say too much about linearizable
04:05
linearize ability because you don't know
04:07
you don't have any proof that the system
04:10
actually did anything or revealed any
04:12
values so we really need reads so let's
04:17
imagine we have some read unless you'll
04:21
be seeing our in the history
04:23
that a client said to read at this time
04:27
and the second time it got an answer for
04:30
red key accent got value to so
04:34
presumably actually saw this value and
04:37
then there was another request by maybe
04:39
the same client or a different client
04:40
but known to have started in time after
04:43
this request finished and this read of X
04:46
got value while and so the question in
04:50
front of us is is this history
04:52
linearizable and there's sort of two
04:55
strategies we can take we can either
04:58
cook up a sequence because if we can
05:00
come up with a total order of these five
05:03
operations that obeys real time and in
05:06
which each read sees the value written
05:09
by the priest most recently proceeding
05:12
right in the order if we can come up
05:13
with that order then that's a proof the
05:15
history is linearizable another strategy
05:17
is to observe that these rules each one
05:23
may imply certain this comes before that
05:26
edges in a graph and if we can find a
05:29
cycle in this operation must come before
05:32
that operation we can find a psych on
05:33
that graph and that's proof that the
05:35
history isn't linearizable and for small
05:39
histories we may actually be able to
05:40
enumerate every single order and use
05:42
that show this history isn't
05:45
linearizable anyway any any any thoughts
05:51
about whether this might be or might not
05:53
be linearizable
06:00
yes
06:08
yes okay so the observation is that um
06:11
it's a little bit troubling that we saw
06:15
read with IU - and then the read with
06:17
value want and maybe that contradicts
06:21
you know there were two rights one with
06:23
value one on one value - so that so we
06:25
certainly if we had to read with value
06:26
three that would obviously be something
06:28
I got terribly wrong you know but we got
06:31
there were a right of one in two and a
06:32
read of one and two so the question is
06:34
whether this order of reads could
06:36
possibly be reconciled with the way
06:39
these two rights show up in the history
06:59
okay so what I'm what I'm the game we're
07:04
playing is that we have a like maybe two
07:07
clients or three clients and they're
07:08
talking some service you know maybe a
07:10
raft last year something and what we are
07:12
seeing is requests and responses right
07:15
so what this means is that we saw
07:18
requests from a client to write X to the
07:21
you know put requests for X and one and
07:23
we saw the response here so what we know
07:25
is that somewhere during this interval
07:27
of time presumably the service actually
07:29
internally change the value of x - 1 and
07:32
what this means is that somewhere in
07:34
this interval of time the service
07:38
presumably changed its internal idea of
07:42
the value of x - 2 somewhere in this
07:44
time but you know it's just somewhere in
07:47
this time it doesn't mean it happened
07:48
here or here does that answer your
07:52
question
07:53
yes
08:07
yes okay so the observation is that is
08:10
linearizable and it's been accompanied
08:13
by an actual proof of the linearize
08:15
ability namely a demonstration of the
08:17
order that shows that it is linearizable
08:19
and the order is yes it's linearizable
08:25
and the order is first right of X with
08:32
value 0 and the server got both of these
08:36
rights at roughly the same time it's
08:38
still had to choose the order itself all
08:40
right so let's just say it could have
08:43
executed the right of x2 value 2 first
08:45
and then the read of X then executed the
08:54
read of X which would the first read of
08:58
X which at that point would yield 2 and
09:01
then we're gonna say the next operation
09:03
had executed it was the right of X to 1
09:05
and then the last operation in the
09:08
history is the read of X to 1 and so
09:16
this is proof that the history is
09:17
linearizable because here's an order
09:19
it's a total order of the operations and
09:22
this is the order it matches real time
09:24
so what that means is well just go
09:29
through it the the right of X to 0 comes
09:31
first and that's that's totally
09:32
intuitive since it's actually finished
09:33
before any other operations started the
09:37
right of X to 1 comes sorry the rate of
09:40
X to 2 comes second so we're gonna say
09:42
maybe that I'm gonna mark here that sort
09:46
of real time at which we imagine these
09:48
operations happen to demonstrate that
09:50
the order here does match real time so
09:52
it'll say I'll just write a big X here
09:54
to mark the time when we imagine this
09:56
operation happened all right so that's
09:59
the second operation then we're
10:01
imagining that the next operation is the
10:03
read of X of 2 we you know there's no
10:07
real time problem because the read of X
10:08
of 2 actually was this u concurrently
10:11
with the right of X of 2 you know it's
10:13
not like the right of X the read of X of
10:14
2 finished and only then did the right
10:16
of X right of X with to start there
10:18
really are concurrent we'll just imagine
10:20
that that
10:21
sort of point in time at which this
10:23
operation happened is right there so
10:25
this is the you know we don't care when
10:27
this one happened let's just say there's
10:28
the first operation second third now we
10:33
have a right of X of one let's just say
10:36
it happens here in real time just has to
10:39
happen after the operations that occur
10:42
before it in the order so that will say
10:43
there's the fourth operation and now we
10:45
have the read of x1 and it can pretty
10:47
much happen at any time but let's say it
10:48
happens here okay so this is the
10:52
Diamonds
10:52
so we have the order this is the
10:54
demonstration that the order is
10:56
consistent with real time that is we can
10:58
pick a time for each of the operations
11:00
that's within its start and end time
11:02
that would cause this total order to
11:06
match our real time order and so the
11:08
final question is did each read see the
11:12
value written by the most closely
11:14
preceding right of the same variable so
11:16
there's two V's this read preceded by a
11:20
right with that correct value so that's
11:21
good and this read is preceded by a
11:23
right most closely preceded by a right
11:26
of the same value also okay so this this
11:32
is a demonstration that this history was
11:34
linearizable and you know the you know
11:39
depends on what you thought when you
11:40
first saw the history but it's not
11:42
always immediately clear that set up
11:44
this complicated is you know it's easy
11:47
to be tricked when looking at these
11:50
histories which do you think oh the
11:51
right of x1 started first so we just
11:53
sort of assumed that the first value
11:56
written must be one but that's actually
11:57
not required here any questions about
12:03
this
12:15
if the you mean if these two were moved
12:18
like this the okay so if if if this if
12:23
the right with value to was only issued
12:26
by the client after the read of accent
12:30
value to returned that wouldn't be
12:33
linearizable because in whatever order
12:37
you know any order we come up with has
12:39
to obey the real-time ordering so any
12:41
order we come up with would have had to
12:42
have the read of X with to precede the
12:46
right of X with 2 and since there's no
12:48
other right of X of 2 insight here that
12:52
means that a read at this point could
12:55
only see 0 or 1 because those are the
12:57
only other 2 rights that could possibly
12:59
come before this read so moving you know
13:03
shifting these that much makes the would
13:06
make the example not linearizable
13:19
yes
13:24
I'm saying that the first vertical line
13:27
is the moment the client sends the
13:29
request and the second vertical line is
13:32
the moment the client receives the
13:34
request yes yeah yeah so this is a very
13:39
client centric kind of definition it
13:42
says you know clients should see the
13:44
following behavior and what happens
13:46
after us send a request in maybe there's
13:48
a lot of replicas maybe a complicated
13:49
network who knows what it's almost none
13:51
of our business we're only the
13:53
definition is only about what clients
13:56
see there's some gray areas which we'll
13:59
come to in a moment like if the client
14:01
should need to retransmit a request then
14:03
we also have to you know that's
14:05
something we have to think about okay so
14:12
this one is linearizable here's another
14:16
example I'm actually going to start out
14:19
with it being almost identical I'm gonna
14:22
start out with you being identical for
14:23
the first example so again we have a
14:25
right of X with 0 we have these two
14:28
concurrent rights and we have the same
14:38
two reads those are so far identical to
14:52
the previous example so therefore we
14:54
know this must be this alone must be
14:56
minimal but I'm going to add let's let's
14:58
imagine that client 1 issued these two
15:02
requests the definition doesn't really
15:04
care about clients but her own sanity
15:06
will assume client 1 red X and saw two
15:08
and then later red X and saw one but
15:11
that's okay so far I say there's another
15:13
client and the other client does a read
15:17
of X and it sees a 1 and then the other
15:23
client is a second read of X and it sees
15:25
- so this is linearizable and we either
15:31
have to come up with an order
15:33
or this comes before that graph that has
15:39
a cycle in it
15:50
so you know that thing this is getting
15:54
at the puzzle is if one client saw
15:57
there's only two rights here so they you
15:59
know in any order or one of the rights
16:01
comes first or the other rate comes
16:03
first and intuitively client one
16:08
observed that the right with value to
16:10
came first and then the right of value
16:14
one right these two reads mean that has
16:19
to be the case that in any legal order
16:21
of the right of two has to come before
16:23
the right of one in order for the climb
16:25
when to have seen this and it's the same
16:27
order we saw over here but symmetrically
16:31
client one's experience clearly shows
16:33
the opposite right sorry huh fine to
16:39
client who's experience was the opposite
16:41
clients to saw the right of one first
16:44
and then the right with value too and
16:47
one of the rules here is that there's
16:52
just one total order of operations not
16:56
allowed to have different clients see
16:58
different histories or different
17:00
different progressions evolutions of the
17:03
values that are stored in the system
17:05
there can only be one total of order
17:07
that all clients have to experience
17:09
operations that are consistent with the
17:11
one order and if one this one client
17:16
clearly implies that the order is right
17:18
- and then right one and so we should
17:21
not be able to have any other client who
17:23
observes proof that the order was
17:25
anything else which is what we have here
17:29
and so that's a bit of a intuitive
17:34
explanation for what's going wrong here
17:37
and and by the way the reason why this
17:38
could come up in the systems that we
17:41
build and look at is that we're building
17:43
replicated systems either you know raft
17:46
replicas or maybe systems with caching
17:48
in them but we're building systems that
17:50
have many copies of the data and so
17:52
there may be many servers with copies of
17:54
X in them possibly with different values
17:57
at different times right if they haven't
17:59
gotten the commits yet or something some
18:01
replicas may have one value some may of
18:03
the other
18:03
but in spite of that if our system is
18:07
linearizable or strongly consistent it
18:09
must behave as if there was only one
18:13
copy of the data and one linear sequence
18:16
of operations applied to the data that's
18:18
why this is an interesting example
18:20
because this could come up in a sort of
18:22
buggy system that had two copies of the
18:25
data and one copy executed these rights
18:27
in one order and the other replicas
18:28
executed the rights in the other order
18:30
and then we could see this and linearize
18:32
ability says no we can't see that we're
18:34
not allowed to see that in the correct
18:35
system so the the cycle in the graph in
18:39
the this comes before that graph that
18:44
would be a sort of slightly more proof e
18:45
proof that this is not linearizable is
18:47
that the right of two has to come before
18:51
client ones read of two so there's one
18:55
arrow like this so this right has to
18:59
come before that read client ones read
19:05
has to come before the right of X with
19:09
value one otherwise this read wouldn't
19:13
be able to see one right if this you can
19:15
imagine this right might happen very
19:17
early in the order but in that case this
19:20
read of X wouldn't see one it would see
19:21
two since we know this guy saw two so
19:25
the read of X with two must come before
19:28
the right of X with one the right of X
19:32
of one must come before any read of X
19:34
with value 1 because including client
19:37
who's read of X with value 1 but in
19:42
order to get value 1 here and for this
19:44
read to see to the right of X with I too
19:47
must come between in in the order
19:50
between these two operations so we know
19:52
that the read of X 1 must come before
19:55
the right of X 2 and that's a cycle
19:59
alright so there's no there's no Vinnie
20:03
or order or that but there's no linear
20:06
order that can obey all of these time
20:10
and value rules and there isn't because
20:14
there's a cycle in the
20:16
in the graph yes
20:32
that's a good question this this
20:35
definitions the definition about
20:37
history's not about necessarily systems
20:41
so what it's not saying is that a system
20:43
design is linearizable if something
20:46
about the design it's really only
20:50
history by history so if we don't get to
20:53
know how the system operates internally
20:55
and the only thing we know is we get to
20:57
watch it while it executes then before
20:59
we've seen anything we just don't know
21:01
right we mean we'll assume it's
21:02
linearizable and then we see more and
21:04
more sequences of operations this Akash
21:06
they're all consistent with linearize
21:08
ability they all follow these rules so
21:10
you know we believe it's probably this
21:12
isn't linearize of all and if we ever
21:14
seen one that isn't then we realize it's
21:15
not linearizable so this is yeah it's
21:20
not a definition on the system design
21:22
it's a definition on what the what we
21:23
observe the system to do so in that
21:25
sense it's maybe a little bit
21:27
unsatisfying if you're trying to design
21:28
something right there's not a recipe for
21:30
how you design you know except in a
21:32
trivial sense that if you had a single
21:33
server in very simple systems one server
21:37
one copy of the data not threaded or
21:40
multi-core or anything it's a little bit
21:42
hard to build a system that violates
21:43
this in a very simple set up but super
21:46
easy to violate it in any kind of
21:50
distributed system okay so the lesson
21:55
from this is that there's only can only
21:59
be one order in which the system is
22:04
observed to execute the writes all
22:07
clients have to see value is consistent
22:10
with the system executing the writes in
22:13
the same order
22:18
here's a very simple history
22:19
another example supposing we write acts
22:24
with value 1 and then definitely
22:27
subsequently in time maybe with another
22:29
client another client launches a right
22:32
of X with value 2 and sees a response
22:34
back from the service saying yes I did
22:36
the right and then a third client does a
22:38
read of X and gets got you one so this
22:45
is a very easy example it's clearly not
22:47
linearizable because the time rule means
22:50
that the only possible order is the
22:54
right of X with 1 the right of X is 2
22:55
the read of X with 1 so that has to be
22:57
the order and that order clearly
22:59
violates this is the only one order that
23:01
order clearly violates the second rule
23:03
about values that is you know the most
23:06
value written by the most recent right
23:08
in the owned one order that's possible
23:10
is not 1 it's 2 so this is clearly not
23:13
linearizable and the reason I'm bringing
23:18
it up is because this is the argument
23:20
that a linearizable system a strongly
23:23
consistent system cannot serve up stale
23:25
data right and you know the reason why
23:29
this might come up is again you know
23:31
maybe you have lots of replicas each you
23:34
know maybe haven't seen all the rights
23:36
or all the committed rights or something
23:37
so maybe there's some maybe all the
23:40
replicas have seen this right but only
23:42
some replicas have seen this right and
23:43
so if you ask a replica that's lagging
23:45
behind a little bit it's still gonna
23:47
have value 1 for X but nevertheless
23:50
clients should never be able to see this
23:53
old value in a linearizable system are
23:57
there no stale data allowed no still
24:00
reads
24:21
yeah if there's overlap in the interval
24:23
then there's then you know that you
24:26
could the system could legally execute
24:29
either of them in a real-time and I in
24:31
the interval so that's the sense in
24:32
which they could system gonna execute
24:35
them in either order now you know other
24:38
you know if it weren't for these two
24:40
reads the system would have you know
24:43
total freedom execute that writes in
24:45
either order but because we saw the two
24:47
reads we know that you know the only
24:52
legal order is two and then one yeah so
25:01
if the two reserva laughing then and
25:03
then any order then the reads could have
25:04
seen either in fact you know Toby saw
25:07
the two and the one words all from the
25:08
reads these doobies could have you know
25:11
the system until it committed to the
25:13
values for the read it still had freedom
25:15
to return them in either order
25:23
I'm using them as synonyms yeah yeah you
25:28
know for most people although possibly
25:31
not today's paper linearize ability is
25:34
is well defined and and people's
25:37
definitions really deviate very much
25:39
from this strong consistency though is
25:43
less I think there's less sort of
25:45
consensus about exactly what the
25:47
definition might be if you meant strong
25:49
consistency it's often men it's usually
25:53
men too in ways that are quite close to
25:55
this
25:55
like for example that oh the system
25:57
behaves the same way that a system with
26:00
only one copy of the data would behave
26:02
all right which is quite close to what
26:05
we're getting at with this definition
26:06
but yeah for you know it's reasonable to
26:10
assume that strong strong consistency is
26:12
the same as serializable okay so this is
26:18
not linearizable and the you know the
26:21
the lesson is weeds are not allowed to
26:26
return stale data only only fresh data
26:29
or you can only return the results of
26:33
the most recently completed right
26:44
okay I have a final final little example
26:51
so we have two clients one of them
26:54
submits a write to X with value three
26:58
and then write two acts with value 4 and
27:04
we have another client and you know at
27:07
this point in time the client issues a
27:10
read of X but and this is a question you
27:15
asked the the client doesn't get a
27:18
response right you know who knows like
27:21
it in the sort of actual implementation
27:23
may be the leader crashed at some point
27:25
maybe the his client to sent in the read
27:27
request so the leader maybe didn't get
27:29
it because the request was dropped or
27:31
maybe the leader got the request and
27:34
executed it but the response the network
27:36
dropped the response or maybe the leader
27:38
got it and started to process up and
27:40
then crash before finished processing
27:42
and or maybe did process it and crash
27:44
before saying the response who knows
27:45
when the clients point of view like sent
27:48
a request and never got a response so in
27:50
the interior machinery of the client for
27:52
most of the systems we're talking about
27:53
the client is going to resend the
27:55
request maybe do a different leader
27:57
maybe the same leader who knows what so
27:59
it sent the first question quest here
28:01
and maybe it sends the second request at
28:05
this point in time it times out you know
28:07
no response sends the second request at
28:10
this point and then finally gets a
28:12
response it turns out that and you're
28:19
going to implement this in lab 3 that a
28:22
reasonable way of servers dealing with
28:26
repeated requests is for their servers
28:28
to keep tables sort of indexed by some
28:31
kind of unique request number or
28:32
something from the clients in which the
28:35
servers remember oh I already saw that
28:36
request and executed it and this was the
28:38
response that I sent back because you
28:41
don't want to execute a request twice
28:42
you know if it's a for example if it's a
28:45
write request you don't want to execute
28:47
requests right so the server's have to
28:49
be able to filter out duplicate requests
28:51
and they have to be able to return the
28:53
reply to repeat the reply that the
28:56
originally
28:57
sent to that request which perhaps has
28:59
dropped by the network so that servers
29:00
remember the original pry and repeat it
29:04
in response to the resend and if you do
29:07
that which you will in lab 3 then if you
29:12
know since the server the leader could
29:16
have seen value 3 when it executed the
29:18
original read request from client to it
29:20
could return value 3 to the repeated
29:23
requests that was sent at this time and
29:25
completed at this time and so we have to
29:30
make a call on whether that is legal
29:33
right you could argue that oh gosh you
29:38
know the client we sent the request here
29:39
this was after the right of X 2 4
29:41
completed so Jesus what you really
29:42
should return for at this point instead
29:44
of 3 and this is like a little bit a
29:52
question of it's like a little bit up
29:55
the designer but if what you view is
29:58
going on is that the retransmissions are
30:00
a low-level concern that's you know part
30:05
of the RPC machinery or hidden in some
30:07
library or something and that from the
30:08
client applications point of view all
30:10
that happened was that it's sent a
30:12
request at this time and got a response
30:15
at this time and that's all that
30:17
happened from the clients point of view
30:18
then a value of 3 is totally legal here
30:21
because this request took a long time
30:24
it's completely concurrent with the
30:26
right not ordered in real time with the
30:28
right and therefore either the three or
30:31
the four is valid you know as if the
30:34
read requests that really executed here
30:36
in real time or or here in real time so
30:39
the larger lesson is if you have client
30:42
retransmissions the from the application
30:47
point of view if you're defining
30:48
linearize ability from the applications
30:50
point of view - even with
30:53
retransmissions the real time extent of
30:56
the requests like this is from the very
30:59
first transmission of the requests to
31:01
the final time at which the application
31:03
actually got the response maybe after
31:05
many reasons yes
31:24
you might rather you got fresh data than
31:27
stale data you know if I'm you know
31:30
supposing the request is what time it
31:32
what time is it that's a time server I
31:34
sent a request saying Oh what time is it
31:35
and it sends me a response you know yeah
31:38
if I send a request now and I don't get
31:40
the response until 2 minutes from now
31:42
dude some Network issue it may be that
31:46
the application would like prefer to see
31:48
we're gonna get the response it would
31:50
prefer to see a time that was close to
31:53
the time at which had actually got the
31:54
response rather than a time deep in the
31:56
past when it originally sent the request
31:58
now the fact is that if you you know if
32:02
you're using a system like this you have
32:03
to write applications that are tolerant
32:05
of these rules you're using a
32:09
linearizable system like these are the
32:11
rules and so you must write you know
32:13
correct applications must be tolerant of
32:15
you know if they send a request and they
32:17
get a response a while later they just
32:19
you know you can't are not allowed to
32:22
write the application as if oh gosh if I
32:25
get a response that means that the value
32:27
at the time I got the response was equal
32:30
to 3 that is not OK for applications to
32:32
think you know what that I have that
32:34
plays out for a given application
32:35
depends on what the application is doing
32:40
the reason I bring this up is because
32:42
it's a common question in 6 6 8 to 4 you
32:45
guys will implement the machinery by
32:47
which servers detect duplicates and
32:51
resend the previous answer that the
32:56
server originally sent and the question
32:57
will come up is it ok if you originally
33:00
saw the request here to return at this
33:02
point in time the response that you
33:05
would have sent back here if the network
33:08
hadn't dropped it and it's it's handy to
33:11
have a kind of way of reasoning I mean
33:13
one reason to have definitions like
33:15
linearize abilities to be able to reason
33:16
about questions like that right i'm
33:18
using using this scheme we can say well
33:21
it actually is okay by those rules
33:26
all right that's all i want to say about
33:28
linearize ability of any any lingering
33:32
questions yeah
33:45
well you know maybe I'm taking liberties
33:49
here but what's going on is that in real
33:55
time we have a read of - and a read of
33:57
one and the read of one really came
33:59
after in real time the read of two and
34:01
so must come must be in this order in
34:04
the final order that means there must
34:07
have been a right of - somewhere in here
34:11
it's our right with value one somewhere
34:13
in here that is after the read of - in
34:15
the final order right after the read of
34:17
- and before the read of one in that
34:20
order there must be a right with value
34:22
one there's only one right with a value
34:23
unavailable you know if there were more
34:25
than one we maybe could play games but
34:27
there's only one available so this right
34:29
must slip in here in the final order or
34:31
therefore I felt able to draw this arrow
34:38
and these arrows just capture the sort
34:42
of one by one implication of the rules
34:44
on what the order must look like yeah
35:02
all right yeah I mean any hour or X so
35:06
which sorry which which yeah his own rx1
35:16
he sees it before his own rx1 okay so
35:19
the via yep well we're not we're not
35:32
we're not really able to say which of
35:35
these two wheats came first so we can't
35:39
quite for all this error if we mean this
35:41
arrow to constrain the ultimate order
35:43
we're not you know the these two weeds
35:46
could come in either order so we're not
35:48
allowed to say this one came before that
35:49
one it could be there's a simpler cycle
35:52
actually then I've drawn so I mean it
35:55
may be because certainly the that the
35:58
damage is in these four items I agree
36:02
with that these two these four items
36:05
kind of are the main evidence that
36:08
something is wrong
36:09
now whether a cycle I'm not sure whether
36:11
there's a cycle that just involves that
36:13
there could be okay this is worth
36:16
thinking about cuz you know if I can't
36:19
think of anything better or I'll
36:20
certainly ask you a question about
36:21
linearizable histories on midterm
36:31
okay so today's paper today's paper
36:36
zookeeper and I mean part of the reason
36:43
we're even zookeeper paper is that it's
36:45
a successful real world system it's an
36:46
open source you know service that
36:50
actually a lot of people ron has been
36:51
incorporated into a lot of real world
36:53
software so there's a certain kind of
36:55
reality and success to it but you know
36:59
that makes attractive from the point of
37:00
view of kind of supporting the idea that
37:04
the zookeepers design might actually be
37:05
a reasonable design but the reason we're
37:08
interested in in it I'm interested in it
37:10
is for to somewhat more precise
37:12
technical points so why are we looking
37:18
at this paper so one of them is that in
37:23
contrast to raft like the raft you've
37:25
written and raft as that's defined it's
37:27
really a library you know you can use a
37:29
raft library as a part of some larger
37:31
replicated system but raft isn't like a
37:34
standalone service or something that you
37:36
can talk to it's you really have to
37:38
design your application to interact at
37:40
the raft library explicitly so you might
37:45
wonder it's an interesting question
37:46
whether some useful system sort of
37:52
standalone general-purpose system could
37:54
be defined that would be helpful for
37:56
people building separate distributed
37:59
systems like is there serve some service
38:00
that can bite off a significant portion
38:02
of why it's painful to build distributed
38:04
systems and sort of package it up in a
38:06
standalone service that you know anybody
38:09
can use so this is really the question
38:14
of what would an API look like for a
38:16
general purpose I'll call it I'm not
38:24
sure what the right name for things like
38:25
zookeeper is but you've got a general
38:27
purpose coordination service
38:33
and the other question the other
38:37
interesting aspect of zookeeper is that
38:41
when we build replicated systems and
38:44
zookeepers a replicated system because
38:45
among other things it's it's like a
38:47
fault-tolerant
38:48
general-purpose coordination service and
38:51
it gets fault tolerance like most
38:53
systems by replication that is there's a
38:55
bunch of you know maybe three or five or
38:57
seven or who knows what
38:58
zookeeper servers it takes money to buy
39:03
those servers right like a 7 server
39:05
zookeeper setup is 7 times expensive as
39:09
a sort of simple single server so it's
39:13
very tempting to ask if you buy 7
39:16
servers to run your replicated service
39:17
can you get 7 times the performance out
39:20
of your 7 servers right and you know how
39:24
could we possibly do that so the
39:29
question is you know we have n times as
39:31
many servers can that yield us n times
39:35
the performance so I'm gonna talk about
39:42
the second question first so from the
39:46
point of view this discussion about
39:47
performance I'm just going to view
39:50
zookeeper as just some service we don't
39:53
really care what the service is but
39:54
replicated with a raft like replication
39:57
system zookeeper actually runs on top of
39:59
this thing called Zab which for our
40:03
purposes
40:06
we'll just treat as being almost
40:09
identical to the raft and I'm just
40:14
worried about the performance of the
40:15
replication I'm not really worried about
40:17
what zookeepers specifically is up to so
40:20
the general picture is that you know we
40:22
have a bunch of clients maybe hundreds
40:24
maybe hundreds of clients and we have
40:27
just as in the lads we have a leader the
40:35
leader has a zookeeper layer that
40:37
clients talk to and then under the
40:39
zookeeper layer is the xab system that
40:42
manages replication then just like rafts
40:44
what was a a lot of what's that is doing
40:47
is maintaining a log that contains the
40:49
sequence of operations that clients have
40:51
sent in really very similar to raft may
40:57
have a bunch of these and each of them
41:01
has a log but it's a pending new request
41:10
that's a familiar set up so the
41:15
Clinton's in a request and the Zab layer
41:18
you know sends a copy of that request to
41:21
each of the replicas and the replicas
41:24
append this to their in-memory law I'd
41:25
probably persisted onto a disk so they
41:28
can get it back if they crash and
41:29
restart so the question is as we add
41:35
more servers you know we could have four
41:36
servers or five or seven or whatever
41:38
does the system get faster as we add
41:41
more more CPUs more horsepower to it do
41:48
you think your labs will get faster as
41:50
you have more replicas assuming they're
41:53
each replicas its own computer right so
41:56
that you really do get more CPU cycles
41:58
as you add more revenues
42:09
between all the
42:17
yeah yeah there's nothing about this
42:19
that makes it faster as you add more
42:20
servers right it's absolutely true like
42:23
as we have more servers you know the
42:25
leader is almost certainly a bottleneck
42:27
cuz the leader has to process every
42:28
request and it sends a copy of every
42:30
request to every other server as you add
42:31
more servers it just adds more work to
42:33
this bottleneck node right you're not
42:36
getting any benefit any performance
42:37
benefit out of the added servers because
42:39
they're not really doing anything
42:40
they're just all happily doing whatever
42:42
the leader tells them to do they're not
42:45
you know subtracting from the leaders
42:48
work and every single operation goes to
42:50
the leader so for here you know the
42:54
performance is you know inversely
42:56
proportional to the number of servers
42:58
that you add you add more servers this
42:59
almost certainly gets lower because the
43:02
leader just has more work so in this
43:04
system we have the problem that more
43:06
servers makes the system slower that's
43:15
too bad you know these servers cost a
43:16
couple thousand bucks each and you would
43:18
hope that you could use them to get
43:20
better performance yeah
43:33
okay so the question is what if the
43:35
requests may be from different clients
43:38
or successive requests and same client
43:39
or something what if the requests apply
43:41
two totally different parts of the state
43:43
so you know in a key value store maybe
43:45
one of them is a put on X and the other
43:46
was a put on Y like nothing to do with
43:48
each other you know can we take
43:52
advantage of that and the answer that is
43:55
absolutely now not in this framework
43:57
though or it's the center which we can
44:00
take advantage of it it's very limited
44:02
in this framework it could be well at a
44:06
high level the leader the requests all
44:08
still go through the leader and the
44:11
leader still has to send it out to all
44:13
the replicas and the more replicas there
44:15
are the more messages the leader has to
44:17
send so at a high level it's not likely
44:19
to this sort of commutative or community
44:23
of requests is not likely to help this
44:25
situation is a fantastic thought to keep
44:28
in mind though because it'll absolutely
44:29
come up in other systems and people will
44:32
be able to take advantage of it in other
44:34
systems okay so so there's a little bit
44:39
disappointing facts with server hardware
44:41
wasn't helping performance so a very
44:48
sort of obvious maybe the simplest way
44:52
that you might be able to harness these
44:54
other servers is build a system in which
44:57
ya write requests all have to go through
44:59
the leader but in the real world a huge
45:03
number of workloads are read heavy that
45:05
is there's many more reads like when you
45:06
look at web pages you know it's all
45:07
about reading data to produce the web
45:09
page and generally there are very
45:11
relatively few rights and that's true of
45:13
a lot of systems so maybe we'll send
45:15
rights to the leader but send weeds just
45:18
to one of the replicas right just pick
45:21
one of the replicas and if you have a
45:22
read-only request like a get in lab 3
45:24
just send it to one of the replicas and
45:26
not to the leader now if we do that we
45:29
haven't helped rights much although
45:30
we've gotten a lot of read workload off
45:32
the leader so maybe that helps but we
45:33
absolutely have made tremendous progress
45:36
with reads because now the more servers
45:38
we add the more clients we can support
45:42
right because we're just splitting the
45:44
client lead work
45:45
across the different replicas so the
45:48
question is if we have clients send
45:51
directly to the replicas are we going to
45:55
be happy
46:07
yeah so up-to-date does the right is the
46:10
right word in a raft like system which
46:13
zookeeper is if a client sends a request
46:17
to a random replica you know sure the
46:20
replica you know has a copy the log in
46:22
it you know it's been executing along
46:24
with the leader and you know for lab 3
46:26
it's got this key value table and you
46:29
know you do a get for key X and it's
46:31
gonna have some four key exodus table
46:34
and it can reply to you so sort of
46:36
functionally the replicas got all the
46:38
pieces it needs to respond to client to
46:40
read requests from clients the
46:44
difficulty is that there's no reason to
46:47
believe that anyone replicas other than
46:49
the leader is up to date because well
46:54
there's a bunch of reasons why why
46:56
replicas may not be up to date one of
46:57
them is that they may not be in the
46:59
majority that the leader was waiting for
47:02
you think about what raft is doing the
47:04
leader is only obliged to wait for
47:05
responses to its append entries from a
47:07
majority of the followers and then it
47:10
can commit the operation and go on to
47:11
the next operation so if this replica
47:14
wasn't in the majority it may never have
47:16
seen a riot it may be the network
47:18
dropped it and never got it and so yeah
47:20
you know the leader and you know a
47:25
majority of the servers have seen the
47:27
first three requests but you know this
47:30
server only saw the first two it's
47:31
missing B so read to be a read of you
47:35
know what should be there I'll just be
47:37
totally get a stale value from this one
47:40
even if this replica actually saw this
47:45
new log entry it might be missing the
47:47
commit command you know this zookeepers
47:50
app as much the same as raft it first
47:52
sends out a log entry and then when the
47:54
leader gets a majority of positive
47:55
replies the leader sends out a
47:57
notification saying yeah I'm gonna
47:58
committing that log entry I may not have
48:01
gotten the commit and the sort of worst
48:03
case version of this although its
48:04
equivalent to what I already said is
48:05
that for all this client for all client
48:08
to knows this replica may be partitioned
48:14
from the leader or may just absolutely
48:16
not be in contact with leader at all and
48:17
you know the follower doesn't really
48:19
have a way of knowing
48:20
that actually it's just been cut off a
48:23
moment ago from the leader and just not
48:25
getting anything so you know without
48:29
some further cleverness if we want to
48:32
build a linearizable system we can't
48:35
play this game of sending the attractive
48:37
it as it is for performance we can't
48:38
play this game at replicas sending a
48:40
read request to the replicas and you
48:43
shouldn't do it for lab 3 either because
48:44
that 3 is also supposed to be
48:47
linearizable it's any any questions
48:53
about why linearize ability forbids us
48:57
from having replicas serve clients ok
49:02
you know that the proof is the I lost it
49:07
now but the proof was that simple
49:11
reading you know right one right to read
49:13
one example I put on the board earlier
49:16
you not a lot just you know this is not
49:19
allowed to serve stale data in the
49:21
linear linearizable system ok so how
49:28
does how does ooh keep our deal with
49:29
this zookeeper actually does you can
49:31
tell from table two you look in Table
49:33
two zookeepers read performance goes up
49:35
dramatically as you add more servers so
49:38
clearly zookeepers playing some game
49:39
here which allows must be allowing it to
49:41
return read only to serve read only
49:44
requests from the additional servers the
49:46
replicas so how does ooh keeper make
49:50
this safe
49:59
that's right I mean in fact it's almost
50:01
not allowed to say it does need the
50:02
written latest yeah the way zookeeper
50:05
skins this cat is that it's not
50:06
linearizable right they just like to
50:09
find away this problem and say well
50:10
we're not gonna be we're not going to
50:12
provide linearizable reads and so
50:14
therefore you don't are not obliged
50:17
you know zookeepers not obliged to
50:20
provide fresh data to reads it's allowed
50:23
by its rules of consistency which are
50:25
not linearizable to produce stale data
50:28
for Wheaton's so it's sort of solved
50:31
this technical problem with a kind of
50:33
definitional wave of the wand by saying
50:37
well we never owed you them linearizable
50:38
it'll be in the first place so it's not
50:41
a bug if you don't provide it and that's
50:45
actually a pretty classic way to
50:46
approach this to approach the sort of
50:49
tension between performance and strict
50:53
and strong consistency is to just not
50:55
provide strong consistency nevertheless
50:58
we have to keep in the back of our minds
51:00
question of if the system doesn't
51:03
provide linearize ability is it still
51:07
going to be useful right and you do a
51:09
read and you just don't get the current
51:11
answer or current correct answer the
51:12
most latest data like why do we believe
51:14
that that's gonna produce a useful
51:16
system and so let me talk about that so
51:22
first of all any questions about about
51:26
the basic problem zookeeper really does
51:28
allow client to send read-only requests
51:30
to any replica and the replica responds
51:33
out of its current state and that
51:35
replicate may be lagging it's log may
51:37
not have the very latest log entries and
51:39
so it may return stale data even though
51:42
there's a more recent committed value
51:46
okay so what are we left with
51:51
zookeeper does actually have some it
51:55
does have a set of consistency
51:57
guarantees so to help people who write
52:01
zookeeper based applications reason
52:02
about what their applications what's
52:04
actually going to happen when they run
52:05
them so
52:07
and these guarantees have to do with
52:09
ordering as indeed linearise ability
52:10
does so zookeeper does have two main
52:15
guarantees that they state and this is
52:17
section 2.3 one of them is it says that
52:22
rights rights or linearizable now you
52:33
know there are notion of linearizable
52:34
isn't not quite the same in mine maybe
52:37
because they're talking about rights no
52:40
beads what they really mean here is that
52:43
the system behaves as if even though
52:48
clients might submit rights concurrently
52:50
nevertheless the system behaves as if it
52:52
executes the rights one at a time in
52:55
some order and indeed obeys real-time
52:59
ordering of right so if one right has
53:01
seen to have completed before another
53:03
right has issued then do keeper will
53:05
indeed act as if it executed the second
53:07
right after the first right so it's
53:09
rights but not reads are linearizable
53:12
and zookeeper isn't a strict readwrite
53:17
system there are actually rights that
53:20
imply reads also and for those sort of
53:23
mixed rights those those you know any
53:26
any operation that modifies the state is
53:29
linearizable with respect to all other
53:31
operations that modify the state the
53:37
other guarantee of gives is that any
53:42
given client its operations executes in
53:47
the order specified by the client
53:49
they call that FIFO client order
53:56
and what this means is that if a
53:58
particular client issues a right and
54:00
then a read and then a read and a right
54:02
or whatever that first of all the rights
54:05
from that sequence fit in in the client
54:09
specified order in the overall order of
54:13
all clients rights so if a client says
54:15
do this right then that right and the
54:18
third right in the final order of rights
54:21
will see the clients rates occur in the
54:24
order of the client specified so for
54:26
rights this is our client specified
54:32
order and this is particularly you know
54:38
this is a issue with the system because
54:40
clients are allowed to launch
54:41
asynchronous right requests that is a
54:44
client can fire off a whole sequence of
54:46
rights to the leader to the zookeeper
54:49
leader without waiting for any of them
54:51
to complete and in order resume the
54:53
paper doesn't exactly say this but
54:55
presumably in order for the leader to
54:57
actually be able to execute the clients
54:59
rights in the client specified order
55:00
we're imagining I'm imagining that the
55:03
client actually stamps its write
55:04
requests with numbers and saying you
55:07
know I'll do this one first this one
55:08
second this one third and the zookeeper
55:11
leader obeys that ordering right so this
55:14
is particularly interesting due to these
55:15
asynchronous write requests and for
55:19
reads this is a little more complicated
55:25
the reasons I said before don't go
55:27
through the writes all go through the
55:29
leader the reads just go to some
55:31
replicas and so all they see is the
55:33
stuff that happens to have made it to
55:35
that replicas log the way we're supposed
55:38
to think about the FIFO client order for
55:41
reads is that if the client issues a
55:43
sequence of reads again in some order
55:45
the client reads one thing and then
55:47
another thing and then a third thing
55:48
that relative to the log on the replicas
55:53
talking to those clients reads each have
55:59
to occur at some particular point in the
56:00
log or they need to sort of observe the
56:05
state as it as the state existed at a
56:07
particular point
56:08
the log and furthermore that the
56:11
successive reads have to observe points
56:14
that don't go backwards that is if a
56:17
client issues one read and then another
56:18
read and the first read executes at this
56:20
point in the log the second read is that
56:21
you know allowed to execute it the same
56:24
or later points in the log but not
56:26
allowed to see a previous state by issue
56:29
one read and then another read the
56:30
second read has to see a state that's at
56:32
least as up-to-date as the first state
56:34
and that's a significant fact in that
56:41
we're gonna harness when we're reasoning
56:43
about how to write correct zookeeper
56:45
applications and where this is
56:47
especially exciting is that if the
56:50
client is talking to one replica for a
56:52
while and it issues some reads issue to
56:54
read here and then I read there if this
56:56
replica fails and the client needs to
56:59
start sending its read to another
57:00
replica that guaranteed this FIFO client
57:03
or a guarantee still holds if the client
57:07
switches to a new replica and so that
57:08
means that if you know before a crash
57:10
the client did a read that sort of saw
57:13
state as of this point in the log that
57:16
means when the clients wishes to the new
57:18
replicas if it issues another read you
57:20
know it's its previous read executed
57:22
here
57:23
if a client issues another read that
57:25
read has to execute at this point or
57:27
later even though it's switched replicas
57:29
and you know the way this works is that
57:32
each of these log entries is tagged by
57:35
the leader tags it with a Z X ID which
57:39
is basically just a entry number
57:42
whenever a replica responds to a client
57:45
read request it you know executed the
57:47
request at a particular point and the
57:49
replica responds with the Z X ID of the
57:52
immediately preceding log entry back to
57:54
the client the client remembers this was
57:57
the exid of the most recent data you
58:00
know is the highest z x idea i've ever
58:01
seen and when the client sends a request
58:04
to the same or a different replica it
58:07
accompanies their request with that
58:09
highest CX ID has ever seen and that
58:11
tells this other replica aha you know i
58:14
need to respond to that request with
58:16
data that's at least relative to this
58:19
point in a log
58:21
and that's interesting if this you know
58:22
this replicas not up this second replica
58:25
is even less up to date yes was then
58:28
received any of these but it receives a
58:29
request from a client the client says oh
58:31
gosh the last read I did executed this
58:34
spot in the log and some other replica
58:36
this replica needs to wait until it's
58:38
gotten the entire log up to this point
58:41
before it's allowed to respond to the
58:42
client and I'm not sure exactly how that
58:46
works but either the replicas just
58:48
delays responding to the read or maybe
58:51
it rejects the read and says look I just
58:52
don't know the information talk to
58:53
somebody else or talk to me later
58:55
where's eventually the you know this
58:57
replica will catch up if it's connected
58:59
to the leader and then you won't be able
59:01
to respond
59:04
okay so reads are ordered they only go
59:06
forward in time or only go forward in
59:08
sort of log order and a further thing
59:12
which I believe is true about reason
59:14
rights is that reads and writes the FIFO
59:18
client order applies to all of a clients
59:20
all of a single clients requests so if I
59:22
do a write from a client and I send a
59:25
write to the leader it takes time before
59:28
that write is sent out committed
59:29
whatever so I may send it right to the
59:31
leader the leader hasn't processed it or
59:33
committed it yet and then I send a read
59:36
to a replica the read may have to stall
59:39
you know in order to guarantee FIFO
59:41
client order the read and they have to
59:43
stall until this client has actually
59:45
seen and executed the previous the
59:48
client's previous write operation so
59:52
that's a consequence of this type of
59:53
client order is that a reason rights are
59:55
in the same order and you know the way
59:58
the most obvious way to see this is if a
60:00
client writes a particular piece of data
60:03
you know sends a write to the leader and
60:05
then immediately does a read of the same
60:07
piece of data and sends that read to a
60:09
replica boy it better see its own
60:11
written value right if I write something
60:13
to have value 17 and then I do a read
60:16
and it doesn't have value 17 then that's
60:18
just bizarre and it's evidence that gosh
60:21
the system was not executing my requests
60:23
in order because then it would have
60:25
executed the write and then before the
60:27
read so there must be some funny
60:29
business with the replicas stalling
60:31
the client must when it sends a read and
60:33
say look you know I the last write
60:35
request I sent a leader with ZX ID
60:37
something in this replica has to wait
60:39
till it sees that I'm the leader yes
60:53
oh absolutely so I think what you're
60:56
observing is that a read from a replica
60:58
may not see the latest data so the
61:00
leader may have sent out C to a majority
61:03
of replicas and committed it and the
61:06
majority may have executed it but if our
61:08
replica that we're talking wasn't in
61:10
that majority maybe this replica doesn't
61:12
have the latest data and that just is
61:14
the way zoo keeper works and so it does
61:17
not guarantee that we'd see the latest
61:20
data so if there there is a guarantee
61:23
about readwrite ordering but it's only
61:25
per client so if I send a write in and
61:28
then I read that data the system
61:31
guarantees that my bead observes my
61:34
right if you send a right in and then I
61:37
read the data that you wrote this isn't
61:39
does not guarantee that I see your right
61:43
and that's and you know that's like the
61:46
foundation of how they get speed up for
61:50
reads proportional to the number of
61:51
replicas
61:58
but I would say the system isn't
62:00
linearizable and and but it is not that
62:04
it has no properties then the rights are
62:07
certainly many all right all rights from
62:09
all clients form some one at a time
62:11
sequence so that's a sense in which the
62:13
rights all rights are the knee risible
62:16
and each individual clients operations
62:21
may be this means linearizable also it
62:27
may you know this this probably means
62:29
that each individual clients operations
62:31
are linearize well though I'm not quite
62:32
sure you know I'm actually not sure how
62:48
it works but that's a reasonable
62:50
supposition then when I send in an
62:52
asynchronous right the system doesn't
62:54
execute it yet but it does reply to me
62:56
saying yeah you know I got your right
62:57
and here's this yaks ID that it will
62:59
have if it's committed I just like start
63:03
return so that's a reasonable theory I
63:04
don't actually know how it does it and
63:06
then the client if it doesn't read needs
63:11
to tell the replicas look you know
63:12
that's right I did you know if I do a
63:31
read of the data is of the operation
63:42
okay so if you send a read to a replica
63:43
the replicas in return you that you know
63:45
really it's a read from this table is
63:47
what your no way notionally what the
63:49
client thinks it's doing so you client
63:51
says all I want to read this row from
63:52
this table the server this replica sends
63:54
back its current value for that table
63:56
plus the GX ID of the last operation
64:00
that updated that table
64:06
yeah so there's so actually I'm I'm not
64:10
prepared to so the the two things that
64:13
would make sense and I think either of
64:14
them would be okay is the server could
64:17
track this yet for every table row the
64:20
ZX ID of the last right operation that
64:22
touched it or it could just to all read
64:25
requests returned the ZX ID as a last
64:27
committed operation in its log
64:29
regardless of whether that was the last
64:31
operation of touch that row because all
64:34
we need to do is make sure that client
64:36
requests move forward in the order so we
64:38
just need something to return something
64:40
that's greater than or equal to the
64:42
right that last touched the data that
64:45
the client read all right so these are
64:54
the guarantees so you know we still left
65:01
with a question of whether it's possible
65:02
to do reasonable programming with this
65:04
set of guarantees and the answer is well
65:06
this you know at a high level this is
65:08
not quite as good as linearizable it's a
65:11
little bit harder to reason about and
65:12
there's sort of more gotchas like reads
65:14
can return stale data just can't happen
65:15
in a linearizable system but it's
65:18
nevertheless good enough to do to make
65:21
it pretty straightforward to reason
65:22
about a lot of things you might want to
65:27
do with zookeeper so there's a so I'm
65:33
gonna try to construct an argument maybe
65:35
by example of why this is not such a bad
65:38
programming model one reason by the way
65:41
is that there's an out there's this
65:42
operation called sink which is
65:44
essentially a write operation and if a
65:47
client you know supposing I know that
65:49
you recently wrote something you being a
65:51
different client and I want to read what
65:53
you wrote so I actually want fresh data
65:54
I can send in one of these sink
65:57
operations which is effectively well the
66:03
sync operation makes its way through the
66:04
system as if it were a write and you
66:07
know finally showing up in the logs of
66:09
the replicas that really at least the
66:12
replicas that I'm talking to and then I
66:14
can come back and do a read and you know
66:18
I can I can tell the replica basically
66:20
don't serve this read until you've seen
66:23
my last sink and that actually falls out
66:26
naturally from fifl client order if we
66:29
if we countersink as a right then five-o
66:33
client order says reads are required to
66:34
see state you know there's as least as
66:37
up to date is the last right from that
66:39
client and so if I send in a sink and
66:41
then I do read I'm the the system is
66:45
obliged to give me data that's visas up
66:47
to date as where my sink fell in the log
66:49
order anyway if I need to read
66:52
up-to-date data send in a sink then do a
66:54
read and the read is guaranteed to see
66:57
data as of the time the same was entered
67:01
into the log so reasonably fresh so
67:05
that's one out but it's an expensive one
67:06
because you now we converted a cheap
67:08
read into the sink operation which
67:11
burned up time on the leader so it's a
67:14
no-no if you don't have to do but here's
67:17
a couple of examples of scenarios that
67:19
the paper talks about that the reasoning
67:23
about them is simplified or reasonably
67:25
simple given the rules that are here so
67:27
first I want to talk about the trick in
67:29
section 2.3 of with the ready file where
67:32
we assume there's some master and the
67:34
Masters maintaining a configuration in
67:36
zookeeper which is a bunch of files and
67:39
zookeeper that describe you know
67:41
something about our distributed system
67:43
like the IP addresses of all the workers
67:45
or who the master is or something so we
67:48
the master who's updating this
67:51
configuration and maybe a bunch of
67:52
readers that need to read the current
67:54
configuration and need to see it every
67:55
time it changes and so the question is
67:57
you know can we construct something that
67:59
even though updating the configure even
68:02
though the configuration is split across
68:03
many files in zookeeper we can have the
68:05
effect of an atomic update so that
68:09
workers don't see workers that look at
68:11
the configuration don't see a sort of
68:13
partially updated configuration but only
68:15
a completely updated that's a classic
68:19
kind of thing that this configuration
68:23
management that zookeeper people using
68:25
zookeeper for so you know looking at the
68:29
so we're copying what section 2.3
68:31
describes this will say the master is
68:34
doing a bunch of rites to update the
68:36
configuration and here's the order that
68:38
the master for our distributed system
68:41
does the rites
68:43
first we're assuming there's some ready
68:44
file a file named ready and if they're
68:47
ready file exists then the configuration
68:49
is we're allowed to read the
68:50
configuration if they're ready files
68:52
missing that means the configuration is
68:53
being updated and we shouldn't look at
68:55
it so if the master is gonna update the
68:58
configuration file the very first thing
68:59
it does is delete the ready file then it
69:07
writes the various files very zookeeper
69:10
files that hold the data for the
69:13
configuration might be a lot of files
69:15
nose and then when it's completely
69:17
updated all the files that make up the
69:19
configuration then it creates again
69:24
that's ready file
69:28
alright so so far the semantics are
69:31
extremely straightforward this is just
69:33
rights there's only rights here no reads
69:35
rights are guaranteed to execute in
69:37
linear order and I guess now we have to
69:42
appeal the fifl client order if the
69:44
master sort of tags these as oh you know
69:46
I want my rights to occur in this order
69:48
then the reader is obliged to enter them
69:52
into the replicated log in that order
69:53
and so though you know the replicas were
69:56
all dutifully execute these one at a
69:57
time they'll all delete the ready file
69:58
then apply this right in that right and
70:01
then create the ready file again so
70:03
these are rights the orders
70:05
straightforward for the reads though
70:08
it's it's maybe a little bit maybe a
70:13
little more thinking as required
70:14
supposing we have some worker that needs
70:16
to read the current configuration we're
70:21
going to assume that this worker first
70:25
checks to see whether the ready file
70:28
exists it doesn't exist it's gonna you
70:31
know sleep and try again so let's assume
70:33
it does exist let's assume we assume
70:35
that the worker checks to see
70:41
if the ready file exists after it's
70:44
recreated and so you know what this
70:46
means now these are all right requests
70:48
sent to the leader this is a read
70:49
request that's just centrally whatever
70:52
replica the clients talking to and then
70:56
if it exists you know it's gonna read f1
71:00
and B that - the interesting thing that
71:07
FIFO client order guarantees here is
71:10
that if this returned true that is if
71:17
the replica the client was talking to
71:18
said yes that file exists then you know
71:21
as were as that what that means is that
71:24
at least with this setup is that as that
71:27
replica that that replica had actually
71:32
seen the recreate of the ready file
71:33
right in order for this exist to see to
71:38
see the ready file exists and because
71:41
successive read operations are required
71:44
to march along only forwards in the long
71:47
and never backwards that means that you
71:49
know if the replicas the client was
71:52
talking to if it's log actually
71:54
contained and then it executes this
71:56
creative the ready file that means that
71:58
subsequent client reads must move only
72:02
forward in the sequence of rights you
72:07
know that the leader put into the log so
72:09
if we saw this ready that means that the
72:11
read occurs that the replica excuse to
72:13
read down here somewhere after the right
72:16
that created the ready and that means
72:18
that the reads are guaranteed to observe
72:19
the effects of these rights so we do
72:22
actually get some benefit here some
72:24
reasoning benefit from the fact that
72:25
even though it's not fully linearizable
72:28
the rights are linearizable and the
72:30
reads have to read sort of monotonically
72:32
move forward in time to the log yes
72:38
[Music]
72:49
yeah so that's a great question so your
72:52
question is well in all this client
72:54
knows you know if this is the real
72:56
scenario that the creators entered in
72:58
the log and then the read arrives at the
73:01
replica after that replica executed this
73:03
creepy ready then everything's
73:04
straightforward but there's other
73:06
possibilities for how this stuff was
73:07
interleaved
73:08
so let's look at a much more troubling
73:11
scenario so the scenario you brought up
73:21
which I happen to be prepared to talk
73:24
about is that yeah you know the the
73:28
master at some point executed to a
73:31
delete of ready or you know way back in
73:36
time some previous master this master
73:40
created the ready file
73:41
you know after it finished updating the
73:44
state I say ready for I existed for a
73:46
while then some new master or this
73:48
master needs to change the
73:48
configurations release the ready file
73:50
you know it doesn't right right and
73:56
what's really troubling is that the
73:58
client that needs to read this
74:00
configuration might have called exists
74:02
to see whether the ready file exists at
74:06
this time all right and you know at this
74:12
point in time yeah sure the ready file
74:14
exists then time passes and the client
74:16
issues the reads for the maybe the
74:18
client reads the first file that makes
74:22
up the configuration but maybe it you
74:25
know and then it reads the second file
74:26
maybe this file this read comes totally
74:29
after the master has been changing the
74:32
configurations so now this reader has
74:35
read this damaged mix of f1 from the old
74:38
configuration and f2 from the new
74:40
configuration there's no reason to
74:42
believe that that's going to contain
74:44
anything other than broken information
74:46
so so this first scenario was great the
74:49
scenario is a disaster and so now we're
74:52
starting to get into
74:54
of like serious challenges which a
74:57
carefully designed API for coordination
75:01
between machines in a distributed system
75:05
might actually help us solve right
75:07
because like for lab 3 you know you're
75:09
gonna build a put get system and a
75:11
simple lab 3 style put guessed system
75:13
you know it would run into this problem
75:15
too and just does not have any tools to
75:17
deal with it
75:18
but the zookeeper API actually is more
75:21
clever than this and it can cope with it
75:23
and so what actually happens the way you
75:27
would actually use ooh keeper is that
75:29
when the client sent in this exists
75:32
request to ask does this file exist and
75:35
would say not only does this file exist
75:37
but it would say you know tell me if it
75:41
exists even set a watch on that file
75:44
which means if the files ever deleted or
75:47
if it doesn't exist if it's ever created
75:48
but in this case if it if it is ever
75:51
deleted please send me a notification
75:56
and furthermore the notifications that
76:01
zookeeper sends you know it's a the
76:04
reader here it's only talking to some
76:05
replicas this is all the replicas doing
76:08
these things for it the replica
76:09
guarantees to send a notification for
76:13
some change to this ready file at the
76:16
correct point relative to the responses
76:20
to the clients reads and so what that
76:25
means so you know because that the the
76:32
implication of that is that in this
76:34
scenario in which you know these these
76:38
rights sort of fit in here in real time
76:40
the guarantee is that if you ask for a
76:44
watch on something and then you issue
76:45
some reads if that replica you're
76:49
talking to execute something that should
76:51
trigger the watch in during your
76:53
sequence of reads then the replica
76:57
guarantees to deliver the notification
76:59
about the watch before it responds to
77:02
any read that came that you know saw the
77:05
log after the point
77:07
of the OP where the operation that
77:10
triggered the watch notification
77:12
executed and so this is the log on the
77:15
replica and so you know if the so that
77:18
you know the FIFO client ordering will
77:21
say you know each client requests must
77:23
fit somewhere into the log apparently
77:25
these fit in here in the log what we're
77:27
worried about is that this read occurs
77:29
here in the log but we set up this watch
77:32
and the guarantee is that will receive
77:34
the note if if somebody deletes this
77:36
file and we can notified then that
77:39
notification will will appear at the
77:40
client before a read that yields
77:43
anything subsequently in the log will
77:48
get the notification before we get the
77:49
results of any read that's that saw
77:52
something in log after the operation
77:54
that produced the notification so what
77:57
this means that the delete ready is
77:58
gonna since we have a watch on the ready
78:00
file that elite ready is going to
78:02
generate a notification and that
78:05
notification is guaranteed to be
78:07
delivered before the read result of f2
78:10
if f2 was gonna see this second right
78:13
and that means that before the reading
78:15
client has finished the sequence in
78:17
which it looks at the configuration it's
78:19
guaranteed to see the watch notification
78:23
before it sees the results of any write
78:26
that happened after this delete that
78:29
triggered the notification
78:39
who generates the watch as well the
78:42
replica let's say the client is talking
78:43
to this replica and it sends in the
78:45
exists request the exist room has a read
78:48
only request it sends with his replica
78:49
the replica is being painting on the
78:51
side a table of watches saying oh you
78:54
know such-and-such a client asked for a
78:55
watch on this file and furthermore the
78:59
watch was established at a particular Z
79:01
X ID that is did a read that client did
79:03
a read with the replica executed the
79:05
read at this point in the log and return
79:07
results are relative to this point in
79:09
the log members owe that watch is
79:12
relative to that point in the log and
79:14
then if a delete comes in you know for
79:17
every operation that there s Q so it
79:20
looks in this little table it says aha
79:21
you know the a there was a watch on that
79:24
file and maybe it's indexed by hash of
79:27
filename or something
79:37
okay so the question is oh yeah this
79:39
this replica has to have a watch table
79:41
you know if the replica crashes and the
79:46
client is officially different replica
79:48
you know what about the watch table
79:50
right it's already established these
79:51
watch and the answer to that is that no
79:52
the rep your replica crashes the new
79:56
replica you switch to won't have the
79:58
watch table and but the client gets a
80:01
notification at the appropriate point in
80:03
in the stream of responses it gets back
80:06
saying oops your replica you were
80:08
talking to you crashed and so the client
80:11
then knows it has to completely reset up
80:13
everything and so tucked away in in the
80:16
examples are missing event handlers to
80:20
say oh gosh you know we need to go back
80:21
and we establish everything if we get a
80:24
notification that our replicas crashed
80:26
all right I'll continuous