Transcript

Alright, hello everyone, let's get started. Today's topic is causal consistency, and the COPS paper we read is a case study for causal consistency. The setting is familiar: we're talking again about big websites that have data in multiple data centers. They want to replicate all of their data in each of their data centers, to keep a copy close to users and perhaps for fault tolerance. So, as usual, maybe we'll have three data centers, and because we're building big systems we're going to shard the data: every data center has multiple servers, maybe one holding all the keys that start with A through M and another holding the corresponding shard of the rest. We've seen this arrangement before.

The usual goals apply. There are many different designs for how to make this work, but you'd certainly like reads to be fast, because these web workloads tend to be read-dominated; you'd like writes to work; and you'd like as much consistency as you can get. The fast reads are interesting because the clients are typically web servers acting on behalf of users' browsers: there's some set of web servers, which I'll call the clients of the storage system, each really talking to a user's browser. The typical arrangement is that reads happen locally, and writes might be a little more complicated.

One system that fits this pattern is Spanner. Remember that Spanner writes involve Paxos running across all the data centers: if a client in one data center needs to do a write, the write requires Paxos, running on one of these servers, to talk to at least a majority of the other data centers that hold replicas. So writes tend to be a little slow, but they're consistent, and in addition Spanner supports two-phase commit, so we had transactions. The reads are much faster, because reads use the TrueTime scheme the Spanner paper described and really only consult local replicas.

We also read the Facebook memcached paper, which is another design in this same pattern. In the Facebook memcached setup there's a primary site that has the primary set of MySQL databases. If a client wants to do a write, and the primary site is, say, data center 3, it has to send all writes to data center 3, and data center 3 then sends out new values or invalidations to the other data centers. So writes are a little expensive, not unlike Spanner. On the other hand, all the reads are local: when a client needs to do a read, it can consult a memcached server in its local data center, and those memcached servers are blindingly fast; the paper reports that a single memcached server can serve a million reads per second. So again, the Facebook memcached scheme needs cross-data-center communication for writes, but the reads are local.
The question for today, and the question the COPS paper is answering, is whether we can have a system that allows writes to proceed purely locally. From the client's point of view, the client can send a write to the local replica in its own data center, and likewise do reads against just the local replicas, and never have to wait for other data centers, or even talk to them, in order to do a write. What we really want is a system that has local reads and local writes. That's the big goal, and it's really a performance goal: unlike Spanner and the Facebook design, purely local writes would be much faster from the client's point of view. It might also help with fault tolerance and robustness: if writes can be done locally, then we don't have to worry about whether the other data centers are up, or whether we can talk to them quickly, because clients don't need to wait for them.

So we're going to be looking for systems with this level of performance, and we're going to let the consistency model trail along behind the performance. We'll certainly be worried about consistency, because if you initially apply writes only to the local replicas, what about the other data centers' replicas? But the attitude for this lecture, at least, is that once we figure out how to get good performance, we'll then figure out how to define the resulting consistency and think about whether it's good enough.

Okay, so that's the overall strategy. I'm going to talk about two strawman designs, okay-but-not-great designs, on the way to how COPS actually works. First I want to talk about the simplest design I can think of that follows this local-write strategy; I'll call it strawman 1.

In strawman 1 we have three data centers, and let's assume the data is sharded two ways in each of them: keys from A through M on one server and the rest on another, sharded the same way in each data center. Clients read locally. If a client writes, say, a key that starts with M, it sends the write to the local shard server responsible for keys starting with M, and that shard server returns a reply to the client immediately, saying yes, I did your write. In addition, each server maintains a queue of outstanding writes that clients have recently sent to it and that it needs to send to the other data centers, and it streams those writes asynchronously, in the background, to the corresponding servers in the other data centers. So after replying to the client, our shard server sends a copy of the client's write to each of the other data centers; those writes go through the network, maybe they take a long time, and eventually they arrive at the other data centers, where each of those shard servers applies the write to its local table of data.
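Here's a minimal sketch, in Go, of what a strawman-1 shard server might look like; the types and the channel-based stand-in for the network are my own illustrative assumptions, not anything from the paper or the lecture.

// Sketch of a strawman-1 shard server: apply the put locally, acknowledge
// immediately, and let a background goroutine stream queued writes to the
// corresponding shard servers in the other data centers.
package strawman1

import "sync"

type Write struct {
	Key, Value string
}

type ShardServer struct {
	mu    sync.Mutex
	data  map[string]string
	queue chan Write     // outstanding writes not yet pushed elsewhere
	peers []chan<- Write // stand-in for the link to each remote data center
}

// Put applies the write locally and returns right away; the client never
// waits for the other data centers.
func (s *ShardServer) Put(key, value string) {
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
	s.queue <- Write{key, value}
}

// Get is purely local.
func (s *ShardServer) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

// replicate runs in the background, streaming queued writes to each
// remote data center asynchronously.
func (s *ShardServer) replicate() {
	for w := range s.queue {
		for _, p := range s.peers {
			p <- w
		}
	}
}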
This is a design with very good performance: reads and writes are all done locally, clients never have to wait, and there's a lot of parallelism. The shard server for A and the shard server for M operate independently: if the shard server for A gets a write, it has to push that data to the corresponding shard servers in the other data centers, but it can do those pushes independently of the other shard servers' pushes. So there's parallelism both in serving and in pushing the writes around.

If you think about it a bit, this design also essentially favors reads: reads never have any impact beyond the local data center. Writes do a bit more work; the client doesn't have to wait, but the shard server then has to push the write out to the other data centers. So reads involve less work than writes, which is appropriate for a read-heavy workload. If you were more worried about write performance, you could imagine other designs, for example one in which reads consult multiple data centers and writes are purely local: when you do a read, you fetch the current copy of the key from each of the other data centers and choose the most recent one, so writes are very cheap and reads are expensive. Or you can imagine combinations of the two, some sort of quorum-overlap scheme in which you write to a majority of data centers, read from a majority of data centers, and rely on the overlap. In fact there are real systems, used commercially on real websites, that follow much this design; if you're interested in a real-world version of it you can look up Amazon's Dynamo system or the open-source Cassandra system. They're much more elaborate than what I've sketched here, but they follow the same basic pattern.

The usual name for this kind of scheme is eventual consistency. The reason for that name is that, at least initially, if you do a write, readers at other data centers are not guaranteed to see it, but they will someday, because you're pushing the writes out, so they'll eventually see your data. There's no guarantee about order. For example, if I'm a client and I write a key starting with M and then a key starting with A, the shard server for M sends out my write, and the server for A sends out my write for A, but these may travel at different speeds or over different routes on the wide-area network. So maybe the client wrote M first and then A, but at one data center the update for A arrives first and then the update for M,
while they arrive in the opposite order at another data center. Different clients are going to observe updates in different orders; there's no ordering guarantee.

The sense in which it's eventually consistent, the ultimate meaning, is that if things settle down, people stop writing, and all of these write messages finally arrive at their destinations and are processed, then an eventually consistent system ought to end up with the same value stored at all of the replicas. If you wait for the dust to settle, everybody ends up with the same data. That's a very weak spec, but because it's a loose spec there's a lot of freedom in the implementation and a lot of opportunity to get good performance, because the system basically doesn't require you to do anything instantly or to observe any ordering rules. It's quite different from most of the consistency schemes we've seen so far. As I mentioned, eventual consistency is used in deployed systems, but it can be quite tricky for application programmers.

So let me sketch an example of something you might want to do on a website where you'd have to be pretty careful, and where you might be surprised, if this is an eventually consistent system. Suppose we're building a website that stores photos. Every user has a set of photos stored as key/value pairs, with some sort of unique ID as the key, and every user maintains a list of their public photos that they allow other people to see. Suppose I take a photograph and want to insert it into this system: I, the human, contact a web server, and the web server runs code that inserts my photo into the storage system and then adds a reference to my photo to my photo list. Let's say this happens on client C1, which is the web server I'm talking to. The code calls the put operation for my photo (I'm just writing it as a key and a value), and when that put finishes it adds the photo to my list with a second put. That's what my client's code looks like. Somebody else who wants to look at my photographs is going to fetch a copy of my list of photos and then look at the photos on the list: client C2 calls get on my list, sees the photo I just uploaded on the list, and then calls get on the key for that photo.

This is totally straightforward code that looks like it ought to work, but in an eventually consistent system it's not necessarily going to work. The problem is that these two puts, even though the client issued them in the obvious order (first insert the photo, then add a reference to it to my list of photos), in the eventually consistent scheme I've outlined the second put could arrive at other data centers before the first put.
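To make the example concrete, here's roughly what the two clients' code looks like, against a hypothetical eventually consistent put/get interface; the key names and the string-valued photo list are simplifications of mine.

// Writer (client C1) and reader (client C2) for the photo example, against
// a plain eventually consistent store: nothing here prevents the list
// update from becoming visible elsewhere before the photo itself.
package photos

import "strings"

type Store interface {
	Put(key, value string)
	Get(key string) string
}

// C1: insert the photo, then add a reference to it to my public photo list.
func UploadPhoto(s Store, photoKey, photoData, listKey string) {
	s.Put(photoKey, photoData)                  // first put: the photo itself
	s.Put(listKey, s.Get(listKey)+" "+photoKey) // second put: publish it on the list
}

// C2, possibly at another data center: read the list, then fetch a photo
// that appears on it.
func ViewPhoto(s Store, listKey, photoKey string) string {
	list := s.Get(listKey)
	if !strings.Contains(list, photoKey) {
		return "" // photo not published yet
	}
	return s.Get(photoKey) // may still be missing: the anomaly discussed below
}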
So this other client, if it's reading at a different data center, might see the updated list with my new photo in it, but when it goes to fetch the photo that's on the list, the photo may not exist yet, because the first write may not have arrived over the wide area at the data center client C2 is using. This is going to be a routine occurrence in an eventually consistent system if we don't do anything more clever. This kind of behavior, where the code looks like it ought to work at some intuitive level, but when you actually read the spec for the system (which says essentially no guarantees) you realize that this correct-looking code may not do what you think, is often called an anomaly. The way to think about it is not that this behavior, seeing the photo on the list while the photo doesn't exist yet, is an error; it's not incorrect, because the system never guaranteed that this get would actually yield the photo. It's just weaker than you might have hoped.

It's still possible to program such a system, and people do it all the time; there are a whole lot of tricks you can use. For example, a defensive programmer might write code knowing that if you see something on a list it may not really exist yet, so if you see a reference to a photo in the list and the get of the photo comes back empty, you just wait a little and retry, because by and by the photo will probably show up, and if it doesn't you skip it and don't display it to the user. So it's totally possible to program in this style, but we could definitely hope for behavior from the storage system that's more intuitive than this, behavior that would make the programmer's life easier. We can imagine systems that have fewer anomalies than this very simple eventually consistent system.

Okay, before I go on to talking about how to make the consistency a little bit better, I want to discuss something important I left out of this eventual-consistency design: how to decide which write is most recent. If a piece of data might ever be written by more than one party, there's the possibility that we'll have to decide which data item is newer. Suppose we have some key, call it K, and two clients launch writes for it: one client writes the value 1 and another client writes the value 2. We need to set things up so that all three data centers agree on what the final value of K is, because after all we're at least guaranteeing eventual consistency: when the dust settles, all the data centers must have the same data. So data center 3 is going to receive these two writes and pick one of them as the final value for K.
Of course data center 2 sees the same pair of writes, including its own, and they all had better make the same decision about which one becomes the final value, regardless of the order the writes arrived in. Data center 3 may observe them arriving in one order and some other data center may observe them in a different order, so we can't just accept whichever arrives second as the final value. We need a more robust scheme for deciding what the most recent value for a key is, so we're going to need some notion of version numbers.

The most straightforward way to assign version numbers is to use wall-clock time. The idea is that when a client generates a put, either it or the local shard server it talks to looks at the current time (oh, it's 1:25 right now) and associates that time as the version number of its version of the key. We both store the timestamp in the database and annotate the write messages sent between data centers with the time. Say one write was stamped at time 102 and the other at 103. If the write stamped 103 arrives first, data center 3 puts the key in its database with timestamp 103, and when the write stamped 102 arrives, data center 3 says, oh, that's actually an older write, I'm going to ignore it, because it has a lower timestamp than the one I already have. If they arrive in the other order, the data center briefly stores the 102 write until the write with the higher timestamp arrives and replaces it. Since everybody sees the same timestamps, once they've finally received all these write messages over the Internet, they're all going to end up with databases holding the highest-numbered values.

So this almost works, but there are two little problems with it. One is that two data centers that do writes at the same time may assign the same timestamp. That's relatively easy to solve, and the way it's typically done is that timestamps are actually pairs: the time in the high bits, and in the low bits some sort of identifier, which could be almost anything as long as it's unique, such as the data center name or ID, or a server ID. Then if two writes from different data centers carry the same time, they'll have different low bits, and those low bits disambiguate which of the two writes has the lower timestamp and therefore should yield to the one with the higher. So we're going to stick some sort of ID in the bottom bits; the paper talks about doing this, and it's very common.
The other problem is that this scheme only works well if the clocks at all the data centers are closely synchronized, which is something the Spanner paper stressed at great length. If the clocks on all the servers at all the data centers agree, this is going to be okay, but if the clocks are off by seconds, or maybe even minutes, then we have a serious problem. One not-so-important symptom is that writes that came earlier in real time, which should be overwritten by later writes, might be assigned high timestamps because their server's clock runs fast, and therefore never be superseded by writes that actually came later. Now, we never made any guarantees about this; it's eventual consistency, and we never said writes that come later in time will win over earlier writes. Nevertheless, even with weak consistency we don't want needlessly strange behavior, like a user writing something, updating it later, and the later update never seeming to take effect because the earlier update was assigned a timestamp that's too large. In addition, if some server's clock is too far ahead, say a minute fast, then for a whole minute nobody else's writes can win: we'd have to wait for all the other servers' clocks to catch up to the fast server's clock before anybody else can do a write that takes effect.

One way to solve that problem is an idea called Lamport clocks. The paper talks about this, although it doesn't really say why it uses Lamport clocks; I'm guessing it's at least partially for the reason I just outlined. A Lamport clock is a way of assigning timestamps that are related to real time but that cope with the problem of some servers having clocks that run too fast. Every server keeps a value, call it Tmax, which is the highest version number it has seen so far from anywhere. So if somebody else is generating timestamps that are ahead of real time, the other servers that see those timestamps will have Tmax values that reflect being ahead of real time. Then, when a server needs to assign a timestamp, a version number, to a new put, it takes the maximum of Tmax plus one and the wall-clock time. That means each new version number (these are the version numbers that accompany values in our eventually consistent system) is higher than the highest version number seen so far, so higher than whatever the last write was to the data we're updating, and at least as high as real time. If nobody's clock is ahead, Tmax plus one will probably be smaller than real time, and the timestamp will end up being the real time. If some server has a crazy clock that's too fast, its updates will advance everyone else's Tmax, so that when they allocate new version numbers, those are higher than the version number of the latest write they saw from the server whose clock is too fast.
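As a concrete illustration, here's a small sketch of this version-number assignment; the exact packing of the time and the node ID into one integer is my own choice, just to show the time-plus-unique-ID pair idea from above.

// Lamport-clock style version numbers: max(Tmax+1, wall clock), with a
// unique node ID in the low bits as a tie-breaker between servers.
package versions

import (
	"sync"
	"time"
)

type Clock struct {
	mu     sync.Mutex
	tmax   uint64 // highest time component seen so far, from anywhere
	nodeID uint64 // unique per server or data center; assumed to fit in 16 bits
}

// Next returns the version number for a new put.
func (c *Clock) Next() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	t := c.tmax + 1
	if now := uint64(time.Now().UnixMilli()); now > t {
		t = now // nobody is ahead of real time, so just use the wall clock
	}
	c.tmax = t
	return t<<16 | c.nodeID // pack as <time, nodeID>
}

// Observe is called for every version received from another server, so a
// fast clock elsewhere drags everyone's Tmax (and future versions) upward.
func (c *Clock) Observe(version uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t := version >> 16; t > c.tmax {
		c.tmax = t
	}
}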
Okay, so that's Lamport clocks, and that's how the paper assigns version numbers; they come up all the time in distributed systems.

All right, another problem I want to bring up about our eventually consistent system is the question of what to do about concurrent writes to the same key. It's actually even worse: concurrent writes might both carry important information that ought to be preserved. For example, suppose two different clients, client 1 and client 2, both issue a put to the same key, and both of these get sent to data center 3. The question is what data center 3 does with the information in the one write and the information in the other. This is a real puzzle, and there isn't a good answer. What the paper uses is last-writer-wins: data center 3 looks at the version number assigned to each write; one of them will be higher, because it's slightly later in time or has a higher data-center ID in the low bits, and data center 3 simply throws away the data with the lower timestamp and accepts the data with the higher timestamp. That's it. This last-writer-wins policy has the virtue that it's deterministic, and everybody is going to reach the same answer.

But you can think of examples in which it's not what people want. For instance, suppose what these puts are trying to do is increment a counter: both clients saw the counter with value 10, both added one, and both put 11. What we really wanted was for both increments to take effect, ending with the value 12. In that case last-writer-wins is really not that great; what we would have wanted was for data center 3 to somehow combine this increment and that increment and end up with 12. These systems are really not generally powerful enough to do that, but we would like better: what we'd really like is more sophisticated conflict resolution.

Other systems we've seen solve this. The most powerful support real transactions: instead of just put and get, they have increment operations that do atomic, transactional increments, so increments aren't lost; transactions are maybe the most powerful way of resolving conflicting updates. We've also seen systems that support a notion of mini-transactions, where at least on a single piece of data you can have atomic operations like atomic increment or atomic test-and-set. You can also imagine wanting a system that does some sort of custom conflict resolution. Suppose the value we're keeping here is a shopping cart with a bunch of items in it, and our user, because they're running two windows in their browser, adds two different items to their shopping cart through two different web servers. We'd like those two conflicting writes to the same shopping cart to resolve, probably by taking the set union of the two shopping carts, instead of throwing one away and accepting the other.
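For contrast with that richer, application-specific resolution, here's a minimal sketch of the default last-writer-wins rule described above, with the version represented as the time-plus-ID pair from earlier; the struct names are illustrative.

// Last-writer-wins: keep whichever write carries the higher version,
// regardless of arrival order, so every data center converges to the
// same final value (and the losing write's information is simply lost).
package lww

type Version struct {
	Time   uint64 // Lamport/wall-clock component
	NodeID uint64 // tie-breaker: unique server or data-center ID
}

// Less reports whether version a is older than version b.
func (a Version) Less(b Version) bool {
	if a.Time != b.Time {
		return a.Time < b.Time
	}
	return a.NodeID < b.NodeID
}

type Stored struct {
	Value   string
	Version Version
}

// Apply installs an incoming write only if it is newer than what is stored.
func Apply(table map[string]Stored, key, value string, v Version) {
	if cur, ok := table[key]; !ok || cur.Version.Less(v) {
		table[key] = Stored{Value: value, Version: v}
	}
}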
I'm bringing this up not because there's a satisfying solution; indeed the paper doesn't really propose much of a solution. It's just a drawback of weakly consistent systems that it's easy to get into situations where you have conflicting writes to the same data that you'd like to resolve in some sophisticated, application-specific way, but that's generally quite hard. It's a thorn in people's sides that typically has to be lived with, and that goes both for the eventual consistency of my strawman here and for COPS; the paper spends a couple of paragraphs on how its machinery could be used to do better, but doesn't really explore it, because it's difficult.

Okay, back to eventual consistency, my strawman system. If you recall, it had a real problem with even this very simple scenario: I do a put of a photo and a put of my photo list, and then somebody else at a different data center reads the new list, but when they read the photo they find there's nothing there. So can we do better? Can we build a system that still allows local reads and local writes but has slightly fewer anomalies? I'm going to propose one that's one step closer to the paper: this is strawman 2.

In this scheme I'm going to propose a new operation, not just put and get but also a sync operation that clients can use. Sync takes a key and a version number, and what sync does, when a client calls it, is wait until all data centers' copies of key K are at least as up to date as the specified version number. It's a way of forcing order: the client can say, look, I'm going to wait until every data center knows about this value, and only proceed after every data center knows about it. In order for clients to know what version numbers to pass to sync, we'll change the put call a bit, so that you say put(key, value) and put returns the version number of the updated key. You could think of sync as acting as a sort of barrier or fence; we could call this scheme eventual consistency plus barriers.

I'm going to talk about how to use it in a moment, but keep in mind that this sync call is likely to be pretty slow, because the natural implementation of it actually goes out and talks to all the other data centers, asks each one whether its version of key K is up to at least this version number, and then waits for all the data centers to respond; if any of them says no, it has to keep waiting.
All right, so how would you use this? Again, for our photo list: now client 1, the one updating photos, calls put to insert the photo and gets back a version number. There's a danger that if it immediately updates the photo list, some other data center may not have seen my photo yet. So the programmer next calls sync on the photo's key with the version number that put returned, waiting for all data centers to have that version, and only after the sync returns does client 1 call put to update the photo list.

Now if client 2 comes along, it reads the photo list and then reads the photo: client 2 does a get of the photo list (let's say time is passing downward, same as before), and if it sees the photo on that list it does a get, again in its local data center, of the photo. Now we're in a much better situation. If client 2, in a different data center, saw the photo in the list, that means client 1 had already called put on the list, because it's that put that adds the photo to the list. And if client 1 had already called put on the list, then, given the way this code works, it had already called sync, and sync doesn't return until the photo is present at all data centers. So the programmer for client 2 can rely on this: if the photo is in the list, then whoever added the photo to the list had their sync complete, and the fact that the sync completed means the photo is present everywhere, and therefore this get of the photo will actually return the photograph.

Okay, so this works, and it's actually reasonably practical. It does require fairly careful thought on the part of the programmer, who has to think: aha, I need a sync here, I need put, sync, put in order for things to work out right. The reader is much faster, but the reader still needs to think: if the code does a get of the list and then a get of a photo from that list, the programmer has to verify that the code that modified the list indeed called sync before adding things to the list. That is quite a bit of thought. What the sync call is all about is enforcing order: making sure the first put completely finishes before the second one happens. The sync explicitly forces order for writers, and readers also have to think about order. The order is obvious in this example, but it's true that if the writer did a put, then a sync, then a put of a second thing, then readers almost always need to read the second thing and then read the first thing, because the guarantee you get out of this sync scheme, out of these barriers, is that if a reader sees the second piece of data, they're guaranteed to also be able to see the first piece of data. So the reader needs to read the second piece of data first and then the first piece of data.

Okay, there's a question about fault tolerance, namely: if one data center goes down, doesn't that mean sync blocks until that data center is brought back up? That's absolutely right; you're totally correct, this is not a great scheme. This is a strawman on the way to COPS, and this sync call would block.
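For comparison with the earlier sketch, here's the same photo example rewritten against a hypothetical strawman-2 interface, where Put returns a version and Sync blocks until every data center has at least that version; the interface is an assumption of mine, matching the description above.

// Photo example on top of "eventual consistency plus barriers":
// put the photo, sync it everywhere, and only then publish it on the list.
package photos2

import "strings"

type Version uint64

type Client interface {
	Put(key, value string) Version
	Get(key string) (string, Version)
	Sync(key string, v Version) // wait until all data centers have key at >= v
}

// Writer (client C1).
func UploadPhoto(c Client, photoKey, photoData, listKey string) {
	v := c.Put(photoKey, photoData)
	c.Sync(photoKey, v) // barrier: every data center now has the photo
	list, _ := c.Get(listKey)
	c.Put(listKey, list+" "+photoKey)
}

// Reader (client C2), at any data center: reads the list first, then the
// photo. If the photo is on the list, the writer's Sync already finished,
// so the photo get cannot come back empty.
func ViewPhoto(c Client, listKey, photoKey string) (string, bool) {
	list, _ := c.Get(listKey)
	if !strings.Contains(list, photoKey) {
		return "", false
	}
	photo, _ := c.Get(photoKey)
	return photo, true
}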
The way real-world versions of this avoid that problem (if a data center is down, will sync block forever?) is that puts and gets both actually consult a quorum of data centers, so the sync only waits for, say, a majority of data centers to acknowledge that they have the latest version of the photo, and a get has to consult an overlapping majority of data centers in order to get the data. So real versions of this are not quite as rosy as I may be implying; again, the systems that work this way, if you're interested, are Dynamo and Cassandra, and they use quorums to avoid the blocking problem.

Okay, so this is a straightforward design with decent semantics, even though it's slow and, as you observed, not very fault tolerant. The read performance is outstanding, because reads are still purely local, at least if the quorum setup is read-one/write-all. The write performance is not great, but it's okay if you don't write very much, or if you don't mind waiting. One reason you can maybe convince yourself that the write performance is not a disaster is that, after all, the Facebook memcached paper has to send all writes through the primary data center: Facebook runs multiple data centers and clients talk to all of them, but the writes all have to be sent to the MySQL databases at the one primary data center. Similarly, Spanner writes have to wait for a majority of replica sites to acknowledge the write before the client is allowed to proceed. So the notion that writes might have to wait to talk to other data centers in order to allow reads to be fast does not appear to be outrageous in practice.

Still, you might like to have a system that does better than this: to somehow get the semantics of sync, where sync forces the first put to definitely appear to everyone to happen before the second put, but without the cost. So we'll be interested in systems, and this is starting to get close to what COPS does, in which, instead of forcing the client to wait at that point, we somehow encode the order as a piece of information that we tell the readers, or tell the other data centers.

A simple way to do that, which the paper mentions as a non-scalable implementation, is a logging approach. At each data center, instead of having the different shard servers talk to their counterparts in the other data centers independently, there's a designated log server that's in charge of sending writes to the other data centers. That means that if a client does a put to its local shard, that shard, instead of sending the data out separately to the other data centers, talks to its local log server
and appends the write to the one log that this data center is accumulating. If a client then does a write to a different key, say we're writing key A and then key B, then again, instead of the B shard server sending its write out independently, it tells the local log server to append the write to the log, and the log server sends its log out to the other data centers in log order. All data centers are guaranteed to see the write to A first, and they'll process that write to A first, and then all data centers see the write to B. So if a client does a write to A first and then a write to B, the writes show up in that order: the log holds A and then B, and the write to A is sent first and then the write to B to each of the other data centers. They probably have to be sent to a single log-receiving server at each destination, which plays out the writes one at a time, as they arrive, in log order.

So this is the logging strategy that the paper criticizes. It actually regains the performance we want, because now we've eliminated the syncs: clients can go back to just doing a put of A and then a put of B, and a client's put can return as soon as the data is sitting in the log at the local log server. So client puts and gets are quite fast again, but we're preserving order, basically through the sequence numbers of the entries in the logs rather than by having the clients wait. That's nice: we're now forcing ordered writes, and we're causing the writes to show up in order at the other data centers, so reading clients will see them in order, and my example application might actually work out with this scheme.

The drawback the paper points to about this style of solution is that all the writes now have to go through this one log server. If we have a big database with maybe hundreds of servers serving, in total, a reasonably high write workload, all the writes have to go through the log server, and possibly all the writes have to be played out through a single receiving log server at the far end. As the system grows and there get to be more and more shards, a single log server may stop being fast enough to process all these writes. So COPS does not follow this approach to conveying the ordering constraints to other data centers.
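Before moving on, here's a minimal sketch of that single-log idea, just to show where the bottleneck sits; the channel-based transport is an illustrative stand-in for the real network, not anything the paper specifies.

// One log server per data center: every local shard appends its writes
// here, and a single loop ships the log to each remote data center in
// order. Ordering is trivial, but this one server sees every write.
package logship

type Write struct {
	Key, Value string
}

type LogServer struct {
	appendCh chan Write     // all local shard servers append here
	remotes  []chan<- Write // one in-order stream per remote data center
}

// Append is called by a local shard server right after it applies a put.
func (l *LogServer) Append(w Write) {
	l.appendCh <- w
}

// ship forwards writes strictly in log order, so remote data centers apply
// them in the order clients issued them.
func (l *LogServer) ship() {
	for w := range l.appendCh {
		for _, r := range l.remotes {
			r <- w
		}
	}
}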
Okay, so we want to build a system that can, at least from the client's point of view, process writes and reads purely locally; we don't want clients to wait in order to get ordering. We like the fact that writes are being forwarded asynchronously, but we somehow want to eliminate the central log server: we want to convey order information to the other data centers without funneling all our writes through a single log server. And that brings us to what COPS is actually up to. What I'm going to talk about now is what COPS does, though I'll be describing the non-GT version of COPS, that is, COPS without get-transactions.

The basic strategy is that as COPS clients read and write locally, they accumulate information about the order in which they're doing things, information that's a little more fine-grained than in the logging scheme, and that information is sent to the remote data centers whenever a client does a put. So we have this notion of a client context. Say a client does some gets and puts: maybe a get of X, then a get of Y, then a put of Z with some value. The library the client uses, the one that implements put and get, accumulates this context information on the side as the puts and gets occur. If the get of X yields a value with version 2 (just as an example: the get returns the current value of X, and that current value has version 2), then the context records that this client has read X and got version 2. Then, after the get of Y, which say returns the current value with version 4, the COPS client library adds to the context, so it's not just that we've read X and got version 2, but also that we've read Y and got version 4.

When the client does the put, the information sent to the local shard server is not just the key and the value but also these dependencies: we're telling the local shard server for Z that this client, before doing the put, had already read X and got version 2, and read Y and got version 4. What's going on is that the client is expressing ordering information: this put of Z came after the client had seen X version 2 and Y version 4, so anybody else who reads this version of Z had also better be able to see X and Y at at least those versions. Similarly, if the client then does a put of something else, say Q, what's sent to the local shard server is not just Q and its value, but also the fact that this client had previously done some gets and a put. Suppose the put of Z yielded version 3, that is, the local shard server said it assigned version 3 to your new value for Z; then when we come to do the put of Q, it's accompanied by dependency information saying that this put of Q comes after the put that created Z version 3. At least notionally the rest of the context ought to be passed as well, although we'll see that for various reasons COPS can optimize that away, and when there's a preceding put it only sends the version information for that put.

There's a question: is it important for the context to be ordered? I don't believe so. I think it's sufficient to treat the context, or at least the information that's sent with the put, as just a big bag of dependencies, at least for non-transactional COPS. So the clients accumulate this context and basically send it with each put, and the context encodes the order information that in my previous strawman, strawman 2, was forced by sync. Instead of waiting, we accompany each put with the statement that this put needs to come after these previous values.
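Here's a rough sketch of what the client library's context tracking might look like; the interface and names (Dep, putAfter, and so on) are mine, not the paper's API, and the collapse of the context after a put is the optimization just mentioned, which we'll justify later.

// Client-side context tracking: gets record <key, version> dependencies,
// and each put carries the accumulated bag of dependencies to the local
// shard server, then collapses the context to just the put's own version.
package copsclient

type Version uint64

type Dep struct {
	Key     string
	Version Version
}

// local is whatever talks to the shard servers in this data center.
type local interface {
	get(key string) (value string, v Version)
	putAfter(key, value string, deps []Dep) Version
}

type Context struct {
	srv  local
	deps []Dep // unordered "bag" of dependencies since the last put
}

// Get reads locally and remembers which version was observed.
func (c *Context) Get(key string) string {
	value, v := c.srv.get(key)
	c.deps = append(c.deps, Dep{key, v})
	return value
}

// Put sends the write plus its dependencies, then replaces the context
// with just the new version: everything earlier is carried transitively.
func (c *Context) Put(key, value string) Version {
	v := c.srv.putAfter(key, value, c.deps)
	c.deps = []Dep{{key, v}}
	return v
}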
COPS calls these relationships, in which a put needs to come after certain previous values, dependencies, and it writes them like this: supposing our put produces Z version 3, there are really two dependencies here, one saying that X version 2 comes before Z version 3, and the other saying that Y version 4 comes before Z version 3. That's just the definition, or notation, the paper uses to talk about the individual pieces of ordering information that COPS needs to enforce.

All right, so what does this dependency information, passed to the local shard server, actually cause COPS to do? When a COPS shard server receives a put from a local client, it first assigns the new version number, then stores the new value (it stores, for Z, the new value along with the version number it allocated), and then sends the whole thing to each of the other data centers. At least in non-GT COPS, the local shard server only remembers the key, the value, and the latest version number; it doesn't actually remember the dependencies, it only forwards them across the network to the other data centers.

So now the position we're in is this: say a client produced a put of Z and some value, it was assigned version number 3, and it had these dependencies, X version 2 and Y version 4. This is sent from data center 1, let's say, to the other data centers, so data center 2 and data center 3 both receive it. In fact this information is sent by the shard server for Z; there are lots of shard servers, but only the shard for Z is involved. At data center 3, the shard server for Z receives this put, forwarded by the shard server the client talked to, with the dependency information that X version 2 and Y version 4 come before Z version 3. What that really means, operationally, is that this new version of Z can't be revealed to clients until its dependencies, those versions of X and Y, have already been revealed to clients in data center 3. That means the shard server for Z must delay applying this write to Z until it knows that the two dependencies are visible in the local data center. So the Z server has to go off to the local shard server for X and the shard server for Y and send each a message asking what its current version number for X, or for Y, is, and wait for the result. If both of those shard servers reply with a version number that's 2 or higher for X, and 4 or higher for Y, then the Z server can go ahead and apply the put to its local table of data.
However, maybe those two shard servers haven't yet received the updates that correspond to X version 2 or Y version 4. In that case the shard server for Z has to hold on to this update until the indicated versions of X and Y have actually arrived and been installed on those two shard servers, so there may be some delays. Only after the dependencies are visible at data center 3 can the shard server for Z go ahead and update its table so that Z has version 3. And what that means, of course, is that if a client at data center 3 does a read of Z and sees version 3, then, because the server already waited, if that client then reads X or Y it's guaranteed to see at least version 2 of X and at least version 4 of Y, because the server didn't reveal Z until it was sure the dependencies would be visible.

Okay, a question: what if X and Y never get their values, perhaps due to a network partition? Does the Z shard block forever? Yes, the semantics require the Z shard to block forever; that's absolutely true. There's certainly an assumption here that things turn out okay in one of two ways. One is that somebody repairs the network, or repairs whatever was broken, and X and Y do eventually get their updates, so Z will finally be able to apply the update, though it might have to wait a long time. The other possibility is that the data center is entirely destroyed, the building burns down, and then we don't have to worry about this at all. But it does point out a problem that's a real criticism of causal consistency: these delays can actually be quite nasty. You can imagine that Z is waiting for the correct value of X to arrive, and even if there are no failures and nothing burns down, mere slowness can be irritating: Z may have to wait for X to show up. And it could be that X has already showed up and arrived at its shard server, but it itself had dependencies, maybe on a key A, so that shard server can't install it until the update for A arrives, and Z still has to wait for that, because what Z is waiting for is for this version of X to be visible to clients, which means installed. So if the update for X has arrived but is itself waiting for some other dependency, we can get these cascading dependency waits. In real life these probably would happen, and it's one of the problems people bring up against causal consistency when you try to persuade them it's a good idea, this problem of cascading delays. That's too bad, although on that note it is true that the authors of the COPS paper have a couple of interesting follow-on papers, and one of them has some mitigations for this cascading-wait problem.
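Here's a minimal sketch of the remote side of this, assuming a hypothetical depCheck helper that blocks until a given key is visible locally at a given version; that blocking call is exactly where the cascading waits just described would show up.

// A remote shard server applying a replicated put: wait until every
// dependency is visible in this data center, then install the new value
// (last-writer-wins on the version number).
package copsserver

type Version uint64

type Dep struct {
	Key     string
	Version Version
}

type ReplicatedPut struct {
	Key     string
	Value   string
	Version Version
	Deps    []Dep // e.g. {X, v2} and {Y, v4} for the Z example above
}

type stored struct {
	value   string
	version Version
}

type Shard struct {
	store    map[string]stored
	depCheck func(key string, v Version) // blocks until key is locally visible at >= v
}

// ApplyReplicated delays the write until its dependencies are visible,
// then applies it if it is newer than what is already stored.
func (s *Shard) ApplyReplicated(p ReplicatedPut) {
	for _, d := range p.Deps {
		s.depCheck(d.Key, d.Version) // may wait for updates that haven't arrived yet
	}
	if cur, ok := s.store[p.Key]; !ok || cur.version < p.Version {
		s.store[p.Key] = stored{p.Value, p.Version}
	}
}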
Okay, so for our photo example, this COPS scheme will actually solve it, and the reason is that the put we're talking about is the put for the photo list, and its dependency list is going to contain the insert of the photo. That means that when the put for the photo list arrives at the remote site, the remote shard server is essentially going to wait for the photo to be inserted and visible before it updates the photo list. So any client at a remote site that is able to see the updated photo list is guaranteed to be able to see the photo as well. This COPS scheme fixes the photo and photo-list example.

The scheme COPS is implementing is usually called causal consistency. There's a question: is it up to the programmer to specify the dependencies? No; it turns out that the context information accumulated here can be accumulated automatically by the COPS client library. The program only does gets and puts, and may not even need to see the version numbers: a simple program just does gets and puts, and internally the COPS library maintains these contexts and adds the extra information to the put RPCs, so the system automatically tracks the dependency information. That's very convenient.

Just to pop up a level for a moment: we've now built a system whose semantics are powerful enough to make the photo example code work out correctly, to have the expected result instead of anomalous results, and at least arguably it's reasonably efficient, because the client never has to wait for writes to complete, there's none of this sync business, and the communication is mostly independent, with no central log server. So arguably this has both reasonably high performance and reasonably good semantics, reasonably good consistency.

The consistency that this design produces is usually called causal consistency, and it's actually a much older idea than this paper; there were a bunch of causal consistency schemes before this paper, and indeed a bunch of follow-on work, so it's an enduring idea that people like a lot. As for what causal consistency means: here I'm putting up a copy of, I think, figure 2 from the paper. What the definition says is that the clients' actions induce dependencies, and there are two ways dependencies are induced. One is that if a given client does a put or a get and then a put, we say the later put depends on the previous put or get; so in this case the put of Y equal to 2 depends on the put of X equal to 1. That's one form of dependency. The other form is that if one client reads a value out of the storage system, we say that the get this second client issued depends on the corresponding put that actually inserted that value, from a previous client. Furthermore, we say the dependency relationship is transitive: this put depends on that get, this get by client 2 depends on the put by client 1,
and by transitivity we can conclude that client 2's get depends on client 1's earlier get as well. So that means the last put, by client 3 for example, depends on all of these previous operations. That's the definition of causal dependency.

A causally consistent system then says that if, by the definition of dependency I just outlined, B depends on A, and a client reads B, then the client must subsequently also be able to see A, the dependency. If a client ever sees the second of two operations ordered by dependency, the client is also, after that, guaranteed to be able to see everything that that operation depended on. That's the definition, and in a sense it's directly derived from what the system actually does.

This is very nice when updates are causally related. These clients are, in some sense, talking to each other indirectly through the storage system, and the clients are kind of aware of that: if somebody reads this value, sees 5, and inspects the code, they can conclude there's a sense in which this earlier put definitely must have come before the last put, and so if you see the last put you really deserve to see the first put too. In that sense causal consistency gives programmers a well-behaved view; it allows them to see well-behaved values coming out of the storage system.

Another thing that's good about causal consistency is that when two updates in the system are not causally related, the causally consistent system, the COPS storage system, has no obligation to maintain order between them, and that's good for performance. For example, if client 1 does a put of X and then a put of Z, and around the same time client 2 does a put of Y, there's no causal relationship between the put of Y and any of the actions of client 1, so COPS is allowed to do all the work associated with the put of Y completely independently of client 1's puts. The way that plays out is that the put of Y entirely involves the servers for the shard Y is in, while the other two puts only involve servers for the shards that X and Z are in. There may be some interaction, because the remote servers for Z may have to wait for the X put to arrive, but they never have to talk to the servers in charge of Y. So that's a sense in which causal consistency allows parallelism and good performance. And this is potentially different from linearizable systems: in a linearizable system, the fact that the put of Y came after the put of X in real time actually imposes some requirements on the storage system, but
So you might be able to build a causally consistent system that's faster than a linearizable system.

Okay, there's a question: would COPS gain any more information by keeping the gets in the client context after a put? This may be a reference to today's lecture question, so let me explain the answer. Suppose a client does a get of x, then a put to y, and then a put to z, and let's look at its context. Initially the context is {x: version something}, and when the client sends the put of y to the server it includes that context along with it. But in the actual system there's an optimization: after a put, the context is replaced by simply the version number returned for that put, and anything previously in the context — namely this information about x — is erased from the client's context. So after the put, the context is just the version number the put returned; say the server returns version 7 of y.

The reason this is correct and doesn't lose any information, for non-transactional COPS, is that when the put of y is sent out to the remote sites, it's accompanied by "x at version whatever" in its dependency list, so that put won't be applied at any data center until that version of x has also been applied there. Then, when the client does the next put, what gets sent to the other data centers is a put of z whose dependency list is just y version 7. The other data centers, before applying z, will check that y version 7 has been applied at their data center — and we know y version 7 won't be applied there until x at version whatever has been applied there. So there's a sort of cascade of delays: telling the other data centers to wait for y version 7 to be installed implies that they must also already be waiting for whatever y version 7 depended on. Because of that, we don't need to also include the version of x in z's dependency list; those data centers will already be waiting for that version of x. So the answer to the question is no: non-transactional COPS doesn't need to remember the gets in the context after it's done a put.
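Here is that optimization as a tiny worked trace, with made-up version numbers: after each put the context collapses to just that put's version, and transitivity at the remote sites makes up for the entries that get dropped.

    # Hypothetical trace of the context-collapse optimization.
    context = {}

    context["x"] = 3              # get(x) returned x at version 3

    deps_for_y = dict(context)    # put(y, ...) is sent with deps {"x": 3}
    context = {"y": 7}            # server returned y version 7; x is dropped,
                                  # because no site will expose y v7 before x v3

    deps_for_z = dict(context)    # put(z, ...) is sent with deps {"y": 7} only
    context = {"z": 9}            # x v3 never needs to be mentioned again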
All right, a final thing to note about this scheme is that COPS only sees certain relationships — it's only aware of certain causal relationships. COPS is aware that if a single client thread does a put and then another put, the second put depends on the first. Furthermore, COPS is aware that when a client reads a certain value, it's depending on the put that created that value, and therefore on anything that put depended on. So COPS is directly aware of those dependencies. However, it is often the case that causality in the larger sense is conveyed through channels that COPS is not aware of. For example, suppose client 1 does a put of x, and then the human controlling client 1 calls up client 2 on the telephone — or sends email, or something — and says, "Look, I just updated the database with some new information, why don't you go look at it," and then client 2 does a get of x. In the larger sense, causality would suggest that client 2 really ought to see the updated x, because client 2 knew from the telephone call that x had been updated. If COPS had known about the telephone call — if the call had itself been a put here and a get of that "telephone call" value there, and that get had seen that put — COPS would know enough to arrange that the get of x would see the put of x. But because COPS was totally unaware of the telephone call, there's no reason to expect that this get will actually yield the new value. So COPS is enforcing causal consistency, but only for the kinds of causation that COPS is directly aware of. That means the sense in which COPS's causal consistency eliminates anomalous behavior holds only if you restrict your notion of causality to what COPS can see. In the larger sense you're still going to see odd behavior: you're definitely going to see situations where someone believes a value has been updated and yet doesn't see the updated value, because their belief was caused by something COPS wasn't aware of.

All right, another potential problem — which I'm not going to talk about in depth — goes back to the photo and photo-list example. Remember that there was a particular order for adding the photo and adding it to the list, and a particular, different, order for reading the list and then the photo, that made the system work with causal consistency. We were definitely relying on the reader reading the photo list and then reading the photo, in that order, so that the fact that a photo is referred to in the photo list means the read of the photo will succeed. It is, however, the case that there are situations where no single order of reading, or combination of orders of reading and writing, will produce the behavior we want. This is leading into transactions, which I'm not going to have enough time to explain, but I at least want to mention the problem the paper sets up. Suppose our photo list is protected by an access control list, which is basically a list of usernames that are allowed to look at the photos on my list. That means the software that implements these photo lists with access control lists needs to read the photo list and also read the access control list, and check whether the user trying to do the read is in the access control list. However, neither order of getting the access control list and the list of photos works out.
If the client code first gets the access control list and then gets the list of photos, that order doesn't always work so well. Suppose my client reads the access control list and sees that I'm on it, but then, right at this point, the owner of the photo list deletes me from the access control list and inserts into the list a new photograph that I'm not supposed to see. So client 2 does a put of the access control list to delete me, and then a put of the photo list to add a photo I'm not allowed to see. When my client gets around to its second get, it may see the now-updated list that contains the photo I'm not allowed to see — but my client thinks, aha, I'm in the access control list (because it read an old one), and here's this photo, so I'm allowed to see it. In that case we're getting what we know to be an inconsistent combination of a new photo list and an old access control list, but causal consistency allows this: it only says that every time you do a get, you'll see data that's at least as new as the dependencies.

And indeed, as the paper points out, if you think it through, it's also not correct for the reading client to first read the list of photos and then read the access control list, because something can sneak in between: the list I read may have a photo I'm not allowed to see, and at that time maybe the access control list didn't include me, but before my second get the owner may delete the private photo and add me to the access control list, and then I'll see myself in the list. So in this order it's also not right, because we might get an old photo list and a new access control list.

So causal consistency, as I've described it so far, isn't powerful enough to deal with this situation. We need some notion of getting a mutually consistent photo list and access control list — either both from before some update or both from after it. COPS-GT actually provides a way of doing this, essentially by doing both gets together. COPS-GT sends the full set of dependencies back to the client when it does a get, and that means the client is in a position to check the dependencies of both returned values and notice that, say, the dependency list for the access control list mentions a version of the photo list that's farther ahead than the version of the list it actually got back — and in that case COPS-GT will re-fetch that data.
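Here is a rough sketch of that two-round get-transaction idea. The store interface used here (get_with_deps, get_by_version) and the return shapes are simplifications made up for this illustration, but the structure is the essential point: read every key once, then re-fetch any key that some other returned value depends on at a newer version than the one we got.

    def get_trans(store, keys):
        # Round 1: local reads, keeping each value's version and dependency list.
        results = {k: store.get_with_deps(k) for k in keys}  # k -> (value, version, deps)

        # For each requested key, find the newest version any other result depends on.
        needed = {}
        for _, _, deps in results.values():
            for dep_key, dep_version in deps.items():
                if dep_key in keys:
                    needed[dep_key] = max(needed.get(dep_key, 0), dep_version)

        # Round 2: re-fetch any key whose returned version is older than required.
        for k in keys:
            _, version, _ = results[k]
            if needed.get(k, 0) > version:
                results[k] = store.get_by_version(k, needed[k])

        return {k: v for k, (v, _, _) in results.items()}

In the access-control example, round 2 is exactly the re-fetch just described: whichever of the two returned values depends on a newer version of the other key than the one fetched, that other key is read again at the required version, so the client ends up with a mutually consistent pair.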
One last question: is it related to the thread of execution? Yes — causal consistency isn't really about wall-clock time; it has no notion of wall-clock time. The only form of ordering it obeys that's even a little bit related to wall-clock time is that if a single thread does one thing, and then another, and another, then causal consistency does consider those three operations to be in that order — but that's because one client thread did that sequence of things, not because there was a real-time relationship between them.

Just to wrap up and put this into a larger context: causal consistency has been a very promising research area for a long time, because it does seem like it might provide good-enough consistency while also giving you more opportunities than linearizability to get high performance. However, it hasn't actually gotten much traction in the real world. People use eventually consistent systems and they use strongly consistent systems, but it's very rare to see a deployed system that uses causal consistency, and there are a bunch of potential reasons for that — it's always hard to tell exactly why people do or don't use some technology in real-world systems. One reason is that it can be awkward to track per-client causality: in the real world, a user and their browser are likely to contact different web servers at different times, so it's not enough for a single web server to keep a user's context — you need some way to stitch together the context for a single user as they visit different web servers at the same website, and that's painful. Another is the problem that COPS only tracks the causal dependencies it knows about, so it doesn't provide ironclad causality, only certain kinds of causality, which limits how appealing it is. Another is that eventually consistent and causally consistent systems can provide only the most limited notion of transactions, and people, I think, increasingly wish their storage systems had transactions. Finally, the amount of overhead required to track, push around, and store all that dependency information can be quite significant. I wasn't able to really detect this in the performance section of the paper, but it is quite a lot of information that has to be stored and pushed around, and if you were hoping for the millions-of-operations-per-second level of performance that Facebook, at least, was getting out of memcached, the overhead you'd pay for causal consistency might be extremely significant. So those are reasons why causal consistency maybe hasn't caught on yet, although maybe someday it will.

Okay, that's all I have to say. Starting next lecture we'll be switching gears away from storage to a sequence of three lectures on blockchains. I'll see you on Thursday.