All right, hello everyone. Today I'm going to talk about this paper on how Facebook uses memcache in order to handle enormous load. The reason we're reading this paper is that it's an experience paper: there aren't really any new concepts or ideas or techniques here, but it's what a real, live company ran into when they were trying to build very high capacity infrastructure.

There are a couple of ways you could read it. One is as a cautionary tale about what goes wrong if you don't take consistency seriously from the start. Another way to read it is as an impressive story about how to get extremely high capacity using mostly off-the-shelf software. A third way to read it is as an illustration of the fundamental struggle that a lot of setups face, between trying to get very high performance, which you do with things like replication, and trying to provide consistency, for which techniques like replication are really the enemy. We can argue about whether we like their design, or whether we think it's elegant or a good solution, but we can't really argue with how successful they've been, so we do need to take them seriously. For me, this paper, which I first read quite a few years ago, has been a source of ideas and understanding about problems at many points.

All right, so before talking about Facebook proper: they're an example of a pattern you see fairly often, in which somebody is trying to build a website that does something. Typically, people who build websites are not interested in building high performance storage infrastructure; they're interested in building features that will make their users happy, or selling more advertisements, or something. So they're not going to start by spending a person-year of effort building cool infrastructure; they're going to start by building features, and they'll improve the infrastructure only to the extent they really have to, because that's the best use of their time.

So a typical starting scenario, when a website is very small, is that there's no point in starting with anything more than a single machine. Maybe you only have a couple of users sitting in front of their browsers, and they talk over the internet to your single machine. That machine runs, say, the Apache web server; you write the scripts that produce web pages in PHP or Python or some other convenient, easy-to-program scripting language (Facebook uses PHP); and you need to store your data somewhere, so you download a standard database. Facebook happened to use MySQL, which is a good choice: it implements the SQL query language, which is very powerful, it has ACID transactions, and it provides durable storage. This is a very nice setup, and it will actually take you a long way.
But suppose you're successful: you get more and more users, more and more load, more and more people viewing your website and running whatever PHP the site provides. At some point, almost certainly, the first thing to go wrong is that the PHP scripts take up too much CPU time; that's usually the first bottleneck people encounter if they start with a single server. So you need some way to get more horsepower for your PHP scripts, and that takes us to architecture number two for websites.

Now you have more users than before and you need more CPU power for the PHP, so you run a bunch of front-end servers whose only job is to run the web servers that users' browsers talk to. These front ends run the Apache web server and the PHP scripts. Users will talk to different servers at different times, and maybe your users message each other or need to see each other's posts, so all of these front-end servers need to see the same back-end data. To do that, at least for a while, you stick with a single database server: one machine running MySQL that handles all of the queries and updates, all the reads and writes, from the front ends. If you possibly can, it's wise to use a single database server here, because as soon as you spread your data over multiple database servers things get much more complicated: you have to worry about whether you need distributed transactions, and about how the PHP scripts decide which database server to talk to. Again, you can get a long way with this second architecture: you can have as much CPU power as you like by adding more front-end servers, and up to a point a single database server really can absorb the reads and writes of many front ends.

But maybe you're very successful and get even more users, so the question is what goes wrong next. Since you can always add more web servers for CPU, what inevitably goes wrong is that after a while the database server runs out of steam.

So what's the next architecture? This is web architecture three in the standard evolution of big websites. Now we have thousands and thousands of users, lots and lots of front ends, and we know we're going to need multiple database servers. Behind the front ends we have a whole rack of database servers, each running MySQL, but now we're driven to sharding the data over those database servers: maybe the first one holds keys A through G, the second holds keys G through Q, and so on, whatever the sharding happens to be.
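Just to make concrete the kind of logic that ends up in the front-end scripts, here's a little sketch of range-based shard selection. This is my own illustration, not code from the paper; the shard map, server names, and helper function are made up.

```python
# Hypothetical sketch of how a front end might pick a database shard by key
# range. The shard map and the server names are invented for illustration.

import bisect

# Each entry: (first key handled by this shard, shard address).
# Keys below "g" go to db1, "g" up to "q" go to db2, the rest to db3.
SHARD_MAP = [
    ("a", "db1.example.com"),
    ("g", "db2.example.com"),
    ("q", "db3.example.com"),
]

def shard_for(key: str) -> str:
    """Return the database server responsible for this key."""
    starts = [start for start, _ in SHARD_MAP]
    # Find the last shard whose starting key is <= the lookup key.
    i = bisect.bisect_right(starts, key) - 1
    return SHARD_MAP[max(i, 0)][1]

if __name__ == "__main__":
    print(shard_for("alice"))   # db1.example.com
    print(shard_for("henry"))   # db2.example.com
    print(shard_for("zoe"))     # db3.example.com
```

This is exactly the kind of code that has to change when you re-shard, which is part of the pain described next.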
And now the front end: you have to teach your PHP scripts to look at the data they need and figure out which database server to talk to; at different times, for different data, they'll talk to different servers. This is sharding, and the reason it gives you a boost is that all the work of reading and writing is now split up, hopefully evenly, between these servers. Since they hold different data, they can execute in parallel, and you get big parallel capacity to read and write data.

It's a little bit painful, though. The PHP code has to know about the sharding, and if you change the setup of the database servers, say you add a new database server or realize you need to split up the keys differently, you have to modify the software running on the front ends so they understand how to cut over to the new sharding. There's also pain around transactions, which many people use: if the data involved in a single transaction lives on more than one database server, you probably need two-phase commit or some other distributed transaction scheme, which is also painful and slow.

You can get fairly far with this arrangement, but it's quite expensive. MySQL, and fully featured database servers in general, are not particularly fast; a server might perform a couple hundred thousand reads per second and far fewer writes. Websites tend to be read-heavy, so it's likely you'll run out of steam for reads before writes: the load on the servers will be dominated by reads. You can slice the data more and more thinly over more and more servers, but two things go wrong with that. One is that sometimes you have specific keys that are hot, that are used a lot, and no amount of slicing helps there, because each key lives on only a single server; if a key is very popular, that server can be overloaded no matter how much you partition or shard the data. The other problem is that adding lots and lots of MySQL database servers for sharding turns out to be a really expensive way to go. After a point you start to think: instead of spending a lot of money to add another database server running MySQL, I could take the same server, run something much faster on it, as it happens memcached, and get a lot more reads per second out of the same hardware using caching than using databases.

So the next architecture, which is now starting to resemble what Facebook uses: we still have users, we still have a bunch of front-end servers running web servers and PHP, by now maybe a vast number of them, and we still have our database servers, because we need a system that will store data safely on disk for us and provide things like transactions.
But in between we're going to have a caching layer, and this is where memcached comes in. There are other things you could use besides memcache, but memcached happens to be an extremely popular caching system. The idea is that you have a whole bunch of memcached servers, and when a front end needs to read some data, the first thing it does is ask one of the memcached servers whether it has the data. It sends a get request with some key to one of the memcached servers, and that server checks its table. Memcached is in fact extremely simple, far simpler than your Lab 3, for example: it just keeps a big hash table in memory. It checks whether the key is in the hash table, and if it is, it sends back the data: here's the value I've cached for that key. If the front end hits in the memcached server, great, it can produce the web page with that data. If it misses, the front end has to re-issue the request to the relevant database server, and the database server replies with the data. At that point, in order to cache it for the next front end that needs it, the front end sends a put with the data it fetched from the database to that memcached server.

Because memcached runs at least ten, maybe more than ten, times faster for reads than the database for a given amount of hardware, it really pays off to use a fair amount of that hardware for memcache as well as for the database servers. People use this arrangement a lot, and it simply saves them money, because memcached is so much faster for reads than a database server. You still need to send writes to the database, because you want writes and updates stored durably, so the data is still there if there's a crash or something; but you can serve the reads from the cache very much more quickly.
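To make "it's just a big hash table in memory" concrete, here is a toy sketch of the core operations a memcached-like server supports. This is my own simplification for illustration, not memcached's real code, which also has eviction, expiration, and a network protocol.

```python
# Toy illustration of the heart of a memcached-like server: a hash table in
# RAM with get, set, and delete. Not the real implementation.

class TinyCache:
    def __init__(self):
        self._table = {}                # key -> value, held only in memory

    def get(self, key):
        return self._table.get(key)    # None means a miss

    def set(self, key, value):
        self._table[key] = value

    def delete(self, key):
        self._table.pop(key, None)     # fine if the key wasn't cached

if __name__ == "__main__":
    c = TinyCache()
    print(c.get("user:42"))            # None: a miss
    c.set("user:42", {"name": "alice"})
    print(c.get("user:42"))            # a hit
    c.delete("user:42")
    print(c.get("user:42"))            # a miss again
```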
OK, so we have a question: why wouldn't the memcache server fetch from the database on behalf of the front end, and cache the response before responding to the front end? That's a great question. You could imagine a caching layer where you send it a get, and if it misses, the memcache layer forwards the request to the database, the database responds, memcache adds the data to its tables, and then it responds to the front end. The reason it's not done that way is that memcache is a completely separate piece of software that doesn't know anything about databases; it isn't even necessarily used in conjunction with a database, although it often is, so we can't bake knowledge of the database into memcache. The deeper reason is that the front ends are often not storing one-for-one database records in memcache. Very frequently, what's going on is that the front end issues some requests to the database and then processes the results somewhat: maybe it takes a few steps toward turning them into HTML, or it collects together results from multiple queries on multiple rows in the database, and it caches that partially processed information in memcache, just to save the next reader from having to do the same processing. For that reason, memcache really does not understand the relationship between what the front ends would like to see cached and how you derive that data from the database; that knowledge lives only in the PHP code on the front end. So even though it might be an architecturally good idea, we can't have this integration, this direct contact between memcache and the database, although it might make the cache consistency story much more straightforward.

And this answers the next question, which is the difference between a lookaside cache and a look-through cache. The "lookaside" business is that the front end looks aside to the cache to see if the data is there, and if it's not, it makes its own arrangements for getting the data on a miss. A look-through cache would forward requests to the database directly and handle the responses itself. Part of the reason for the popularity of memcache is that it is a lookaside cache: it's completely neutral about whether there's a database, what's in the database, or the relationship between items in memcache and items in the database.

All right, so this is a very popular, very widely used arrangement. It's cost effective because memcache is so much faster than the database. It's a bit complex: every website that makes serious use of this faces the problem that, if you don't do something about it, the data stored in the caches will get out of sync with the data in the database. So everybody has to have a story for how, when you modify something in the database, you do something to memcache to take care of the fact that memcache may then be storing stale data that doesn't reflect the update. A lot of this paper is about what Facebook's story for that is, although other people have other plans.

This arrangement is also potentially a bit fragile. It allows you to scale up to far more users than you could have handled with databases alone, because memcache is so fast, but that means you end up with a system sustaining a load that's orders of magnitude higher than what the databases could handle. So if anything goes wrong, for example if one of your memcached servers were to fail, meaning the front ends now have to contact the database because they can't use the cache, you dramatically increase the load on the databases. Suppose memcache has a hit rate of 99 percent, or whatever it happens to be: memcache absorbs almost all the reads, and the database back end sees only a few percent of the total reads. Any failure in the caching layer can increase that few percent to maybe 50 percent of the reads, which is a huge, order-of-magnitude increase.
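To put numbers on that fragility argument, here is a back-of-the-envelope calculation. The figures are illustrative, not measurements from the paper.

```python
# Back-of-the-envelope: how much does database read load grow when the cache
# hit rate drops? Illustrative numbers only.

def db_reads_per_sec(total_reads_per_sec, cache_hit_rate):
    """Reads that miss the cache and fall through to the databases."""
    return total_reads_per_sec * (1.0 - cache_hit_rate)

total = 1_000_000                          # hypothetical total read rate
healthy = db_reads_per_sec(total, 0.99)    # 10,000 reads/sec reach the databases
degraded = db_reads_per_sec(total, 0.50)   # 500,000 reads/sec reach the databases
print(healthy, degraded, degraded / healthy)   # the databases see 50x more reads
```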
So, as Facebook found, once you've come to rely on this caching layer, you need to put in place pretty serious measures to make sure you never expose the database layer to anything like the full load the caching layer is seeing. You can see in the paper that Facebook has put quite a bit of thought into making sure the databases never see anything like the full load.

OK, so far this has been generic. Now I want to switch to the big picture of what Facebook describes in the paper for their overall architecture. Of course they have lots of users; every user has a friend list, status, posts, likes, and photos. Facebook is very much oriented toward showing data to users, and a super important aspect of that is that fresh data is not absolutely necessary in that setting. Suppose, due to caching, reads yield data that's a few seconds out of date, so you're showing your users data from a few seconds ago rather than the very latest data: users are extremely unlikely to notice, except in special cases. If I'm looking at a news feed of today's news and I see the news from a few minutes ago rather than the news from right now, it's no big deal; nobody's going to notice or complain. That's not always true for all data, but for a lot of the data they deal with, super up-to-date consistency in the sense of linearizability is not actually important. What is important is that you don't cache stale data indefinitely: what they can't do is, by mistake, show users data from yesterday, or last week, or even an hour ago; users really will start to notice that. So they don't care about second-by-second consistency, but they care a lot about not serving stale data from more than a little while ago. The other situation in which they need to provide consistency is when a user updates their own data, or really when a user updates almost any data and then reads that same data, knowing they just updated it. It's extremely confusing for a user to see stale data when they know they just changed it, so in that specific case the Facebook design is also careful to make sure that if a user changes data, that user will see the changed data.

OK, so Facebook has multiple data centers, which they call regions. I think at the time this paper was written they had two: the primary region was on the West Coast, in California, and the secondary region was on the East Coast. The two data centers look pretty similar: each has a set of database servers running MySQL, with the data sharded over those MySQL servers; a bunch of memcached servers, which as we'll see are actually arranged in independent clusters; and a bunch of front ends, again a separate arrangement in each data center.
There are a couple of reasons for this. One is that their customers are scattered all over the country, and it's nice, just for performance, that people on the East Coast can talk to a nearby data center and people on the West Coast can also talk to a nearby data center; it makes internet delays smaller.

Now, the data centers were not symmetric in role, but each of them held a complete copy of all the data; they did not shard the data across the data centers. The West Coast, I believe, was the primary and held the authoritative copy of the data, and the East Coast was a secondary. What that really means is that all writes had to be sent to the relevant database in the primary data center. They use a feature of MySQL, an asynchronous log replication scheme, to have each database in the primary region send every update to the corresponding database in the secondary region, so that, with a lag of maybe a few seconds, the secondary database servers have content identical to the primaries. Reads, though, are local: when these front ends need to find some data, they generally talk to memcache in their own data center, and if they miss in memcache, they read from the database in that same data center. Again, though, the databases are complete replicas; all the data is in both regions. That's the overall picture.

The next thing I want to talk about is a few details of what this lookaside caching actually looks like. There are really reads and writes, and this is just what's shown in Figure 2. For a read executing on a front end, if the data you're reading might be cached, the first thing the code on the front end does is make a get library call with the key of the data it wants, and get just generates an RPC to the relevant memcached server: the library routine hashes the key on the client to pick the memcached server and sends an RPC to it. The memcached server replies either "yes, here's your data" or nil, meaning "I don't have that data, it's not cached." If the value is nil, the front end issues whatever SQL query is required to fetch the data from the database, and then makes another RPC, a set, to the relevant memcached server to install the fetched data there. This is just the routine I talked through before; it's what lookaside caching does.

For a write, we have a key and a value we want to write, and the library routine on the front end sends the new data to the database. As I mentioned before, the key and value may be a little different: what's stored in the database is often in a somewhat different form from what's stored in memcache, but we'll imagine for now that they're the same. Once the database has the new data, the write library routine sends an RPC to the memcached server telling it to delete that key.
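Here's that lookaside read and write logic written out as a sketch. The cache object and the database callables are parameters so the sketch is self-contained and runnable; in the real system they'd be RPC and SQL layers, and this is the shape of the logic rather than Facebook's actual code.

```python
# A sketch of the lookaside read and write paths, in the shape of the paper's
# Figure 2. Cache and database helpers here are stand-ins, not real APIs.

def lookaside_read(key, cache, db_fetch):
    """cache has get/set/delete; db_fetch(key) reads the database."""
    v = cache.get(key)
    if v is None:                  # miss: fall through to the database
        v = db_fetch(key)
        cache.set(key, v)          # install it for the next reader
    return v

def lookaside_write(key, value, cache, db_store):
    db_store(key, value)           # durable write to the database first
    cache.delete(key)              # then invalidate; the next read re-fetches

if __name__ == "__main__":
    db, mc = {"x": 0}, {}                              # stand-in database and cache
    class C:                                           # dict-backed cache object
        def get(self, k): return mc.get(k)
        def set(self, k, v): mc[k] = v
        def delete(self, k): mc.pop(k, None)
    c = C()
    print(lookaside_read("x", c, db.get))              # miss -> 0, now cached
    lookaside_write("x", 1, c, db.__setitem__)         # write, then invalidate
    print(lookaside_read("x", c, db.get))              # miss again -> fresh 1
```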
So the writer is invalidating the key in memcache. What that means is that the next front end that tries to read that key from memcached will get nil back, because it's no longer cached, and will fetch the updated value from the database and install it into memcache. This is an invalidation; in particular, you could imagine a scheme that would send the new data to memcached at this point, but they don't do that, they delete it. And actually, in the context of Facebook's scheme, the real reason this delete is needed is so that writers see their own writes, because in their scheme the MySQL database servers also send deletes: when a front end writes something in the database, the database, using the mcsqueal mechanism the paper mentions, sends the relevant deletes to the memcached servers that might hold that key. So the database servers will invalidate stuff in memcache by and by; it may take them a while. But because it might take a while, the front ends also delete the key, so that a front end won't see a stale value for data it just updated.

OK, all of this background is pretty much how everybody uses memcached; there's nothing really special here yet. Now, the paper is, on the surface, all about solving consistency problems, and those are indeed important, but the reason they ran into those consistency problems is in large part that they set up a design with extremely high performance, because they had extremely high load. They were desperate for performance and kind of struggled along behind the performance improvements in order to retain a reasonable level of consistency. Because the performance came first for them, I'm actually going to talk about their performance architecture before talking about how they fix consistency.

OK, sorry, there have been a bunch of questions here that I haven't seen; let me take a peek. One question: does this mean that the replicated updates from the primary MySQL database to the secondary must also issue deletes? Yes. This is, I think, a reference to the earlier architecture slide, and the observation is right: when a front end sends a write to the database server, that server updates its data on disk and sends an invalidate, a delete, to whatever memcached server in the local region, the local data center, might have held the key that was just updated. The database server also sends a representation of the update to the corresponding database server in the other region, which applies the write to its data on disk and, using this MySQL log-reading apparatus, figures out which memcached server might hold the key that was just updated and sends a delete to that memcached server as well, so that if the key is cached, it is invalidated in both data centers.
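As a rough sketch of that database-side invalidation idea, in the spirit of mcsqueal: watch the writes that have committed and turn each into a delete aimed at the memcached server that could be caching the affected key. The log format and helper names here are invented for illustration; this is not how mcsqueal is actually implemented.

```python
# Rough sketch of database-side invalidation: committed writes become deletes
# sent to the responsible memcached servers. Log format and helpers are
# hypothetical.

def invalidate_from_commit_log(committed_writes, pick_memcache_server):
    """committed_writes: iterable of (operation, key) pairs, already durable.
    pick_memcache_server(key): the cache server responsible for that key."""
    for op, key in committed_writes:
        if op in ("INSERT", "UPDATE", "DELETE"):     # any write invalidates
            mc = pick_memcache_server(key)
            mc.delete(key)                           # harmless if the key isn't cached
```

Because these deletes can lag, the front end also deletes the key itself on a write, which is what gives read-your-own-writes.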
OK, another question: what would happen if we did the delete first and then sent the write to the database? This is with reference to the write code. If you do the delete first, you increase the chances of a problem: suppose you delete and then send the write to the database, and in between another client reads that same key. It will miss, fetch the old data from the database, and insert it into the cache; then your update lands in the database, leaving memcache, for a while at least, holding stale data. And then if the writing client reads the key again, it may see the stale data even though it just updated it. Doing the delete second leaves open the possibility that somebody reads during this window and sees stale data, but they're not worried about stale data in general; what they're really most worried about in this context is clients reading their own writes. So on balance, even though there's a consistency problem either way, doing the delete second ensures that clients see their own writes. In either case, eventually the database server, as I just mentioned, will send a delete for the written keys.

Another question: I'm confused about how writing the new value shows stale data but deleting doesn't. I'm not entirely sure what's being asked, but maybe the question is really: what if we didn't do the delete at all, so that when a front end wanted to update something it would just tell the database and not explicitly delete the data from memcache? The problem with that is, if the client sent the write to the database and then immediately read the same data, that read would come out of memcache, and memcache still has the old data because it hasn't seen the write yet. So a client that updated some data and then read it would see the stale data from memcache. If you do the delete, then a client that writes some data, deletes it from memcache, and then reads it again will miss in memcache because of the delete, go to the database, and the database will give it fresh data.

OK, so the next question is: why do we delete here at all? Why not, instead of the delete, have the client, since it knows the new data, just send a set RPC to memcached? This is a good question. What we're doing here is an invalidate scheme; the alternative would often be called an update scheme. Let me try to cook up an example showing that, while the update scheme could probably be made to work, it doesn't work out of the box; you'd need some careful design to make it work.
So suppose now we have two clients reading and writing the same key, interleaved. Say client one tells the database to increment x, taking it from zero to one, so it sets x to 1; really, for correctness, what client one sends would be some sort of increment transaction, since the database does support transactions. After that, client one calls set on memcached with key x and value 1. Meanwhile, client two also wants to increment x, so it increments the value to 2, sends that to the database, and then also does a set, setting x to 2 in memcached. But if client one's set is delayed and arrives after client two's, we're left with the value 1 in memcached even though the correct value in the database is 2. Which is to say, if we do the update with a set, even though it saves some time, because we save somebody a future miss by setting directly instead of deleting, we also run the risk, for popular data, of leaving stale data sitting in memcache. It's not that you couldn't get this to work somehow, but it does require careful thought to fix this problem. All right, so that's why they use invalidate instead of update.

OK, so I was going to talk about performance. The root of how they get performance is parallel execution, and for a storage system, at a high level, there are really two ways to use extra hardware to get good performance. One is partitioning, which is sharding: you take your data and split it into, say, ten pieces over ten servers, and those ten servers can hopefully run independently. The other way is replication: you have more than one copy of the data. For a given amount of hardware, you can choose whether to partition your data or replicate it.

For memcache, partitioning means splitting the data over the available memcached servers by hashing the key, so that every key lives on one memcached server. Replication, for memcache, would mean having each front end talk to just a single memcached server and send all its requests there, so that each memcached server serves only a subset of the front ends and serves all of their needs. Facebook actually uses a combination of both partitioning and replication.

In partitioning's favor: it's memory efficient, because you store only a single copy of each item.
With replication, by contrast, you're going to store every piece of data on maybe every server. Another point about partitioning is that, as long as your keys are roughly equally popular, it works pretty well; but if there are a few hot keys, partitioning doesn't really help, because each key is on only a single server. Once you've partitioned enough that the hot keys land on different servers, no further partitioning helps: a single hot key still sits on just one server, and that server can be overloaded. The other problem with partitioning is that if the front ends need lots of different keys, each front end ends up talking to lots of partitions, and at least if you use protocols like TCP that keep state, there's significant overhead to that sort of N-squared communication as you add more servers.

Replication is fantastic if your problem is that a few keys are popular, because now you're making replicas of those hot keys and you can serve each replica of the same key in parallel. It's also good because there's less of that N-squared communication; each front end maybe talks to only one memcached server. The bad thing is that there's a copy of the data on every server, so you can cache far fewer distinct data items with replication than with partitioning; less total data can be stored. These are the generic pros and cons of the two main ways of using extra hardware to get higher performance.

All right, so I want to talk about one context in which they use partitioning and replication: at the level of different regions. I want to talk through why they decided to have separate regions, each a kind of separate, complete data center with all of the data. Before I do that, there's a question: why can't we cache the same amount of data with replication? OK, suppose you have ten machines, each with a gigabyte of RAM, and you can use those ten machines either for replication or in a partitioning scheme. If you use a partitioning scheme, where each server stores different data from the other servers, you can store a total of ten gigabytes of distinct data objects across your ten servers: with partitioning, each byte of RAM is used for different data, so the total amount of RAM is how much distinct data you can store. With replication, assuming your users are more or less looking at the same stuff, each cache replica ends up storing roughly the same data as all the other caches. You still have ten gigabytes of RAM across your ten machines, but each machine stores roughly the same data, so you end up with ten copies of the same gigabyte of items. In this particular example, if you use replication, you're storing a tenth as many distinct data items.
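The ten-machine example as quick arithmetic, with illustrative numbers only:

```python
# Illustrative capacity arithmetic for 10 cache servers with 1 GB of RAM each.
servers, ram_gb_each = 10, 1

distinct_gb_partitioned = servers * ram_gb_each  # every byte holds different data: 10 GB
distinct_gb_replicated = ram_gb_each             # every server holds roughly the same 1 GB

servers_per_hot_key_partitioned = 1              # a hot key lives on exactly one server
servers_per_hot_key_replicated = servers         # every replica can serve the hot key

print(distinct_gb_partitioned, distinct_gb_replicated)                    # 10 1
print(servers_per_hot_key_partitioned, servers_per_hot_key_replicated)   # 1 10
```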
That may actually be a good idea, depending on what your data and workload look like, but it does mean that replication gives you less total data cached. There are points in the paper where they mention this tension; they don't really come down on one side or the other, because they use both replication and sharding.

OK, so the highest level at which they play this game is between regions. At this level, each region has a complete replica of all the data: each region has a complete set of database servers, with corresponding database servers for the same data, and assuming users are looking at more or less the same stuff, the memcached servers in the different regions are also storing more or less the same things. So here we have replication in both the database servers and the memcached servers. One point of this is that you want a complete copy of the site that's close, on the internet, to West Coast users, and another complete copy that's close to users on the East Coast. The internet is pretty fast, but coast to coast is 50 milliseconds or so, and if users have to wait through too many 50-millisecond intervals, they'll start to notice.

Another reason to replicate the data between the two regions is that the front ends, just to create a single web page for a user request, often need dozens or hundreds of distinct data items from the caches or the databases. So the latency at which a front end can fetch those hundreds of items from memcache is quite important, and it's extremely important for the front ends to read only from local memcached servers and local databases, so they can do the hundreds of queries they need for a web page very rapidly. If we had partitioned the data between the two regions, then, say, if I'm looking at my friends and some of my friends' data is on the East Coast and some on the West Coast, the front ends might have to make many requests, at 50 milliseconds each, to the other data center, and users would see that kind of latency and be very upset. So another reason to replicate is to keep the front ends always close to all the data they need. Of course this makes writes more expensive, because a front end in the secondary region has to send its writes all the way across the country; but reads are far, far more frequent than writes, so it's a good trade-off. Although the paper doesn't mention it, it's possible that another reason for complete replication between the two sites is that, if the primary site goes down, perhaps they could switch the whole operation to the secondary site; but I don't know whether they had that in mind.
OK, so between regions the story is basically one of replication between the two data centers. Now, within a data center, within a region, there's a single set of database servers, so at the database level the data is sharded and not replicated inside each region. At the memcached level, however, they use replication as well as sharding; they have this notion of clusters. A given region actually supports multiple clusters, each with its own front ends and memcached servers. Say there are two clusters in a region: each cluster has a bunch of front ends and a bunch of memcached servers, and the clusters are almost completely independent. A front end in cluster one sends all its reads to the memcached servers in that cluster, and on a miss it goes to the region's one set of database servers; similarly, each front end in the other cluster talks only to memcached servers in its own cluster.

So why have multiple clusters? Why not just a single cluster, a single set of front-end servers and a single set of memcached servers shared by all those front ends? One reason is that if you did that, and to scale up capacity you just kept adding more memcached servers and front ends to the same cluster, you wouldn't get any performance win for popular keys. The data on these memcached servers is a mix: most of it is maybe only used by a small number of users, but there's some stuff that lots and lots of users need to look at. By using replication as well as sharding, they get multiple copies of the very popular keys, one per cluster, and therefore they get parallel serving of those keys across the different clusters.

Another reason not to let an individual cluster grow too big is that all the data within a cluster is partitioned over all of that cluster's memcached servers, and any one front end is typically going to need data from probably every single memcached server eventually. So you get an N-squared communication pattern between the front ends and the memcached servers, and to the extent they're using TCP for that communication, it involves a lot of overhead, a lot of connection state for all the different TCP connections. They wanted to limit the growth of this N-squared set of TCP connections, and the way to do that is to make sure no one cluster gets too big.

Related to that is the incast congestion business they talk about: if a front end needs data from lots of memcached servers, it sends out the requests more or less all at the same time, which means it gets the responses from all the memcached servers it queried at more or less the same time. That may mean dozens or hundreds of packets arriving at that front end simultaneously, which, if you're not careful, will cause packet loss. That's incast congestion.
To limit how bad the incast problem is, they have a bunch of techniques they discuss, but one of them is simply not making the clusters too large, so that the number of memcached servers a given front end talks to, all of which might be contributing to the same incast, never gets too big. A final reason the paper mentions is that behind all this is a big network in the data center, and it's hard to build networks that are both fast, in bits per second, and able to connect lots and lots of different computers. By splitting the data center up into clusters and having most of the communication stay within each cluster, they only need a modest-size, fast network per cluster; they don't have to build a single network that can handle all of the traffic among all the computers of one giant cluster. So it limits how expensive the underlying network is.

On the other hand, of course, they're replicating the data across the clusters, and for items that aren't very popular and aren't really going to benefit from the performance win of having multiple copies, it's wasteful to sit on all that RAM. We're talking about hundreds or thousands of servers, so the amount of money spent on RAM for the memcached servers is no joke. So in addition to the pool of memcached servers inside each cluster, there's also a regional pool of memcached servers, shared by all the clusters in a region. They modify the software on the front ends so that a front end knows: aha, the data for this key is actually not used that often; instead of storing it on a memcached server in my own cluster, I'm going to store this not-very-popular key on the appropriate memcached server in the regional pool. The regional pool is just an admission that some data is not popular enough to be worth having lots of replicas of; they can save money by caching only a single copy. All right, so that's the kind of replication-versus-partitioning strategy they use inside each region.
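Here's a sketch of the routing decision a front end ends up making inside a region: popular keys go to this cluster's own memcached pool, so each cluster holds a replica, while keys marked as unpopular go to the shared regional pool so only one copy exists. The popularity flag, the md5-mod-N hashing, and the server names are simplified stand-ins; real clients use consistent hashing and the per-key configuration the paper describes.

```python
# Simplified sketch of per-key pool selection inside a region. Names and the
# popularity test are hypothetical.

import hashlib

CLUSTER_POOL = ["mc-c1-0", "mc-c1-1", "mc-c1-2"]   # this cluster's memcached servers
REGIONAL_POOL = ["mc-region-0", "mc-region-1"]     # shared by every cluster in the region

def pick_server(key, replicate_per_cluster):
    pool = CLUSTER_POOL if replicate_per_cluster else REGIONAL_POOL
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return pool[h % len(pool)]         # a consistent choice of one server in the pool

print(pick_server("feed:alice", replicate_per_cluster=True))        # popular: cluster copy
print(pick_server("archive:2009-03", replicate_per_cluster=False))  # unpopular: regional copy
```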
A difficulty they discuss is that when they want to create a new cluster in a data center, they have a temporary performance problem while getting that cluster going. Suppose they decide to install a couple hundred machines as a new cluster, with new front ends and new memcached servers, fire it up, and send, say, half the users to the new cluster while the rest keep using the old one. In the beginning there's nothing in the new cluster's memcached servers, so all of its front ends miss and have to go to the databases, and at least until those memcached servers get populated with the heavily used data, this increases the load on the database servers enormously. Before adding the new cluster, maybe the database servers saw only one percent of the reads, because the memcached servers had a hit rate of, say, 99 percent, so only one percent of reads reached the databases. If we add a new cluster with nothing in its memcached servers and send half the traffic to it, it initially gets a 100 percent miss rate, so the overall miss rate becomes about 50 percent: we've gone from the database servers serving one percent of the reads to serving 50 percent. So, at least in this imaginary example, by quietly firing up a new cluster we may increase the load on the databases by a factor of 50, and chances are the database servers were running reasonably close to capacity, certainly not a factor of 50 under it. That would be the absolute end of the world if they just fired up a new cluster like that.

So instead they have this cold start idea, in which a new cluster is marked, by some flag somewhere, as being in a cold start state. In that situation, when a front end in the new cluster misses, it first asks its own local memcached; if that says "no, I don't have the data," the front end then asks the corresponding memcached server in another cluster, some warm cluster that's already been running. If the data is popular, chances are it will be cached there, so the front end gets its data and installs it in the local memcached. Only if both the local memcached and the warm memcached lack the data does the front end in the new cluster read from the database servers. They run in this cold mode for a little while, a couple of hours I think the paper mentions, until the memcached servers in the new cluster start to hold all the popular data, and then they can turn off the cold-start feature and just use the local cluster's memcache alone.
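A sketch of that cold-cluster read path: miss locally, borrow from a warm cluster, and only then touch the database. The helper objects are the same kind of dict-backed caches as before, and cold_mode stands in for the flag that operators turn off once the new cluster has warmed up; this is my own illustration of the behavior the paper describes.

```python
# Sketch of the cold-cluster-warmup read path. Names are illustrative.

def cold_start_read(key, local_cache, warm_cache, db_fetch, cold_mode=True):
    v = local_cache.get(key)
    if v is not None:
        return v                      # hit in the new (cold) cluster
    if cold_mode:
        v = warm_cache.get(key)       # borrow from an already-warm cluster's memcached
    if v is None:
        v = db_fetch(key)             # only now fall through to the database
    local_cache.set(key, v)           # warm up the local cluster for the next reader
    return v
```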
All right, another load problem the paper talks about, again deriving from this lookaside caching strategy, is called the thundering herd. The scenario is that there's some piece of very popular data stored on one particular memcached server, and a whole bunch of front ends are ordinarily reading it, constantly sending get requests for it. The memcached server has it in its cache and answers them, and a memcached server can serve something like a million requests per second, so we're doing pretty well; there's some database server sitting in the back that has the real copy of the data, but we're not bothering it because the data is cached. Now suppose some front end comes along and modifies this very popular data: it sends a write to the database with the new data, and then it sends a delete to the memcached server, because that's how writes work. Now we've just deleted this extremely popular item, and all these front ends constantly sending gets for it will all miss at about the same time. Having missed, they'll all send a read request to the database at the same time, so the database is suddenly faced with maybe dozens or hundreds of simultaneous requests for this data, and the load there will be pretty high. It's particularly disappointing because we know all of these requests are for the same key, so the database does the same work over and over again, responding with the latest written copy of that key, until finally the front ends get around to installing the new value in memcache and people start hitting again. That's the thundering herd.

What we'd really like, when a write and its delete cause a miss in memcache, is for the first front end that misses to fetch the data and install it, and for the other front ends to just take a deep breath and wait until the new data is cached. That's exactly what their design does, with this thing they call leases, which are a bit different from the leases we're used to. Let's start the scenario from scratch: we have a popular piece of data, and the first front end that asks for it and misses gets a reply from memcached saying "no, I don't have that data in my cache," but memcached also creates a lease, which is a unique number. It picks a lease number, installs it in a table, and sends the lease token back to that front end. Other front ends that then come in and ask for the same key are simply asked to wait, for a quarter of a second or some other reasonable amount of time, because the memcached server sees that it has already issued a lease for that key and a value is potentially on the way. So only one of the front ends gets the lease. That front end asks the database for the data, and when the database responds, the front end sends a set for the new data with the key, the value it got, and the lease, to prove it was the one allowed to write the data. Memcached looks up the lease, sees that yes, you're the one whose lease was granted, and does the install. By and by, the other front ends that were told to wait re-issue their reads, and now the data is there. So if all goes well, we get just one request to the database instead of dozens or hundreds. And I think the sense in which this is a lease is that, if the front end fails at an awkward moment and doesn't actually request the data from the database, or doesn't get around to installing it in memcached, eventually memcached will delete the lease because it times out, and the next front end to ask will get a new lease and will, we hope, talk to the database and install the new data. So, to answer the question: yes, the lease does have a timeout in case the first front end fails.
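Here's a toy version of that lease handshake on the memcached side: the first miss for a key hands back a token, later misses for the same key are told to wait, and only a set that presents the outstanding token is accepted. Real memcached leases also expire after a timeout and interact with deletes; this sketch of mine ignores those details.

```python
# Toy sketch of leases for the thundering-herd problem; only the core
# handshake, no timeouts or delete interaction.

import itertools

class LeasingCache:
    def __init__(self):
        self._table = {}                    # key -> cached value
        self._leases = {}                   # key -> outstanding lease token
        self._next_token = itertools.count(1)

    def get(self, key):
        """Return (value, lease). On the first miss, lease is a token the caller
        should use to fill the cache; on later misses it is the string 'wait'."""
        if key in self._table:
            return self._table[key], None
        if key in self._leases:
            return None, "wait"             # someone else is already fetching this key
        token = next(self._next_token)
        self._leases[key] = token
        return None, token

    def set_with_lease(self, key, value, token):
        if self._leases.get(key) == token:  # only the lease holder may fill the entry
            self._table[key] = value
            del self._leases[key]
            return True
        return False                        # stale or unknown lease: the set is dropped
```

A front end that gets back "wait" just retries its get after a short sleep; only the front end holding the token goes to the database.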
OK, so these leases are their solution to the thundering herd problem. Another problem they have is what happens when one of these memcached servers fails. If they didn't do anything special, then when a memcached server fails, the front ends send it a request and get back a timeout; the network says it couldn't contact that host and never got a response. What the library software then does is send the request to the database. So if a memcached server fails and we do nothing special, the database is now exposed directly to all the reads that memcached server was serving, and that server may well have been serving a million reads per second. The database server might then be exposed to those million reads per second, and it's nowhere near fast enough to deal with them. Now, Facebook, though they don't really talk about it in the paper, does have automated machinery to replace a failed memcached server, but it takes a while to set up a new memcached server and redirect all the front ends to the new server instead of the old one. In the meantime they need a temporary solution, and that's this gutter idea.

The scoop is this: we have our front ends, the ordinary set of memcached servers, and the database. One of the memcached servers has failed, and we're waiting for the automatic replacement system to replace it; in the meantime, front ends sending requests to it get a "server did not respond" error from the network. There's a presumably small set of gutter servers whose only purpose in life is to sit idle except when a real memcached server fails. When a front end gets an error back saying the get couldn't contact the memcached server, it sends the same request to one of the gutter servers; the paper doesn't say, but I imagine the front end again hashes the key to choose which gutter server to talk to. If the gutter server has the value, great; otherwise the front end contacts the database server to read the value and then installs it in that gutter server, in case somebody else asks for the same data. So while this memcached server is down, the gutter servers basically handle its requests. There will be a miss, handled with leases to avoid a thundering herd, on each of the items that lived on the failed memcached server, so there will be some load on the database server, but hopefully the gutter servers quickly acquire all the heavily used data and provide good service. By and by the failed server is replaced, and the front ends will know to talk to the replacement server instead.
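A sketch of that gutter fallback in the front-end read path: if the normal memcached server for a key doesn't respond, re-hash the key into the small gutter pool; on a gutter miss, read the database and install the value in gutter with a short expiry. The exception type, the expiry value, the set signature, and the helper names are illustrative; the paper describes the behavior but doesn't give this code.

```python
# Sketch of the gutter fallback in the front-end read path. Names, the expiry,
# and the failure signal are stand-ins for the real client library's.

def read_with_gutter(key, primary_cache, pick_gutter, db_fetch):
    """primary_cache: the usual memcached server for this key (may be down).
    pick_gutter(key): chooses a server from the small gutter pool."""
    try:
        v = primary_cache.get(key)
        if v is None:
            v = db_fetch(key)
            primary_cache.set(key, v)
        return v
    except ConnectionError:                  # the normal server didn't respond
        gutter = pick_gutter(key)
        v = gutter.get(key)
        if v is None:
            v = db_fetch(key)
            gutter.set(key, v, expire=10)    # gutter entries expire quickly; no deletes are sent here
        return v
```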
65:07 They don't, and this is today's question: I think they don't send deletes to these gutter servers because a gutter server could have taken over for any one, and maybe more than one, of the ordinary memcached servers, so it could actually be caching any key. That would mean that whenever a front end needs to delete a key from memcached, or whenever mcsqueal on the database side sends a delete for a key to the relevant memcached server, the natural design would be to also send a copy of that delete to every one of the gutter servers; and the same for front ends that are deleting data: they would delete from the memcached servers but would also have to delete, potentially, from every gutter server. That would double the number of deletes that had to be sent around, even though most of the time the gutter servers aren't doing anything, don't cache anything, and it wouldn't matter. So in order to avoid all these extra deletes, they instead configure the gutter servers to expire keys very rapidly, rather than hanging on to them until they're explicitly deleted. That was the answer to the question.

66:34 All right, I want to talk a bit about consistency, all of this at a super high level. The consistency problem is that there are lots of copies of the data. For any given piece of data there's a copy in the primary database, a copy in the corresponding database server in each of the secondary regions, a copy of that key in one of the memcached servers in each local cluster, maybe copies in the gutter servers, and maybe copies in the memcached and gutter servers at each other region. So we have lots and lots of copies of every piece of data floating around. When a write comes in, something has to happen to all of those copies. Furthermore, the writes may come from multiple sources: the same key may be written at the same time by multiple front ends in this region, and maybe by front ends in other regions too. It's this combination of concurrency, multiple copies, and multiple sources of writes that creates a lot of opportunity not just for data to be stale, but for stale data to be left in the system for long periods of time.

67:58 So I want to illustrate one of those problems. In a sense we've already talked a bit about this, when somebody asked why the front ends delete instead of updating; that's certainly one instance of having multiple sources of writes and therefore trouble enforcing a correct order. But here's another example of a race, an update race, that if they hadn't done something about it would have left stale data in memcached indefinitely. It has a similar flavor to the previous example.

68:38 So suppose we have client C1, which wants to read a key, but memcached says it doesn't have the data.
68:51 That's a miss, so C1 is going to read the data from the database, and let's say it gets back some value, call it v1.

69:09 Meanwhile, client C2 wants to update this data, so it sends a write, k = v2, to the database. The rule for writes, in the write code we saw earlier, is that the next thing to do is delete the key from memcached, so C2 deletes the key from memcached. Now, C2 doesn't really know what's in memcached, but deleting whatever is there is always safe, because a delete is certainly not going to cause stale data to appear; this is the sense in which the paper claims that delete is idempotent: it's always safe to delete.

70:02 But if you recall the pseudocode for what a read does: if you miss and then read the data from the database, you're supposed to insert that data into memcached. So client C1, which may have been slow, finally gets around to sending a set RPC to memcached, but it read v1, which is now an old, outdated version of the data, and that's what it sets into memcached. And one other thing has happened: we know that whenever something is written, the database sends deletes to memcached, so maybe at this point the database has also sent a delete for k to memcached. So now there are two deletes, but it doesn't really matter: those deletes may already have happened by the time C1 gets around to doing its set.

70:59 At this point memcached will be caching a stale version of this data indefinitely. If the system worked in just this way, there's no mechanism for memcached to ever get the actual, correct value; it's going to store and serve up stale data for key k forever.
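To pin down the interleaving, here is a toy replay of that race under the naive look-aside rules (reader: miss, read the database, set; writer: write the database, delete), with the steps ordered exactly as in the scenario just described; the dicts simply stand in for memcached and MySQL.

```python
# Toy replay of the stale-set race under the naive look-aside rules.
cache = {}                  # stands in for the memcached server
database = {"k": "v1"}      # stands in for MySQL

# Step 1: client C1 misses in the cache and reads v1 from the database,
# but is slow to install it.
c1_value = database["k"]

# Step 2: client C2 updates the database and deletes the key from the cache
# (the database-side mcsqueal delete would land here too).
database["k"] = "v2"
cache.pop("k", None)

# Step 3: C1 finally gets around to installing what it read earlier.
cache["k"] = c1_value       # installs stale v1

# From now on every reader hits the cache and sees v1, even though the
# database says v2, and nothing will ever delete or correct it.
assert cache["k"] == "v1" and database["k"] == "v2"
```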
71:27 They ran into this, and while they're okay with data being somewhat out of date, they're not okay with data being out of date forever, because users will eventually notice that they're seeing ancient data. So they had to make sure this scenario couldn't happen, and they actually solved this problem with the lease mechanism too, the same lease mechanism we described for the thundering herd, although there's an extension to it that makes this work.

72:04 What happens is that when memcached sends back a miss indication, saying the data wasn't in the cache, it grants a lease. So we get the miss indication plus this lease, which is basically just a big unique number, and the memcached server remembers the association between this lease and this key; it knows that somebody out there holds a lease to update this key. The new rule is that when the memcached server gets a delete, from either another client or from the database server, then as well as deleting the item it invalidates this lease.

72:44 So as soon as either of those deletes comes in, assuming it arrives before C1's set, the memcached server deletes this lease from its table of leases. Now when the set, which carries the lease, arrives from the front end, the memcached server will look at the lease and say: wait a minute, you don't have a valid lease for this key, so I'm going to ignore this set. Because one of the deletes came in before the set, the lease has been invalidated, the memcached server ignores the set, and the key just stays missing from memcached. The next client that tries to read that key will get a miss, will read the now-fresh data from the database, and will install it in memcached, and presumably the second time around that reader's lease will be valid.

73:46 You may, and indeed you should, ask what happens if the order is different: suppose the deletes, instead of arriving before the set, arrive after the set; I want to make sure the scheme still works then. The way things play out is that, since the deletes are late and happen after the set, the memcached server won't yet have removed the lease from its table, so the lease will still be there when the set arrives, and yes, memcached will accept the set, and we will be setting the key to a stale value. But our assumption this time was that the deletes are late, which means they are yet to arrive, and when those deletes do arrive, the stale data will be knocked out of the cache. So the stale data will sit in the cache a little bit longer, but we won't have the situation where stale data sits in the cache indefinitely and is never deleted.

74:52 Any questions about the lease machinery?
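Extending the earlier lease sketch with this new rule, the fix might look roughly like the following; again a Python sketch under my own naming, with the thundering-herd wait path omitted to keep it short. A delete drops any outstanding lease, and a set is accepted only if its lease is still on the table.

```python
import random

# Sketch of the lease rule that prevents the stale set: a delete invalidates
# any outstanding lease, and a set is accepted only with a still-valid lease.
# Illustrative only; the real memcached lease code is structured differently.

cache = {}
leases = {}    # key -> lease token handed to the front end that missed

def get(key):
    if key in cache:
        return ('hit', cache[key])
    token = random.getrandbits(64)
    leases[key] = token
    return ('miss', token)

def delete(key):
    """Called by writing front ends and by mcsqueal when the database changes."""
    cache.pop(key, None)
    leases.pop(key, None)    # invalidate any lease granted before this delete

def set_with_lease(key, value, token):
    if leases.get(key) != token:
        return False         # a delete arrived first: drop the stale set
    cache[key] = value
    del leases[key]
    return True

# Replaying the race: C1 misses and gets a lease, C2's write causes a delete,
# then C1's late set is rejected, so the next reader re-reads the fresh value.
status, token = get("k")     # C1: miss, lease granted
delete("k")                  # C2 wrote the database; the delete kills the lease
assert set_with_lease("k", "v1-stale", token) is False
assert "k" not in cache      # key stays missing until someone reads fresh data
```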
75:07 Okay, to wrap up: it's certainly fair to view a lot of the complexity of this system as stemming from the fact that it was put together out of pieces that didn't know about each other. It would be nice, for example, if memcached knew about the database, if memcached and the database cooperated in a consistency scheme. And perhaps, if Facebook could have predicted at the very beginning how things would play out and what the problems would be, and if they had had enough engineers to work on it, they could have built from the start a system that provided everything they needed: high performance, multi-data-center replication, partitioning, all of it. There are companies that have done that. The example I know of that's most directly comparable to the system in this paper, and which you might want to look at if you care about this stuff, is Yahoo's PNUTS storage system, which was designed from scratch and differs in many details, but does provide multi-site replication with consistency and good performance. So it's possible to do better; all the same issues are present, but PNUTS just had a more integrated and perhaps more elegant set of solutions.

76:41 The takeaways for us from this paper: one is that, for them at least and for many big operations, caching is vital, absolutely vital, to surviving high load, and the caching is not so much about reducing latency as about hiding enormous load from relatively slow storage servers. That's what the cache is really doing for Facebook: concealing almost all of the load from the database servers. Another takeaway is that in big systems you always need to be thinking about partitioning versus replication: you need ways, either formal or informal, of deciding how much of your resources will be devoted to partitioning and how much to replication. And finally, ideally you'd do a better job than the system in this paper of integrating the different storage layers from the beginning, in order to achieve good consistency.

77:51 Okay, that is all I have to say. Please ask me questions if you have any.