All right, hello everyone. Today I'm going to talk about this paper on how Facebook uses memcache in order to handle enormous load. The reason we're reading this paper is that it's an experience paper: there aren't really any new concepts or ideas or techniques here, but it's what a real, live company ran into when they were trying to build very high capacity infrastructure.

There are a couple of ways you could read it. One is as a cautionary tale about what goes wrong if you don't take consistency seriously from the start. Another way to read it is as an impressive story about how to get extremely high capacity using mostly off-the-shelf software. A third way to read it is as an illustration of the fundamental struggle that a lot of setups face, between trying to get very high performance, which you do with things like replication, and trying to provide consistency, for which techniques like replication are really the enemy. We can argue about whether we like their design, or whether we think it's elegant or a good solution, but we can't really argue with how successful they've been, so we do need to take them seriously. For me, this paper, which I first read quite a few years ago, has been a source of ideas and understanding about problems at many points.

All right, so before talking about Facebook proper: they're an example of a pattern you see fairly often, in which somebody is trying to build a website that does something. Typically, people who build websites are not interested in building high performance storage infrastructure; they're interested in building features that will make their users happy, or selling more advertisements, or something. So they're not going to start by spending a person-year of effort building cool infrastructure; they're going to start by building features, and they'll improve the infrastructure only to the extent they really have to, because that's the best use of their time.

So a typical starting scenario, when a website is very small, is that there's no point in starting with anything more than a single machine. Maybe you only have a couple of users sitting in front of their browsers, and they talk over the internet to your single machine. That machine runs, say, the Apache web server; you write the scripts that produce web pages in PHP or Python or some other convenient, easy-to-program scripting language (Facebook uses PHP); and you need to store your data somewhere, so you download a standard database. Facebook happened to use MySQL, which is a good choice: it implements the SQL query language, which is very powerful, it has ACID transactions, and it provides durable storage. This is a very nice setup, and it will actually take you a long way.
But suppose you're successful: you get more and more users, more and more load, more and more people viewing your website and running whatever PHP the site provides. At some point, almost certainly, the first thing to go wrong is that the PHP scripts take up too much CPU time; that's usually the first bottleneck people encounter if they start with a single server. So you need some way to get more horsepower for your PHP scripts, and that takes us to architecture number two for websites.

Now you have more users than before and you need more CPU power for the PHP, so you run a bunch of front-end servers whose only job is to run the web servers that users' browsers talk to. These front ends run the Apache web server and the PHP scripts. Users will talk to different servers at different times, and maybe your users message each other or need to see each other's posts, so all of these front-end servers need to see the same back-end data. To do that, at least for a while, you stick with a single database server: one machine running MySQL that handles all of the queries and updates, all the reads and writes, from the front ends. If you possibly can, it's wise to use a single database server here, because as soon as you spread your data over multiple database servers things get much more complicated: you have to worry about whether you need distributed transactions, and about how the PHP scripts decide which database server to talk to. Again, you can get a long way with this second architecture: you can have as much CPU power as you like by adding more front-end servers, and up to a point a single database server really can absorb the reads and writes of many front ends.

But maybe you're very successful and get even more users, so the question is what goes wrong next. Since you can always add more web servers for CPU, what inevitably goes wrong is that after a while the database server runs out of steam.

So what's the next architecture? This is web architecture three in the standard evolution of big websites. Now we have thousands and thousands of users, lots and lots of front ends, and we know we're going to need multiple database servers. Behind the front ends we have a whole rack of database servers, each running MySQL, but now we're driven to sharding the data over those database servers: maybe the first one holds keys A through G, the second holds keys G through Q, and so on, whatever the sharding happens to be.
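Just to make concrete the kind of logic that ends up in the front-end scripts, here's a little sketch of range-based shard selection. This is my own illustration, not code from the paper; the shard map, server names, and helper function are made up.

```python
# Hypothetical sketch of how a front end might pick a database shard by key
# range. The shard map and the server names are invented for illustration.

import bisect

# Each entry: (first key handled by this shard, shard address).
# Keys below "g" go to db1, "g" up to "q" go to db2, the rest to db3.
SHARD_MAP = [
    ("a", "db1.example.com"),
    ("g", "db2.example.com"),
    ("q", "db3.example.com"),
]

def shard_for(key: str) -> str:
    """Return the database server responsible for this key."""
    starts = [start for start, _ in SHARD_MAP]
    # Find the last shard whose starting key is <= the lookup key.
    i = bisect.bisect_right(starts, key) - 1
    return SHARD_MAP[max(i, 0)][1]

if __name__ == "__main__":
    print(shard_for("alice"))   # db1.example.com
    print(shard_for("henry"))   # db2.example.com
    print(shard_for("zoe"))     # db3.example.com
```

This is exactly the kind of code that has to change when you re-shard, which is part of the pain described next.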
And now the front end: you have to teach your PHP scripts to look at the data they need and figure out which database server to talk to; at different times, for different data, they'll talk to different servers. This is sharding, and the reason it gives you a boost is that all the work of reading and writing is now split up, hopefully evenly, between these servers. Since they hold different data, they can execute in parallel, and you get big parallel capacity to read and write data.

It's a little bit painful, though. The PHP code has to know about the sharding, and if you change the setup of the database servers, say you add a new database server or realize you need to split up the keys differently, you have to modify the software running on the front ends so they understand how to cut over to the new sharding. There's also pain around transactions, which many people use: if the data involved in a single transaction lives on more than one database server, you probably need two-phase commit or some other distributed transaction scheme, which is also painful and slow.

You can get fairly far with this arrangement, but it's quite expensive. MySQL, and fully featured database servers in general, are not particularly fast; a server might perform a couple hundred thousand reads per second and far fewer writes. Websites tend to be read-heavy, so it's likely you'll run out of steam for reads before writes: the load on the servers will be dominated by reads. You can slice the data more and more thinly over more and more servers, but two things go wrong with that. One is that sometimes you have specific keys that are hot, that are used a lot, and no amount of slicing helps there, because each key lives on only a single server; if a key is very popular, that server can be overloaded no matter how much you partition or shard the data. The other problem is that adding lots and lots of MySQL database servers for sharding turns out to be a really expensive way to go. After a point you start to think: instead of spending a lot of money to add another database server running MySQL, I could take the same server, run something much faster on it, as it happens memcached, and get a lot more reads per second out of the same hardware using caching than using databases.

So the next architecture, which is now starting to resemble what Facebook uses: we still have users, we still have a bunch of front-end servers running web servers and PHP, by now maybe a vast number of them, and we still have our database servers, because we need a system that will store data safely on disk for us and provide things like transactions.
But in between we're going to have a caching layer, and this is where memcached comes in. There are other things you could use besides memcache, but memcached happens to be an extremely popular caching system. The idea is that you have a whole bunch of memcached servers, and when a front end needs to read some data, the first thing it does is ask one of the memcached servers whether it has the data. It sends a get request with some key to one of the memcached servers, and that server checks its table. Memcached is in fact extremely simple, far simpler than your Lab 3, for example: it just keeps a big hash table in memory. It checks whether the key is in the hash table, and if it is, it sends back the data: here's the value I've cached for that key. If the front end hits in the memcached server, great, it can produce the web page with that data. If it misses, the front end has to re-issue the request to the relevant database server, and the database server replies with the data. At that point, in order to cache it for the next front end that needs it, the front end sends a put with the data it fetched from the database to that memcached server.

Because memcached runs at least ten, maybe more than ten, times faster for reads than the database for a given amount of hardware, it really pays off to use a fair amount of that hardware for memcache as well as for the database servers. People use this arrangement a lot, and it simply saves them money, because memcached is so much faster for reads than a database server. You still need to send writes to the database, because you want writes and updates stored durably, so the data is still there if there's a crash or something; but you can serve the reads from the cache very much more quickly.
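To make "it's just a big hash table in memory" concrete, here is a toy sketch of the core operations a memcached-like server supports. This is my own simplification for illustration, not memcached's real code, which also has eviction, expiration, and a network protocol.

```python
# Toy illustration of the heart of a memcached-like server: a hash table in
# RAM with get, set, and delete. Not the real implementation.

class TinyCache:
    def __init__(self):
        self._table = {}                # key -> value, held only in memory

    def get(self, key):
        return self._table.get(key)    # None means a miss

    def set(self, key, value):
        self._table[key] = value

    def delete(self, key):
        self._table.pop(key, None)     # fine if the key wasn't cached

if __name__ == "__main__":
    c = TinyCache()
    print(c.get("user:42"))            # None: a miss
    c.set("user:42", {"name": "alice"})
    print(c.get("user:42"))            # a hit
    c.delete("user:42")
    print(c.get("user:42"))            # a miss again
```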
OK, so we have a question: why wouldn't the memcache server fetch from the database on behalf of the front end, and cache the response before responding to the front end? That's a great question. You could imagine a caching layer where you send it a get, and if it misses, the memcache layer forwards the request to the database, the database responds, memcache adds the data to its tables, and then it responds to the front end. The reason it's not done that way is that memcache is a completely separate piece of software that doesn't know anything about databases; it isn't even necessarily used in conjunction with a database, although it often is, so we can't bake knowledge of the database into memcache. The deeper reason is that the front ends are often not storing one-for-one database records in memcache. Very frequently, what's going on is that the front end issues some requests to the database and then processes the results somewhat: maybe it takes a few steps toward turning them into HTML, or it collects together results from multiple queries on multiple rows in the database, and it caches that partially processed information in memcache, just to save the next reader from having to do the same processing. For that reason, memcache really does not understand the relationship between what the front ends would like to see cached and how you derive that data from the database; that knowledge lives only in the PHP code on the front end. So even though it might be an architecturally good idea, we can't have this integration, this direct contact between memcache and the database, although it might make the cache consistency story much more straightforward.

And this answers the next question, which is the difference between a lookaside cache and a look-through cache. The "lookaside" business is that the front end looks aside to the cache to see if the data is there, and if it's not, it makes its own arrangements for getting the data on a miss. A look-through cache would forward requests to the database directly and handle the responses itself. Part of the reason for the popularity of memcache is that it is a lookaside cache: it's completely neutral about whether there's a database, what's in the database, or the relationship between items in memcache and items in the database.

All right, so this is a very popular, very widely used arrangement. It's cost effective because memcache is so much faster than the database. It's a bit complex: every website that makes serious use of this faces the problem that, if you don't do something about it, the data stored in the caches will get out of sync with the data in the database. So everybody has to have a story for how, when you modify something in the database, you do something to memcache to take care of the fact that memcache may then be storing stale data that doesn't reflect the update. A lot of this paper is about what Facebook's story for that is, although other people have other plans.

This arrangement is also potentially a bit fragile. It allows you to scale up to far more users than you could have handled with databases alone, because memcache is so fast, but that means you end up with a system sustaining a load that's orders of magnitude higher than what the databases could handle. So if anything goes wrong, for example if one of your memcached servers were to fail, meaning the front ends now have to contact the database because they can't use the cache, you dramatically increase the load on the databases. Suppose memcache has a hit rate of 99 percent, or whatever it happens to be: memcache absorbs almost all the reads, and the database back end sees only a few percent of the total reads. Any failure in the caching layer can increase that few percent to maybe 50 percent of the reads, which is a huge, order-of-magnitude increase.
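To put numbers on that fragility argument, here is a back-of-the-envelope calculation. The figures are illustrative, not measurements from the paper.

```python
# Back-of-the-envelope: how much does database read load grow when the cache
# hit rate drops? Illustrative numbers only.

def db_reads_per_sec(total_reads_per_sec, cache_hit_rate):
    """Reads that miss the cache and fall through to the databases."""
    return total_reads_per_sec * (1.0 - cache_hit_rate)

total = 1_000_000                          # hypothetical total read rate
healthy = db_reads_per_sec(total, 0.99)    # 10,000 reads/sec reach the databases
degraded = db_reads_per_sec(total, 0.50)   # 500,000 reads/sec reach the databases
print(healthy, degraded, degraded / healthy)   # the databases see 50x more reads
```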
So, as Facebook found, once you've come to rely on this caching layer, you need to put in place pretty serious measures to make sure you never expose the database layer to anything like the full load the caching layer is seeing. You can see in the paper that Facebook has put quite a bit of thought into making sure the databases never see anything like the full load.

OK, so far this has been generic. Now I want to switch to the big picture of what Facebook describes in the paper for their overall architecture. Of course they have lots of users; every user has a friend list, status, posts, likes, and photos. Facebook is very much oriented toward showing data to users, and a super important aspect of that is that fresh data is not absolutely necessary in that setting. Suppose, due to caching, reads yield data that's a few seconds out of date, so you're showing your users data from a few seconds ago rather than the very latest data: users are extremely unlikely to notice, except in special cases. If I'm looking at a news feed of today's news and I see the news from a few minutes ago rather than the news from right now, it's no big deal; nobody's going to notice or complain. That's not always true for all data, but for a lot of the data they deal with, super up-to-date consistency in the sense of linearizability is not actually important. What is important is that you don't cache stale data indefinitely: what they can't do is, by mistake, show users data from yesterday, or last week, or even an hour ago; users really will start to notice that. So they don't care about second-by-second consistency, but they care a lot about not serving stale data from more than a little while ago. The other situation in which they need to provide consistency is when a user updates their own data, or really when a user updates almost any data and then reads that same data, knowing they just updated it. It's extremely confusing for a user to see stale data when they know they just changed it, so in that specific case the Facebook design is also careful to make sure that if a user changes data, that user will see the changed data.

OK, so Facebook has multiple data centers, which they call regions. I think at the time this paper was written they had two: the primary region was on the West Coast, in California, and the secondary region was on the East Coast. The two data centers look pretty similar: each has a set of database servers running MySQL, with the data sharded over those MySQL servers; a bunch of memcached servers, which as we'll see are actually arranged in independent clusters; and a bunch of front ends, again a separate arrangement in each data center.
There are a couple of reasons for this. One is that their customers are scattered all over the country, and it's nice, just for performance, that people on the East Coast can talk to a nearby data center and people on the West Coast can also talk to a nearby data center; it makes internet delays smaller.

Now, the data centers were not symmetric in role, but each of them held a complete copy of all the data; they did not shard the data across the data centers. The West Coast, I believe, was the primary and held the authoritative copy of the data, and the East Coast was a secondary. What that really means is that all writes had to be sent to the relevant database in the primary data center. They use a feature of MySQL, an asynchronous log replication scheme, to have each database in the primary region send every update to the corresponding database in the secondary region, so that, with a lag of maybe a few seconds, the secondary database servers have content identical to the primaries. Reads, though, are local: when these front ends need to find some data, they generally talk to memcache in their own data center, and if they miss in memcache, they read from the database in that same data center. Again, though, the databases are complete replicas; all the data is in both regions. That's the overall picture.

The next thing I want to talk about is a few details of what this lookaside caching actually looks like. There are really reads and writes, and this is just what's shown in Figure 2. For a read executing on a front end, if the data you're reading might be cached, the first thing the code on the front end does is make a get library call with the key of the data it wants, and get just generates an RPC to the relevant memcached server: the library routine hashes the key on the client to pick the memcached server and sends an RPC to it. The memcached server replies either "yes, here's your data" or nil, meaning "I don't have that data, it's not cached." If the value is nil, the front end issues whatever SQL query is required to fetch the data from the database, and then makes another RPC, a set, to the relevant memcached server to install the fetched data there. This is just the routine I talked through before; it's what lookaside caching does.

For a write, we have a key and a value we want to write, and the library routine on the front end sends the new data to the database. As I mentioned before, the key and value may be a little different: what's stored in the database is often in a somewhat different form from what's stored in memcache, but we'll imagine for now that they're the same. Once the database has the new data, the write library routine sends an RPC to the memcached server telling it to delete that key.
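Here's that lookaside read and write logic written out as a sketch. The cache object and the database callables are parameters so the sketch is self-contained and runnable; in the real system they'd be RPC and SQL layers, and this is the shape of the logic rather than Facebook's actual code.

```python
# A sketch of the lookaside read and write paths, in the shape of the paper's
# Figure 2. Cache and database helpers here are stand-ins, not real APIs.

def lookaside_read(key, cache, db_fetch):
    """cache has get/set/delete; db_fetch(key) reads the database."""
    v = cache.get(key)
    if v is None:                  # miss: fall through to the database
        v = db_fetch(key)
        cache.set(key, v)          # install it for the next reader
    return v

def lookaside_write(key, value, cache, db_store):
    db_store(key, value)           # durable write to the database first
    cache.delete(key)              # then invalidate; the next read re-fetches

if __name__ == "__main__":
    db, mc = {"x": 0}, {}                              # stand-in database and cache
    class C:                                           # dict-backed cache object
        def get(self, k): return mc.get(k)
        def set(self, k, v): mc[k] = v
        def delete(self, k): mc.pop(k, None)
    c = C()
    print(lookaside_read("x", c, db.get))              # miss -> 0, now cached
    lookaside_write("x", 1, c, db.__setitem__)         # write, then invalidate
    print(lookaside_read("x", c, db.get))              # miss again -> fresh 1
```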
So the writer is invalidating the key in memcache. What that means is that the next front end that tries to read that key from memcached will get nil back, because it's no longer cached, and will fetch the updated value from the database and install it into memcache. This is an invalidation; in particular, you could imagine a scheme that would send the new data to memcached at this point, but they don't do that, they delete it. And actually, in the context of Facebook's scheme, the real reason this delete is needed is so that writers see their own writes, because in their scheme the MySQL database servers also send deletes: when a front end writes something in the database, the database, using the mcsqueal mechanism the paper mentions, sends the relevant deletes to the memcached servers that might hold that key. So the database servers will invalidate stuff in memcache by and by; it may take them a while. But because it might take a while, the front ends also delete the key, so that a front end won't see a stale value for data it just updated.

OK, all of this background is pretty much how everybody uses memcached; there's nothing really special here yet. Now, the paper is, on the surface, all about solving consistency problems, and those are indeed important, but the reason they ran into those consistency problems is in large part that they set up a design with extremely high performance, because they had extremely high load. They were desperate for performance and kind of struggled along behind the performance improvements in order to retain a reasonable level of consistency. Because the performance came first for them, I'm actually going to talk about their performance architecture before talking about how they fix consistency.

OK, sorry, there have been a bunch of questions here that I haven't seen; let me take a peek. One question: does this mean that the replicated updates from the primary MySQL database to the secondary must also issue deletes? Yes. This is, I think, a reference to the earlier architecture slide, and the observation is right: when a front end sends a write to the database server, that server updates its data on disk and sends an invalidate, a delete, to whatever memcached server in the local region, the local data center, might have held the key that was just updated. The database server also sends a representation of the update to the corresponding database server in the other region, which applies the write to its data on disk and, using this MySQL log-reading apparatus, figures out which memcached server might hold the key that was just updated and sends a delete to that memcached server as well, so that if the key is cached, it is invalidated in both data centers.
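As a rough sketch of that database-side invalidation idea, in the spirit of mcsqueal: watch the writes that have committed and turn each into a delete aimed at the memcached server that could be caching the affected key. The log format and helper names here are invented for illustration; this is not how mcsqueal is actually implemented.

```python
# Rough sketch of database-side invalidation: committed writes become deletes
# sent to the responsible memcached servers. Log format and helpers are
# hypothetical.

def invalidate_from_commit_log(committed_writes, pick_memcache_server):
    """committed_writes: iterable of (operation, key) pairs, already durable.
    pick_memcache_server(key): the cache server responsible for that key."""
    for op, key in committed_writes:
        if op in ("INSERT", "UPDATE", "DELETE"):     # any write invalidates
            mc = pick_memcache_server(key)
            mc.delete(key)                           # harmless if the key isn't cached
```

Because these deletes can lag, the front end also deletes the key itself on a write, which is what gives read-your-own-writes.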
OK, another question: what would happen if we did the delete first and then sent the write to the database? This is with reference to the write code. If you do the delete first, you increase the chances of a problem: suppose you delete and then send the write to the database, and in between another client reads that same key. It will miss, fetch the old data from the database, and insert it into the cache; then your update lands in the database, leaving memcache, for a while at least, holding stale data. And then if the writing client reads the key again, it may see the stale data even though it just updated it. Doing the delete second leaves open the possibility that somebody reads during this window and sees stale data, but they're not worried about stale data in general; what they're really most worried about in this context is clients reading their own writes. So on balance, even though there's a consistency problem either way, doing the delete second ensures that clients see their own writes. In either case, eventually the database server, as I just mentioned, will send a delete for the written keys.

Another question: I'm confused about how writing the new value shows stale data but deleting doesn't. I'm not entirely sure what's being asked, but maybe the question is really: what if we didn't do the delete at all, so that when a front end wanted to update something it would just tell the database and not explicitly delete the data from memcache? The problem with that is, if the client sent the write to the database and then immediately read the same data, that read would come out of memcache, and memcache still has the old data because it hasn't seen the write yet. So a client that updated some data and then read it would see the stale data from memcache. If you do the delete, then a client that writes some data, deletes it from memcache, and then reads it again will miss in memcache because of the delete, go to the database, and the database will give it fresh data.

OK, so the next question is: why do we delete here at all? Why not, instead of the delete, have the client, since it knows the new data, just send a set RPC to memcached? This is a good question. What we're doing here is an invalidate scheme; the alternative would often be called an update scheme. Let me try to cook up an example showing that, while the update scheme could probably be made to work, it doesn't work out of the box; you'd need some careful design to make it work.
So suppose now we have two clients reading and writing the same key, interleaved. Say client one tells the database to increment x, taking it from zero to one, so it sets x to 1; really, for correctness, what client one sends would be some sort of increment transaction, since the database does support transactions. After that, client one calls set on memcached with key x and value 1. Meanwhile, client two also wants to increment x, so it increments the value to 2, sends that to the database, and then also does a set, setting x to 2 in memcached. But if client one's set is delayed and arrives after client two's, we're left with the value 1 in memcached even though the correct value in the database is 2. Which is to say, if we do the update with a set, even though it saves some time, because we save somebody a future miss by setting directly instead of deleting, we also run the risk, for popular data, of leaving stale data sitting in memcache. It's not that you couldn't get this to work somehow, but it does require careful thought to fix this problem. All right, so that's why they use invalidate instead of update.

OK, so I was going to talk about performance. The root of how they get performance is parallel execution, and for a storage system, at a high level, there are really two ways to use extra hardware to get good performance. One is partitioning, which is sharding: you take your data and split it into, say, ten pieces over ten servers, and those ten servers can hopefully run independently. The other way is replication: you have more than one copy of the data. For a given amount of hardware, you can choose whether to partition your data or replicate it.

For memcache, partitioning means splitting the data over the available memcached servers by hashing the key, so that every key lives on one memcached server. Replication, for memcache, would mean having each front end talk to just a single memcached server and send all its requests there, so that each memcached server serves only a subset of the front ends and serves all of their needs. Facebook actually uses a combination of both partitioning and replication.

In partitioning's favor: it's memory efficient, because you store only a single copy of each item.
With replication, by contrast, you're going to store every piece of data on maybe every server. Another point about partitioning is that, as long as your keys are roughly equally popular, it works pretty well; but if there are a few hot keys, partitioning doesn't really help, because each key is on only a single server. Once you've partitioned enough that the hot keys land on different servers, no further partitioning helps: a single hot key still sits on just one server, and that server can be overloaded. The other problem with partitioning is that if the front ends need lots of different keys, each front end ends up talking to lots of partitions, and at least if you use protocols like TCP that keep state, there's significant overhead to that sort of N-squared communication as you add more servers.

Replication is fantastic if your problem is that a few keys are popular, because now you're making replicas of those hot keys and you can serve each replica of the same key in parallel. It's also good because there's less of that N-squared communication; each front end maybe talks to only one memcached server. The bad thing is that there's a copy of the data on every server, so you can cache far fewer distinct data items with replication than with partitioning; less total data can be stored. These are the generic pros and cons of the two main ways of using extra hardware to get higher performance.

All right, so I want to talk about one context in which they use partitioning and replication: at the level of different regions. I want to talk through why they decided to have separate regions, each a kind of separate, complete data center with all of the data. Before I do that, there's a question: why can't we cache the same amount of data with replication? OK, suppose you have ten machines, each with a gigabyte of RAM, and you can use those ten machines either for replication or in a partitioning scheme. If you use a partitioning scheme, where each server stores different data from the other servers, you can store a total of ten gigabytes of distinct data objects across your ten servers: with partitioning, each byte of RAM is used for different data, so the total amount of RAM is how much distinct data you can store. With replication, assuming your users are more or less looking at the same stuff, each cache replica ends up storing roughly the same data as all the other caches. You still have ten gigabytes of RAM across your ten machines, but each machine stores roughly the same data, so you end up with ten copies of the same gigabyte of items. In this particular example, if you use replication, you're storing a tenth as many distinct data items.
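The ten-machine example as quick arithmetic, with illustrative numbers only:

```python
# Illustrative capacity arithmetic for 10 cache servers with 1 GB of RAM each.
servers, ram_gb_each = 10, 1

distinct_gb_partitioned = servers * ram_gb_each  # every byte holds different data: 10 GB
distinct_gb_replicated = ram_gb_each             # every server holds roughly the same 1 GB

servers_per_hot_key_partitioned = 1              # a hot key lives on exactly one server
servers_per_hot_key_replicated = servers         # every replica can serve the hot key

print(distinct_gb_partitioned, distinct_gb_replicated)                    # 10 1
print(servers_per_hot_key_partitioned, servers_per_hot_key_replicated)   # 1 10
```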
That may actually be a good idea, depending on what your data and workload look like, but it does mean that replication gives you less total data cached. There are points in the paper where they mention this tension; they don't really come down on one side or the other, because they use both replication and sharding.

OK, so the highest level at which they play this game is between regions. At this level, each region has a complete replica of all the data: each region has a complete set of database servers, with corresponding database servers for the same data, and assuming users are looking at more or less the same stuff, the memcached servers in the different regions are also storing more or less the same things. So here we have replication in both the database servers and the memcached servers. One point of this is that you want a complete copy of the site that's close, on the internet, to West Coast users, and another complete copy that's close to users on the East Coast. The internet is pretty fast, but coast to coast is 50 milliseconds or so, and if users have to wait through too many 50-millisecond intervals, they'll start to notice.

Another reason to replicate the data between the two regions is that the front ends, just to create a single web page for a user request, often need dozens or hundreds of distinct data items from the caches or the databases. So the latency at which a front end can fetch those hundreds of items from memcache is quite important, and it's extremely important for the front ends to read only from local memcached servers and local databases, so they can do the hundreds of queries they need for a web page very rapidly. If we had partitioned the data between the two regions, then, say, if I'm looking at my friends and some of my friends' data is on the East Coast and some on the West Coast, the front ends might have to make many requests, at 50 milliseconds each, to the other data center, and users would see that kind of latency and be very upset. So another reason to replicate is to keep the front ends always close to all the data they need. Of course this makes writes more expensive, because a front end in the secondary region has to send its writes all the way across the country; but reads are far, far more frequent than writes, so it's a good trade-off. Although the paper doesn't mention it, it's possible that another reason for complete replication between the two sites is that, if the primary site goes down, perhaps they could switch the whole operation to the secondary site; but I don't know whether they had that in mind.
OK, so between regions the story is basically one of replication between the two data centers. Now, within a data center, within a region, there's a single set of database servers, so at the database level the data is sharded and not replicated inside each region. At the memcached level, however, they use replication as well as sharding; they have this notion of clusters. A given region actually supports multiple clusters, each with its own front ends and memcached servers. Say there are two clusters in a region: each cluster has a bunch of front ends and a bunch of memcached servers, and the clusters are almost completely independent. A front end in cluster one sends all its reads to the memcached servers in that cluster, and on a miss it goes to the region's one set of database servers; similarly, each front end in the other cluster talks only to memcached servers in its own cluster.

So why have multiple clusters? Why not just a single cluster, a single set of front-end servers and a single set of memcached servers shared by all those front ends? One reason is that if you did that, and to scale up capacity you just kept adding more memcached servers and front ends to the same cluster, you wouldn't get any performance win for popular keys. The data on these memcached servers is a mix: most of it is maybe only used by a small number of users, but there's some stuff that lots and lots of users need to look at. By using replication as well as sharding, they get multiple copies of the very popular keys, one per cluster, and therefore they get parallel serving of those keys across the different clusters.

Another reason not to let an individual cluster grow too big is that all the data within a cluster is partitioned over all of that cluster's memcached servers, and any one front end is typically going to need data from probably every single memcached server eventually. So you get an N-squared communication pattern between the front ends and the memcached servers, and to the extent they're using TCP for that communication, it involves a lot of overhead, a lot of connection state for all the different TCP connections. They wanted to limit the growth of this N-squared set of TCP connections, and the way to do that is to make sure no one cluster gets too big.

Related to that is the incast congestion business they talk about: if a front end needs data from lots of memcached servers, it sends out the requests more or less all at the same time, which means it gets the responses from all the memcached servers it queried at more or less the same time. That may mean dozens or hundreds of packets arriving at that front end simultaneously, which, if you're not careful, will cause packet loss. That's incast congestion.
To limit how bad the incast problem is, they have a bunch of techniques they discuss, but one of them is simply not making the clusters too large, so that the number of memcached servers a given front end talks to, all of which might be contributing to the same incast, never gets too big. A final reason the paper mentions is that behind all this is a big network in the data center, and it's hard to build networks that are both fast, in bits per second, and able to connect lots and lots of different computers. By splitting the data center up into clusters and having most of the communication stay within each cluster, they only need a modest-size, fast network per cluster; they don't have to build a single network that can handle all of the traffic among all the computers of one giant cluster. So it limits how expensive the underlying network is.

On the other hand, of course, they're replicating the data across the clusters, and for items that aren't very popular and aren't really going to benefit from the performance win of having multiple copies, it's wasteful to sit on all that RAM. We're talking about hundreds or thousands of servers, so the amount of money spent on RAM for the memcached servers is no joke. So in addition to the pool of memcached servers inside each cluster, there's also a regional pool of memcached servers, shared by all the clusters in a region. They modify the software on the front ends so that a front end knows: aha, the data for this key is actually not used that often; instead of storing it on a memcached server in my own cluster, I'm going to store this not-very-popular key on the appropriate memcached server in the regional pool. The regional pool is just an admission that some data is not popular enough to be worth having lots of replicas of; they can save money by caching only a single copy. All right, so that's the kind of replication-versus-partitioning strategy they use inside each region.
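Here's a sketch of the routing decision a front end ends up making inside a region: popular keys go to this cluster's own memcached pool, so each cluster holds a replica, while keys marked as unpopular go to the shared regional pool so only one copy exists. The popularity flag, the md5-mod-N hashing, and the server names are simplified stand-ins; real clients use consistent hashing and the per-key configuration the paper describes.

```python
# Simplified sketch of per-key pool selection inside a region. Names and the
# popularity test are hypothetical.

import hashlib

CLUSTER_POOL = ["mc-c1-0", "mc-c1-1", "mc-c1-2"]   # this cluster's memcached servers
REGIONAL_POOL = ["mc-region-0", "mc-region-1"]     # shared by every cluster in the region

def pick_server(key, replicate_per_cluster):
    pool = CLUSTER_POOL if replicate_per_cluster else REGIONAL_POOL
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return pool[h % len(pool)]         # a consistent choice of one server in the pool

print(pick_server("feed:alice", replicate_per_cluster=True))        # popular: cluster copy
print(pick_server("archive:2009-03", replicate_per_cluster=False))  # unpopular: regional copy
```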
A difficulty they discuss is that when they want to create a new cluster in a data center, they have a temporary performance problem while getting that cluster going. Suppose they decide to install a couple hundred machines as a new cluster, with new front ends and new memcached servers, fire it up, and send, say, half the users to the new cluster while the rest keep using the old one. In the beginning there's nothing in the new cluster's memcached servers, so all of its front ends miss and have to go to the databases, and at least until those memcached servers get populated with the heavily used data, this increases the load on the database servers enormously. Before adding the new cluster, maybe the database servers saw only one percent of the reads, because the memcached servers had a hit rate of, say, 99 percent, so only one percent of reads reached the databases. If we add a new cluster with nothing in its memcached servers and send half the traffic to it, it initially gets a 100 percent miss rate, so the overall miss rate becomes about 50 percent: we've gone from the database servers serving one percent of the reads to serving 50 percent. So, at least in this imaginary example, by quietly firing up a new cluster we may increase the load on the databases by a factor of 50, and chances are the database servers were running reasonably close to capacity, certainly not a factor of 50 under it. That would be the absolute end of the world if they just fired up a new cluster like that.

So instead they have this cold start idea, in which a new cluster is marked, by some flag somewhere, as being in a cold start state. In that situation, when a front end in the new cluster misses, it first asks its own local memcached; if that says "no, I don't have the data," the front end then asks the corresponding memcached server in another cluster, some warm cluster that's already been running. If the data is popular, chances are it will be cached there, so the front end gets its data and installs it in the local memcached. Only if both the local memcached and the warm memcached lack the data does the front end in the new cluster read from the database servers. They run in this cold mode for a little while, a couple of hours I think the paper mentions, until the memcached servers in the new cluster start to hold all the popular data, and then they can turn off the cold-start feature and just use the local cluster's memcache alone.
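A sketch of that cold-cluster read path: miss locally, borrow from a warm cluster, and only then touch the database. The helper objects are the same kind of dict-backed caches as before, and cold_mode stands in for the flag that operators turn off once the new cluster has warmed up; this is my own illustration of the behavior the paper describes.

```python
# Sketch of the cold-cluster-warmup read path. Names are illustrative.

def cold_start_read(key, local_cache, warm_cache, db_fetch, cold_mode=True):
    v = local_cache.get(key)
    if v is not None:
        return v                      # hit in the new (cold) cluster
    if cold_mode:
        v = warm_cache.get(key)       # borrow from an already-warm cluster's memcached
    if v is None:
        v = db_fetch(key)             # only now fall through to the database
    local_cache.set(key, v)           # warm up the local cluster for the next reader
    return v
```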
All right, another load problem the paper talks about, again deriving from this lookaside caching strategy, is called the thundering herd. The scenario is that there's some piece of very popular data stored on one particular memcached server, and a whole bunch of front ends are ordinarily reading it, constantly sending get requests for it. The memcached server has it in its cache and answers them, and a memcached server can serve something like a million requests per second, so we're doing pretty well; there's some database server sitting in the back that has the real copy of the data, but we're not bothering it because the data is cached. Now suppose some front end comes along and modifies this very popular data: it sends a write to the database with the new data, and then it sends a delete to the memcached server, because that's how writes work. Now we've just deleted this extremely popular item, and all these front ends constantly sending gets for it will all miss at about the same time. Having missed, they'll all send a read request to the database at the same time, so the database is suddenly faced with maybe dozens or hundreds of simultaneous requests for this data, and the load there will be pretty high. It's particularly disappointing because we know all of these requests are for the same key, so the database does the same work over and over again, responding with the latest written copy of that key, until finally the front ends get around to installing the new value in memcache and people start hitting again. That's the thundering herd.

What we'd really like, when a write and its delete cause a miss in memcache, is for the first front end that misses to fetch the data and install it, and for the other front ends to just take a deep breath and wait until the new data is cached. That's exactly what their design does, with this thing they call leases, which are a bit different from the leases we're used to. Let's start the scenario from scratch: we have a popular piece of data, and the first front end that asks for it and misses gets a reply from memcached saying "no, I don't have that data in my cache," but memcached also creates a lease, which is a unique number. It picks a lease number, installs it in a table, and sends the lease token back to that front end. Other front ends that then come in and ask for the same key are simply asked to wait, for a quarter of a second or some other reasonable amount of time, because the memcached server sees that it has already issued a lease for that key and a value is potentially on the way. So only one of the front ends gets the lease. That front end asks the database for the data, and when the database responds, the front end sends a set for the new data with the key, the value it got, and the lease, to prove it was the one allowed to write the data. Memcached looks up the lease, sees that yes, you're the one whose lease was granted, and does the install. By and by, the other front ends that were told to wait re-issue their reads, and now the data is there. So if all goes well, we get just one request to the database instead of dozens or hundreds. And I think the sense in which this is a lease is that, if the front end fails at an awkward moment and doesn't actually request the data from the database, or doesn't get around to installing it in memcached, eventually memcached will delete the lease because it times out, and the next front end to ask will get a new lease and will, we hope, talk to the database and install the new data. So, to answer the question: yes, the lease does have a timeout in case the first front end fails.
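Here's a toy version of that lease handshake on the memcached side: the first miss for a key hands back a token, later misses for the same key are told to wait, and only a set that presents the outstanding token is accepted. Real memcached leases also expire after a timeout and interact with deletes; this sketch of mine ignores those details.

```python
# Toy sketch of leases for the thundering-herd problem; only the core
# handshake, no timeouts or delete interaction.

import itertools

class LeasingCache:
    def __init__(self):
        self._table = {}                    # key -> cached value
        self._leases = {}                   # key -> outstanding lease token
        self._next_token = itertools.count(1)

    def get(self, key):
        """Return (value, lease). On the first miss, lease is a token the caller
        should use to fill the cache; on later misses it is the string 'wait'."""
        if key in self._table:
            return self._table[key], None
        if key in self._leases:
            return None, "wait"             # someone else is already fetching this key
        token = next(self._next_token)
        self._leases[key] = token
        return None, token

    def set_with_lease(self, key, value, token):
        if self._leases.get(key) == token:  # only the lease holder may fill the entry
            self._table[key] = value
            del self._leases[key]
            return True
        return False                        # stale or unknown lease: the set is dropped
```

A front end that gets back "wait" just retries its get after a short sleep; only the front end holding the token goes to the database.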
OK, so these leases are their solution to the thundering herd problem. Another problem they have is what happens when one of these memcached servers fails. If they didn't do anything special, then when a memcached server fails, the front ends send it a request and get back a timeout; the network says it couldn't contact that host and never got a response. What the library software then does is send the request to the database. So if a memcached server fails and we do nothing special, the database is now exposed directly to all the reads that memcached server was serving, and that server may well have been serving a million reads per second. The database server might then be exposed to those million reads per second, and it's nowhere near fast enough to deal with them. Now, Facebook, though they don't really talk about it in the paper, does have automated machinery to replace a failed memcached server, but it takes a while to set up a new memcached server and redirect all the front ends to the new server instead of the old one. In the meantime they need a temporary solution, and that's this gutter idea.

The scoop is this: we have our front ends, the ordinary set of memcached servers, and the database. One of the memcached servers has failed, and we're waiting for the automatic replacement system to replace it; in the meantime, front ends sending requests to it get a "server did not respond" error from the network. There's a presumably small set of gutter servers whose only purpose in life is to sit idle except when a real memcached server fails. When a front end gets an error back saying the get couldn't contact the memcached server, it sends the same request to one of the gutter servers; the paper doesn't say, but I imagine the front end again hashes the key to choose which gutter server to talk to. If the gutter server has the value, great; otherwise the front end contacts the database server to read the value and then installs it in that gutter server, in case somebody else asks for the same data. So while this memcached server is down, the gutter servers basically handle its requests. There will be a miss, handled with leases to avoid a thundering herd, on each of the items that lived on the failed memcached server, so there will be some load on the database server, but hopefully the gutter servers quickly acquire all the heavily used data and provide good service. By and by the failed server is replaced, and the front ends will know to talk to the replacement server instead.
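A sketch of that gutter fallback in the front-end read path: if the normal memcached server for a key doesn't respond, re-hash the key into the small gutter pool; on a gutter miss, read the database and install the value in gutter with a short expiry. The exception type, the expiry value, the set signature, and the helper names are illustrative; the paper describes the behavior but doesn't give this code.

```python
# Sketch of the gutter fallback in the front-end read path. Names, the expiry,
# and the failure signal are stand-ins for the real client library's.

def read_with_gutter(key, primary_cache, pick_gutter, db_fetch):
    """primary_cache: the usual memcached server for this key (may be down).
    pick_gutter(key): chooses a server from the small gutter pool."""
    try:
        v = primary_cache.get(key)
        if v is None:
            v = db_fetch(key)
            primary_cache.set(key, v)
        return v
    except ConnectionError:                  # the normal server didn't respond
        gutter = pick_gutter(key)
        v = gutter.get(key)
        if v is None:
            v = db_fetch(key)
            gutter.set(key, v, expire=10)    # gutter entries expire quickly; no deletes are sent here
        return v
```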
65:07 They don't, and this is today's question: I think they don't send deletes to these gutter servers because a gutter server could have taken over for any one, and maybe more than one, of the ordinary memcached servers, so it could actually be caching any key. That would mean that whenever a front end needs to delete a key from memcached, or whenever mcsqueal on the database side sends a delete for a key to the relevant memcached server, the natural design would be to also send a copy of that delete to every one of the gutter servers; and the same for front ends that are deleting data: they would delete from the memcached servers but would also have to delete, potentially, from every gutter server. That would double the number of deletes that had to be sent around, even though most of the time the gutter servers aren't doing anything, don't cache anything, and it wouldn't matter. So in order to avoid all these extra deletes, they instead configure the gutter servers to expire keys very rapidly, rather than hanging on to them until they're explicitly deleted. That was the answer to the question.

66:34 All right, I want to talk a bit about consistency, all of this at a super high level. The consistency problem is that there are lots of copies of the data. For any given piece of data there's a copy in the primary database, a copy in the corresponding database server in each of the secondary regions, a copy of that key in one of the memcached servers in each local cluster, maybe copies in the gutter servers, and maybe copies in the memcached and gutter servers at each other region. So we have lots and lots of copies of every piece of data floating around. When a write comes in, something has to happen to all of those copies. Furthermore, the writes may come from multiple sources: the same key may be written at the same time by multiple front ends in this region, and maybe by front ends in other regions too. It's this combination of concurrency, multiple copies, and multiple sources of writes that creates a lot of opportunity not just for data to be stale, but for stale data to be left in the system for long periods of time.

67:58 So I want to illustrate one of those problems. In a sense we've already talked a bit about this, when somebody asked why the front ends delete instead of updating; that's certainly one instance of having multiple sources of writes and therefore trouble enforcing a correct order. But here's another example of a race, an update race, that if they hadn't done something about it would have left stale data in memcached indefinitely. It has a similar flavor to the previous example.

68:38 So suppose we have client C1, which wants to read a key, but memcached says it doesn't have the data.
68:51 That's a miss, so C1 is going to read the data from the database, and let's say it gets back some value, call it v1.

69:09 Meanwhile, client C2 wants to update this data, so it sends a write, k = v2, to the database. The rule for writes, in the write code we saw earlier, is that the next thing to do is delete the key from memcached, so C2 deletes the key from memcached. Now, C2 doesn't really know what's in memcached, but deleting whatever is there is always safe, because a delete is certainly not going to cause stale data to appear; this is the sense in which the paper claims that delete is idempotent: it's always safe to delete.

70:02 But if you recall the pseudocode for what a read does: if you miss and then read the data from the database, you're supposed to insert that data into memcached. So client C1, which may have been slow, finally gets around to sending a set RPC to memcached, but it read v1, which is now an old, outdated version of the data, and that's what it sets into memcached. And one other thing has happened: we know that whenever something is written, the database sends deletes to memcached, so maybe at this point the database has also sent a delete for k to memcached. So now there are two deletes, but it doesn't really matter: those deletes may already have happened by the time C1 gets around to doing its set.

70:59 At this point memcached will be caching a stale version of this data indefinitely. If the system worked in just this way, there's no mechanism for memcached to ever get the actual, correct value; it's going to store and serve up stale data for key k forever.
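To pin down the interleaving, here is a toy replay of that race under the naive look-aside rules (reader: miss, read the database, set; writer: write the database, delete), with the steps ordered exactly as in the scenario just described; the dicts simply stand in for memcached and MySQL.

```python
# Toy replay of the stale-set race under the naive look-aside rules.
cache = {}                  # stands in for the memcached server
database = {"k": "v1"}      # stands in for MySQL

# Step 1: client C1 misses in the cache and reads v1 from the database,
# but is slow to install it.
c1_value = database["k"]

# Step 2: client C2 updates the database and deletes the key from the cache
# (the database-side mcsqueal delete would land here too).
database["k"] = "v2"
cache.pop("k", None)

# Step 3: C1 finally gets around to installing what it read earlier.
cache["k"] = c1_value       # installs stale v1

# From now on every reader hits the cache and sees v1, even though the
# database says v2, and nothing will ever delete or correct it.
assert cache["k"] == "v1" and database["k"] == "v2"
```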
71:27 They ran into this, and while they're okay with data being somewhat out of date, they're not okay with data being out of date forever, because users will eventually notice that they're seeing ancient data. So they had to make sure this scenario couldn't happen, and they actually solved this problem with the lease mechanism too, the same lease mechanism we described for the thundering herd, although there's an extension to it that makes this work.

72:04 What happens is that when memcached sends back a miss indication, saying the data wasn't in the cache, it grants a lease. So we get the miss indication plus this lease, which is basically just a big unique number, and the memcached server remembers the association between this lease and this key; it knows that somebody out there holds a lease to update this key. The new rule is that when the memcached server gets a delete, from either another client or from the database server, then as well as deleting the item it invalidates this lease.

72:44 So as soon as either of those deletes comes in, assuming it arrives before C1's set, the memcached server deletes this lease from its table of leases. Now when the set, which carries the lease, arrives from the front end, the memcached server will look at the lease and say: wait a minute, you don't have a valid lease for this key, so I'm going to ignore this set. Because one of the deletes came in before the set, the lease has been invalidated, the memcached server ignores the set, and the key just stays missing from memcached. The next client that tries to read that key will get a miss, will read the now-fresh data from the database, and will install it in memcached, and presumably the second time around that reader's lease will be valid.

73:46 You may, and indeed you should, ask what happens if the order is different: suppose the deletes, instead of arriving before the set, arrive after the set; I want to make sure the scheme still works then. The way things play out is that, since the deletes are late and happen after the set, the memcached server won't yet have removed the lease from its table, so the lease will still be there when the set arrives, and yes, memcached will accept the set, and we will be setting the key to a stale value. But our assumption this time was that the deletes are late, which means they are yet to arrive, and when those deletes do arrive, the stale data will be knocked out of the cache. So the stale data will sit in the cache a little bit longer, but we won't have the situation where stale data sits in the cache indefinitely and is never deleted.

74:52 Any questions about the lease machinery?
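Extending the earlier lease sketch with this new rule, the fix might look roughly like the following; again a Python sketch under my own naming, with the thundering-herd wait path omitted to keep it short. A delete drops any outstanding lease, and a set is accepted only if its lease is still on the table.

```python
import random

# Sketch of the lease rule that prevents the stale set: a delete invalidates
# any outstanding lease, and a set is accepted only with a still-valid lease.
# Illustrative only; the real memcached lease code is structured differently.

cache = {}
leases = {}    # key -> lease token handed to the front end that missed

def get(key):
    if key in cache:
        return ('hit', cache[key])
    token = random.getrandbits(64)
    leases[key] = token
    return ('miss', token)

def delete(key):
    """Called by writing front ends and by mcsqueal when the database changes."""
    cache.pop(key, None)
    leases.pop(key, None)    # invalidate any lease granted before this delete

def set_with_lease(key, value, token):
    if leases.get(key) != token:
        return False         # a delete arrived first: drop the stale set
    cache[key] = value
    del leases[key]
    return True

# Replaying the race: C1 misses and gets a lease, C2's write causes a delete,
# then C1's late set is rejected, so the next reader re-reads the fresh value.
status, token = get("k")     # C1: miss, lease granted
delete("k")                  # C2 wrote the database; the delete kills the lease
assert set_with_lease("k", "v1-stale", token) is False
assert "k" not in cache      # key stays missing until someone reads fresh data
```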
75:07 Okay, to wrap up: it's certainly fair to view a lot of the complexity of this system as stemming from the fact that it was put together out of pieces that didn't know about each other. It would be nice, for example, if memcached knew about the database, if memcached and the database cooperated in a consistency scheme. And perhaps, if Facebook could have predicted at the very beginning how things would play out and what the problems would be, and if they had had enough engineers to work on it, they could have built from the start a system that provided everything they needed: high performance, multi-data-center replication, partitioning, all of it. There are companies that have done that. The example I know of that's most directly comparable to the system in this paper, and which you might want to look at if you care about this stuff, is Yahoo's PNUTS storage system, which was designed from scratch and differs in many details, but does provide multi-site replication with consistency and good performance. So it's possible to do better; all the same issues are present, but PNUTS just had a more integrated and perhaps more elegant set of solutions.

76:41 The takeaways for us from this paper: one is that, for them at least and for many big operations, caching is vital, absolutely vital, to surviving high load, and the caching is not so much about reducing latency as about hiding enormous load from relatively slow storage servers. That's what the cache is really doing for Facebook: concealing almost all of the load from the database servers. Another takeaway is that in big systems you always need to be thinking about partitioning versus replication: you need ways, either formal or informal, of deciding how much of your resources will be devoted to partitioning and how much to replication. And finally, ideally you'd do a better job than the system in this paper of integrating the different storage layers from the beginning, in order to achieve good consistency.

77:51 Okay, that is all I have to say. Please ask me questions if you have any.