Transcript

Alright, hello everyone, let's get started. Today's topic is causal consistency, and the COPS paper we read is a case study for causal consistency. The setting is familiar: we're talking again about big websites that have data in multiple data centers. They want to replicate all of their data in each of their data centers, to keep a copy close to users and perhaps for fault tolerance. So, as usual, maybe we'll have three data centers, and because we're building big systems we're going to shard the data: every data center has multiple servers, maybe one holding all the keys that start with A through M and another holding the corresponding shard of the rest. We've seen this arrangement before.

The usual goals apply. There are many different designs for how to make this work, but you'd certainly like reads to be fast, because these web workloads tend to be read-dominated; you'd like writes to work; and you'd like as much consistency as you can get. The fast reads are interesting because the clients are typically web servers acting on behalf of users' browsers: there's some set of web servers, which I'll call the clients of the storage system, each really talking to a user's browser. The typical arrangement is that reads happen locally, and writes might be a little more complicated.

One system that fits this pattern is Spanner. Remember that Spanner writes involve Paxos running across all the data centers: if a client in one data center needs to do a write, the write requires Paxos, running on one of these servers, to talk to at least a majority of the other data centers that hold replicas. So writes tend to be a little slow, but they're consistent, and in addition Spanner supports two-phase commit, so we had transactions. The reads are much faster, because reads use the TrueTime scheme the Spanner paper described and really only consult local replicas.

We also read the Facebook memcached paper, which is another design in this same pattern. In the Facebook memcached setup there's a primary site that has the primary set of MySQL databases. If a client wants to do a write, and the primary site is, say, data center 3, it has to send all writes to data center 3, and data center 3 then sends out new values or invalidations to the other data centers. So writes are a little expensive, not unlike Spanner. On the other hand, all the reads are local: when a client needs to do a read, it can consult a memcached server in its local data center, and those memcached servers are blindingly fast; the paper reports that a single memcached server can serve a million reads per second. So again, the Facebook memcached scheme needs cross-data-center communication for writes, but the reads are local.
The question for today, and the question the COPS paper is answering, is whether we can have a system that allows writes to proceed purely locally. From the client's point of view, the client can send a write to the local replica in its own data center, and likewise do reads against just the local replicas, and never have to wait for other data centers, or even talk to them, in order to do a write. What we really want is a system that has local reads and local writes. That's the big goal, and it's really a performance goal: unlike Spanner and the Facebook design, purely local writes would be much faster from the client's point of view. It might also help with fault tolerance and robustness: if writes can be done locally, then we don't have to worry about whether the other data centers are up, or whether we can talk to them quickly, because clients don't need to wait for them.

So we're going to be looking for systems with this level of performance, and we're going to let the consistency model trail along behind the performance. We'll certainly be worried about consistency, because if you initially apply writes only to the local replicas, what about the other data centers' replicas? But the attitude for this lecture, at least, is that once we figure out how to get good performance, we'll then figure out how to define the resulting consistency and think about whether it's good enough.

Okay, so that's the overall strategy. I'm going to talk about two strawman designs, okay-but-not-great designs, on the way to how COPS actually works. First I want to talk about the simplest design I can think of that follows this local-write strategy; I'll call it strawman 1.

In strawman 1 we have three data centers, and let's assume the data is sharded two ways in each of them: keys from A through M on one server and the rest on another, sharded the same way in each data center. Clients read locally. If a client writes, say, a key that starts with M, it sends the write to the local shard server responsible for keys starting with M, and that shard server returns a reply to the client immediately, saying yes, I did your write. In addition, each server maintains a queue of outstanding writes that clients have recently sent to it and that it needs to send to the other data centers, and it streams those writes asynchronously, in the background, to the corresponding servers in the other data centers. So after replying to the client, our shard server sends a copy of the client's write to each of the other data centers; those writes go through the network, maybe they take a long time, and eventually they arrive at the other data centers, where each of those shard servers applies the write to its local table of data.
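Here's a minimal sketch, in Go, of what a strawman-1 shard server might look like; the types and the channel-based stand-in for the network are my own illustrative assumptions, not anything from the paper or the lecture.

// Sketch of a strawman-1 shard server: apply the put locally, acknowledge
// immediately, and let a background goroutine stream queued writes to the
// corresponding shard servers in the other data centers.
package strawman1

import "sync"

type Write struct {
	Key, Value string
}

type ShardServer struct {
	mu    sync.Mutex
	data  map[string]string
	queue chan Write     // outstanding writes not yet pushed elsewhere
	peers []chan<- Write // stand-in for the link to each remote data center
}

// Put applies the write locally and returns right away; the client never
// waits for the other data centers.
func (s *ShardServer) Put(key, value string) {
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
	s.queue <- Write{key, value}
}

// Get is purely local.
func (s *ShardServer) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

// replicate runs in the background, streaming queued writes to each
// remote data center asynchronously.
func (s *ShardServer) replicate() {
	for w := range s.queue {
		for _, p := range s.peers {
			p <- w
		}
	}
}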
This is a design with very good performance: reads and writes are all done locally, clients never have to wait, and there's a lot of parallelism. The shard server for A and the shard server for M operate independently: if the shard server for A gets a write, it has to push that data to the corresponding shard servers in the other data centers, but it can do those pushes independently of the other shard servers' pushes. So there's parallelism both in serving and in pushing the writes around.

If you think about it a bit, this design also essentially favors reads: reads never have any impact beyond the local data center. Writes do a bit more work; the client doesn't have to wait, but the shard server then has to push the write out to the other data centers. So reads involve less work than writes, which is appropriate for a read-heavy workload. If you were more worried about write performance, you could imagine other designs, for example one in which reads consult multiple data centers and writes are purely local: when you do a read, you fetch the current copy of the key from each of the other data centers and choose the most recent one, so writes are very cheap and reads are expensive. Or you can imagine combinations of the two, some sort of quorum-overlap scheme in which you write to a majority of data centers, read from a majority of data centers, and rely on the overlap. In fact there are real systems, used commercially on real websites, that follow much this design; if you're interested in a real-world version of it you can look up Amazon's Dynamo system or the open-source Cassandra system. They're much more elaborate than what I've sketched here, but they follow the same basic pattern.

The usual name for this kind of scheme is eventual consistency. The reason for that name is that, at least initially, if you do a write, readers at other data centers are not guaranteed to see it, but they will someday, because you're pushing the writes out, so they'll eventually see your data. There's no guarantee about order. For example, if I'm a client and I write a key starting with M and then a key starting with A, the shard server for M sends out my write, and the server for A sends out my write for A, but these may travel at different speeds or over different routes on the wide-area network. So maybe the client wrote M first and then A, but at one data center the update for A arrives first and then the update for M,
while they arrive in the opposite order at another data center. Different clients are going to observe updates in different orders; there's no ordering guarantee.

The sense in which it's eventually consistent, the ultimate meaning, is that if things settle down, people stop writing, and all of these write messages finally arrive at their destinations and are processed, then an eventually consistent system ought to end up with the same value stored at all of the replicas. If you wait for the dust to settle, everybody ends up with the same data. That's a very weak spec, but because it's a loose spec there's a lot of freedom in the implementation and a lot of opportunity to get good performance, because the system basically doesn't require you to do anything instantly or to observe any ordering rules. It's quite different from most of the consistency schemes we've seen so far. As I mentioned, eventual consistency is used in deployed systems, but it can be quite tricky for application programmers.

So let me sketch an example of something you might want to do on a website where you'd have to be pretty careful, and where you might be surprised, if this is an eventually consistent system. Suppose we're building a website that stores photos. Every user has a set of photos stored as key/value pairs, with some sort of unique ID as the key, and every user maintains a list of their public photos that they allow other people to see. Suppose I take a photograph and want to insert it into this system: I, the human, contact a web server, and the web server runs code that inserts my photo into the storage system and then adds a reference to my photo to my photo list. Let's say this happens on client C1, which is the web server I'm talking to. The code calls the put operation for my photo (I'm just writing it as a key and a value), and when that put finishes it adds the photo to my list with a second put. That's what my client's code looks like. Somebody else who wants to look at my photographs is going to fetch a copy of my list of photos and then look at the photos on the list: client C2 calls get on my list, sees the photo I just uploaded on the list, and then calls get on the key for that photo.

This is totally straightforward code that looks like it ought to work, but in an eventually consistent system it's not necessarily going to work. The problem is that these two puts, even though the client issued them in the obvious order (first insert the photo, then add a reference to it to my list of photos), in the eventually consistent scheme I've outlined the second put could arrive at other data centers before the first put.
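To make the example concrete, here's roughly what the two clients' code looks like, against a hypothetical eventually consistent put/get interface; the key names and the string-valued photo list are simplifications of mine.

// Writer (client C1) and reader (client C2) for the photo example, against
// a plain eventually consistent store: nothing here prevents the list
// update from becoming visible elsewhere before the photo itself.
package photos

import "strings"

type Store interface {
	Put(key, value string)
	Get(key string) string
}

// C1: insert the photo, then add a reference to it to my public photo list.
func UploadPhoto(s Store, photoKey, photoData, listKey string) {
	s.Put(photoKey, photoData)                  // first put: the photo itself
	s.Put(listKey, s.Get(listKey)+" "+photoKey) // second put: publish it on the list
}

// C2, possibly at another data center: read the list, then fetch a photo
// that appears on it.
func ViewPhoto(s Store, listKey, photoKey string) string {
	list := s.Get(listKey)
	if !strings.Contains(list, photoKey) {
		return "" // photo not published yet
	}
	return s.Get(photoKey) // may still be missing: the anomaly discussed below
}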
So this other client, if it's reading at a different data center, might see the updated list with my new photo in it, but when it goes to fetch the photo that's on the list, the photo may not exist yet, because the first write may not have arrived over the wide area at the data center client C2 is using. This is going to be a routine occurrence in an eventually consistent system if we don't do anything more clever. This kind of behavior, where the code looks like it ought to work at some intuitive level, but when you actually read the spec for the system (which says essentially no guarantees) you realize that this correct-looking code may not do what you think, is often called an anomaly. The way to think about it is not that this behavior, seeing the photo on the list while the photo doesn't exist yet, is an error; it's not incorrect, because the system never guaranteed that this get would actually yield the photo. It's just weaker than you might have hoped.

It's still possible to program such a system, and people do it all the time; there are a whole lot of tricks you can use. For example, a defensive programmer might write code knowing that if you see something on a list it may not really exist yet, so if you see a reference to a photo in the list and the get of the photo comes back empty, you just wait a little and retry, because by and by the photo will probably show up, and if it doesn't you skip it and don't display it to the user. So it's totally possible to program in this style, but we could definitely hope for behavior from the storage system that's more intuitive than this, behavior that would make the programmer's life easier. We can imagine systems that have fewer anomalies than this very simple eventually consistent system.

Okay, before I go on to talking about how to make the consistency a little bit better, I want to discuss something important I left out of this eventual-consistency design: how to decide which write is most recent. If a piece of data might ever be written by more than one party, there's the possibility that we'll have to decide which data item is newer. Suppose we have some key, call it K, and two clients launch writes for it: one client writes the value 1 and another client writes the value 2. We need to set things up so that all three data centers agree on what the final value of K is, because after all we're at least guaranteeing eventual consistency: when the dust settles, all the data centers must have the same data. So data center 3 is going to receive these two writes and pick one of them as the final value for K.
Of course data center 2 sees the same pair of writes, including its own, and they all had better make the same decision about which one becomes the final value, regardless of the order the writes arrived in. Data center 3 may observe them arriving in one order and some other data center may observe them in a different order, so we can't just accept whichever arrives second as the final value. We need a more robust scheme for deciding what the most recent value for a key is, so we're going to need some notion of version numbers.

The most straightforward way to assign version numbers is to use wall-clock time. The idea is that when a client generates a put, either it or the local shard server it talks to looks at the current time (oh, it's 1:25 right now) and associates that time as the version number of its version of the key. We both store the timestamp in the database and annotate the write messages sent between data centers with the time. Say one write was stamped at time 102 and the other at 103. If the write stamped 103 arrives first, data center 3 puts the key in its database with timestamp 103, and when the write stamped 102 arrives, data center 3 says, oh, that's actually an older write, I'm going to ignore it, because it has a lower timestamp than the one I already have. If they arrive in the other order, the data center briefly stores the 102 write until the write with the higher timestamp arrives and replaces it. Since everybody sees the same timestamps, once they've finally received all these write messages over the Internet, they're all going to end up with databases holding the highest-numbered values.

So this almost works, but there are two little problems with it. One is that two data centers that do writes at the same time may assign the same timestamp. That's relatively easy to solve, and the way it's typically done is that timestamps are actually pairs: the time in the high bits, and in the low bits some sort of identifier, which could be almost anything as long as it's unique, such as the data center name or ID, or a server ID. Then if two writes from different data centers carry the same time, they'll have different low bits, and those low bits disambiguate which of the two writes has the lower timestamp and therefore should yield to the one with the higher. So we're going to stick some sort of ID in the bottom bits; the paper talks about doing this, and it's very common.
The other problem is that this scheme only works well if the clocks at all the data centers are closely synchronized, which is something the Spanner paper stressed at great length. If the clocks on all the servers at all the data centers agree, this is going to be okay, but if the clocks are off by seconds, or maybe even minutes, then we have a serious problem. One not-so-important symptom is that writes that came earlier in real time, which should be overwritten by later writes, might be assigned high timestamps because their server's clock runs fast, and therefore never be superseded by writes that actually came later. Now, we never made any guarantees about this; it's eventual consistency, and we never said writes that come later in time will win over earlier writes. Nevertheless, even with weak consistency we don't want needlessly strange behavior, like a user writing something, updating it later, and the later update never seeming to take effect because the earlier update was assigned a timestamp that's too large. In addition, if some server's clock is too far ahead, say a minute fast, then for a whole minute nobody else's writes can win: we'd have to wait for all the other servers' clocks to catch up to the fast server's clock before anybody else can do a write that takes effect.

One way to solve that problem is an idea called Lamport clocks. The paper talks about this, although it doesn't really say why it uses Lamport clocks; I'm guessing it's at least partially for the reason I just outlined. A Lamport clock is a way of assigning timestamps that are related to real time but that cope with the problem of some servers having clocks that run too fast. Every server keeps a value, call it Tmax, which is the highest version number it has seen so far from anywhere. So if somebody else is generating timestamps that are ahead of real time, the other servers that see those timestamps will have Tmax values that reflect being ahead of real time. Then, when a server needs to assign a timestamp, a version number, to a new put, it takes the maximum of Tmax plus one and the wall-clock time. That means each new version number (these are the version numbers that accompany values in our eventually consistent system) is higher than the highest version number seen so far, so higher than whatever the last write was to the data we're updating, and at least as high as real time. If nobody's clock is ahead, Tmax plus one will probably be smaller than real time, and the timestamp will end up being the real time. If some server has a crazy clock that's too fast, its updates will advance everyone else's Tmax, so that when they allocate new version numbers, those are higher than the version number of the latest write they saw from the server whose clock is too fast.
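As a concrete illustration, here's a small sketch of this version-number assignment; the exact packing of the time and the node ID into one integer is my own choice, just to show the time-plus-unique-ID pair idea from above.

// Lamport-clock style version numbers: max(Tmax+1, wall clock), with a
// unique node ID in the low bits as a tie-breaker between servers.
package versions

import (
	"sync"
	"time"
)

type Clock struct {
	mu     sync.Mutex
	tmax   uint64 // highest time component seen so far, from anywhere
	nodeID uint64 // unique per server or data center; assumed to fit in 16 bits
}

// Next returns the version number for a new put.
func (c *Clock) Next() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	t := c.tmax + 1
	if now := uint64(time.Now().UnixMilli()); now > t {
		t = now // nobody is ahead of real time, so just use the wall clock
	}
	c.tmax = t
	return t<<16 | c.nodeID // pack as <time, nodeID>
}

// Observe is called for every version received from another server, so a
// fast clock elsewhere drags everyone's Tmax (and future versions) upward.
func (c *Clock) Observe(version uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t := version >> 16; t > c.tmax {
		c.tmax = t
	}
}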
Okay, so that's Lamport clocks, and that's how the paper assigns version numbers; they come up all the time in distributed systems.

All right, another problem I want to bring up about our eventually consistent system is the question of what to do about concurrent writes to the same key. It's actually even worse: concurrent writes might both carry important information that ought to be preserved. For example, suppose two different clients, client 1 and client 2, both issue a put to the same key, and both of these get sent to data center 3. The question is what data center 3 does with the information in the one write and the information in the other. This is a real puzzle, and there isn't a good answer. What the paper uses is last-writer-wins: data center 3 looks at the version number assigned to each write; one of them will be higher, because it's slightly later in time or has a higher data-center ID in the low bits, and data center 3 simply throws away the data with the lower timestamp and accepts the data with the higher timestamp. That's it. This last-writer-wins policy has the virtue that it's deterministic, and everybody is going to reach the same answer.

But you can think of examples in which it's not what people want. For instance, suppose what these puts are trying to do is increment a counter: both clients saw the counter with value 10, both added one, and both put 11. What we really wanted was for both increments to take effect, ending with the value 12. In that case last-writer-wins is really not that great; what we would have wanted was for data center 3 to somehow combine this increment and that increment and end up with 12. These systems are really not generally powerful enough to do that, but we would like better: what we'd really like is more sophisticated conflict resolution.

Other systems we've seen solve this. The most powerful support real transactions: instead of just put and get, they have increment operations that do atomic, transactional increments, so increments aren't lost; transactions are maybe the most powerful way of resolving conflicting updates. We've also seen systems that support a notion of mini-transactions, where at least on a single piece of data you can have atomic operations like atomic increment or atomic test-and-set. You can also imagine wanting a system that does some sort of custom conflict resolution. Suppose the value we're keeping here is a shopping cart with a bunch of items in it, and our user, because they're running two windows in their browser, adds two different items to their shopping cart through two different web servers. We'd like those two conflicting writes to the same shopping cart to resolve, probably by taking the set union of the two shopping carts, instead of throwing one away and accepting the other.
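For contrast with that richer, application-specific resolution, here's a minimal sketch of the default last-writer-wins rule described above, with the version represented as the time-plus-ID pair from earlier; the struct names are illustrative.

// Last-writer-wins: keep whichever write carries the higher version,
// regardless of arrival order, so every data center converges to the
// same final value (and the losing write's information is simply lost).
package lww

type Version struct {
	Time   uint64 // Lamport/wall-clock component
	NodeID uint64 // tie-breaker: unique server or data-center ID
}

// Less reports whether version a is older than version b.
func (a Version) Less(b Version) bool {
	if a.Time != b.Time {
		return a.Time < b.Time
	}
	return a.NodeID < b.NodeID
}

type Stored struct {
	Value   string
	Version Version
}

// Apply installs an incoming write only if it is newer than what is stored.
func Apply(table map[string]Stored, key, value string, v Version) {
	if cur, ok := table[key]; !ok || cur.Version.Less(v) {
		table[key] = Stored{Value: value, Version: v}
	}
}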
I'm bringing this up not because there's a satisfying solution; indeed the paper doesn't really propose much of a solution. It's just a drawback of weakly consistent systems that it's easy to get into situations where you have conflicting writes to the same data that you'd like to resolve in some sophisticated, application-specific way, but that's generally quite hard. It's a thorn in people's sides that typically has to be lived with, and that goes both for the eventual consistency of my strawman here and for COPS; the paper spends a couple of paragraphs on how its machinery could be used to do better, but doesn't really explore it, because it's difficult.

Okay, back to eventual consistency, my strawman system. If you recall, it had a real problem with even this very simple scenario: I do a put of a photo and a put of my photo list, and then somebody else at a different data center reads the new list, but when they read the photo they find there's nothing there. So can we do better? Can we build a system that still allows local reads and local writes but has slightly fewer anomalies? I'm going to propose one that's one step closer to the paper: this is strawman 2.

In this scheme I'm going to propose a new operation, not just put and get but also a sync operation that clients can use. Sync takes a key and a version number, and what sync does, when a client calls it, is wait until all data centers' copies of key K are at least as up to date as the specified version number. It's a way of forcing order: the client can say, look, I'm going to wait until every data center knows about this value, and only proceed after every data center knows about it. In order for clients to know what version numbers to pass to sync, we'll change the put call a bit, so that you say put(key, value) and put returns the version number of the updated key. You could think of sync as acting as a sort of barrier or fence; we could call this scheme eventual consistency plus barriers.

I'm going to talk about how to use it in a moment, but keep in mind that this sync call is likely to be pretty slow, because the natural implementation of it actually goes out and talks to all the other data centers, asks each one whether its version of key K is up to at least this version number, and then waits for all the data centers to respond; if any of them says no, it has to keep waiting.
All right, so how would you use this? Again, for our photo list: now client 1, the one updating photos, calls put to insert the photo and gets back a version number. There's a danger that if it immediately updates the photo list, some other data center may not have seen my photo yet. So the programmer next calls sync on the photo's key with the version number that put returned, waiting for all data centers to have that version, and only after the sync returns does client 1 call put to update the photo list.

Now if client 2 comes along, it reads the photo list and then reads the photo: client 2 does a get of the photo list (let's say time is passing downward, same as before), and if it sees the photo on that list it does a get, again in its local data center, of the photo. Now we're in a much better situation. If client 2, in a different data center, saw the photo in the list, that means client 1 had already called put on the list, because it's that put that adds the photo to the list. And if client 1 had already called put on the list, then, given the way this code works, it had already called sync, and sync doesn't return until the photo is present at all data centers. So the programmer for client 2 can rely on this: if the photo is in the list, then whoever added the photo to the list had their sync complete, and the fact that the sync completed means the photo is present everywhere, and therefore this get of the photo will actually return the photograph.

Okay, so this works, and it's actually reasonably practical. It does require fairly careful thought on the part of the programmer, who has to think: aha, I need a sync here, I need put, sync, put in order for things to work out right. The reader is much faster, but the reader still needs to think: if the code does a get of the list and then a get of a photo from that list, the programmer has to verify that the code that modified the list indeed called sync before adding things to the list. That is quite a bit of thought. What the sync call is all about is enforcing order: making sure the first put completely finishes before the second one happens. The sync explicitly forces order for writers, and readers also have to think about order. The order is obvious in this example, but it's true that if the writer did a put, then a sync, then a put of a second thing, then readers almost always need to read the second thing and then read the first thing, because the guarantee you get out of this sync scheme, out of these barriers, is that if a reader sees the second piece of data, they're guaranteed to also be able to see the first piece of data. So the reader needs to read the second piece of data first and then the first piece of data.

Okay, there's a question about fault tolerance, namely: if one data center goes down, doesn't that mean sync blocks until that data center is brought back up? That's absolutely right; you're totally correct, this is not a great scheme. This is a strawman on the way to COPS, and this sync call would block.
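For comparison with the earlier sketch, here's the same photo example rewritten against a hypothetical strawman-2 interface, where Put returns a version and Sync blocks until every data center has at least that version; the interface is an assumption of mine, matching the description above.

// Photo example on top of "eventual consistency plus barriers":
// put the photo, sync it everywhere, and only then publish it on the list.
package photos2

import "strings"

type Version uint64

type Client interface {
	Put(key, value string) Version
	Get(key string) (string, Version)
	Sync(key string, v Version) // wait until all data centers have key at >= v
}

// Writer (client C1).
func UploadPhoto(c Client, photoKey, photoData, listKey string) {
	v := c.Put(photoKey, photoData)
	c.Sync(photoKey, v) // barrier: every data center now has the photo
	list, _ := c.Get(listKey)
	c.Put(listKey, list+" "+photoKey)
}

// Reader (client C2), at any data center: reads the list first, then the
// photo. If the photo is on the list, the writer's Sync already finished,
// so the photo get cannot come back empty.
func ViewPhoto(c Client, listKey, photoKey string) (string, bool) {
	list, _ := c.Get(listKey)
	if !strings.Contains(list, photoKey) {
		return "", false
	}
	photo, _ := c.Get(photoKey)
	return photo, true
}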
The way real-world versions of this avoid that problem (if a data center is down, will sync block forever?) is that puts and gets both actually consult a quorum of data centers, so the sync only waits for, say, a majority of data centers to acknowledge that they have the latest version of the photo, and a get has to consult an overlapping majority of data centers in order to get the data. So real versions of this are not quite as rosy as I may be implying; again, the systems that work this way, if you're interested, are Dynamo and Cassandra, and they use quorums to avoid the blocking problem.

Okay, so this is a straightforward design with decent semantics, even though it's slow and, as you observed, not very fault tolerant. The read performance is outstanding, because reads are still purely local, at least if the quorum setup is read-one/write-all. The write performance is not great, but it's okay if you don't write very much, or if you don't mind waiting. One reason you can maybe convince yourself that the write performance is not a disaster is that, after all, the Facebook memcached paper has to send all writes through the primary data center: Facebook runs multiple data centers and clients talk to all of them, but the writes all have to be sent to the MySQL databases at the one primary data center. Similarly, Spanner writes have to wait for a majority of replica sites to acknowledge the write before the client is allowed to proceed. So the notion that writes might have to wait to talk to other data centers in order to allow reads to be fast does not appear to be outrageous in practice.

Still, you might like to have a system that does better than this: to somehow get the semantics of sync, where sync forces the first put to definitely appear to everyone to happen before the second put, but without the cost. So we'll be interested in systems, and this is starting to get close to what COPS does, in which, instead of forcing the client to wait at that point, we somehow encode the order as a piece of information that we tell the readers, or tell the other data centers.

A simple way to do that, which the paper mentions as a non-scalable implementation, is a logging approach. At each data center, instead of having the different shard servers talk to their counterparts in the other data centers independently, there's a designated log server that's in charge of sending writes to the other data centers. That means that if a client does a put to its local shard, that shard, instead of sending the data out separately to the other data centers, talks to its local log server
and appends the write to the one log that this data center is accumulating. If a client then does a write to a different key, say we're writing key A and then key B, then again, instead of the B shard server sending its write out independently, it tells the local log server to append the write to the log, and the log server sends its log out to the other data centers in log order. All data centers are guaranteed to see the write to A first, and they'll process that write to A first, and then all data centers see the write to B. So if a client does a write to A first and then a write to B, the writes show up in that order: the log holds A and then B, and the write to A is sent first and then the write to B to each of the other data centers. They probably have to be sent to a single log-receiving server at each destination, which plays out the writes one at a time, as they arrive, in log order.

So this is the logging strategy that the paper criticizes. It actually regains the performance we want, because now we've eliminated the syncs: clients can go back to just doing a put of A and then a put of B, and a client's put can return as soon as the data is sitting in the log at the local log server. So client puts and gets are quite fast again, but we're preserving order, basically through the sequence numbers of the entries in the logs rather than by having the clients wait. That's nice: we're now forcing ordered writes, and we're causing the writes to show up in order at the other data centers, so reading clients will see them in order, and my example application might actually work out with this scheme.

The drawback the paper points to about this style of solution is that all the writes now have to go through this one log server. If we have a big database with maybe hundreds of servers serving, in total, a reasonably high write workload, all the writes have to go through the log server, and possibly all the writes have to be played out through a single receiving log server at the far end. As the system grows and there get to be more and more shards, a single log server may stop being fast enough to process all these writes. So COPS does not follow this approach to conveying the ordering constraints to other data centers.
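Before moving on, here's a minimal sketch of that single-log idea, just to show where the bottleneck sits; the channel-based transport is an illustrative stand-in for the real network, not anything the paper specifies.

// One log server per data center: every local shard appends its writes
// here, and a single loop ships the log to each remote data center in
// order. Ordering is trivial, but this one server sees every write.
package logship

type Write struct {
	Key, Value string
}

type LogServer struct {
	appendCh chan Write     // all local shard servers append here
	remotes  []chan<- Write // one in-order stream per remote data center
}

// Append is called by a local shard server right after it applies a put.
func (l *LogServer) Append(w Write) {
	l.appendCh <- w
}

// ship forwards writes strictly in log order, so remote data centers apply
// them in the order clients issued them.
func (l *LogServer) ship() {
	for w := range l.appendCh {
		for _, r := range l.remotes {
			r <- w
		}
	}
}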
Okay, so we want to build a system that can, at least from the client's point of view, process writes and reads purely locally; we don't want clients to wait in order to get ordering. We like the fact that writes are being forwarded asynchronously, but we somehow want to eliminate the central log server: we want to convey order information to the other data centers without funneling all our writes through a single log server. And that brings us to what COPS is actually up to. What I'm going to talk about now is what COPS does, though I'll be describing the non-GT version of COPS, that is, COPS without get-transactions.

The basic strategy is that as COPS clients read and write locally, they accumulate information about the order in which they're doing things, information that's a little more fine-grained than in the logging scheme, and that information is sent to the remote data centers whenever a client does a put. So we have this notion of a client context. Say a client does some gets and puts: maybe a get of X, then a get of Y, then a put of Z with some value. The library the client uses, the one that implements put and get, accumulates this context information on the side as the puts and gets occur. If the get of X yields a value with version 2 (just as an example: the get returns the current value of X, and that current value has version 2), then the context records that this client has read X and got version 2. Then, after the get of Y, which say returns the current value with version 4, the COPS client library adds to the context, so it's not just that we've read X and got version 2, but also that we've read Y and got version 4.

When the client does the put, the information sent to the local shard server is not just the key and the value but also these dependencies: we're telling the local shard server for Z that this client, before doing the put, had already read X and got version 2, and read Y and got version 4. What's going on is that the client is expressing ordering information: this put of Z came after the client had seen X version 2 and Y version 4, so anybody else who reads this version of Z had also better be able to see X and Y at at least those versions. Similarly, if the client then does a put of something else, say Q, what's sent to the local shard server is not just Q and its value, but also the fact that this client had previously done some gets and a put. Suppose the put of Z yielded version 3, that is, the local shard server said it assigned version 3 to your new value for Z; then when we come to do the put of Q, it's accompanied by dependency information saying that this put of Q comes after the put that created Z version 3. At least notionally the rest of the context ought to be passed as well, although we'll see that for various reasons COPS can optimize that away, and when there's a preceding put it only sends the version information for that put.

There's a question: is it important for the context to be ordered? I don't believe so. I think it's sufficient to treat the context, or at least the information that's sent with the put, as just a big bag of dependencies, at least for non-transactional COPS. So the clients accumulate this context and basically send it with each put, and the context encodes the order information that in my previous strawman, strawman 2, was forced by sync. Instead of waiting, we accompany each put with the statement that this put needs to come after these previous values.
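Here's a rough sketch of what the client library's context tracking might look like; the interface and names (Dep, putAfter, and so on) are mine, not the paper's API, and the collapse of the context after a put is the optimization just mentioned, which we'll justify later.

// Client-side context tracking: gets record <key, version> dependencies,
// and each put carries the accumulated bag of dependencies to the local
// shard server, then collapses the context to just the put's own version.
package copsclient

type Version uint64

type Dep struct {
	Key     string
	Version Version
}

// local is whatever talks to the shard servers in this data center.
type local interface {
	get(key string) (value string, v Version)
	putAfter(key, value string, deps []Dep) Version
}

type Context struct {
	srv  local
	deps []Dep // unordered "bag" of dependencies since the last put
}

// Get reads locally and remembers which version was observed.
func (c *Context) Get(key string) string {
	value, v := c.srv.get(key)
	c.deps = append(c.deps, Dep{key, v})
	return value
}

// Put sends the write plus its dependencies, then replaces the context
// with just the new version: everything earlier is carried transitively.
func (c *Context) Put(key, value string) Version {
	v := c.srv.putAfter(key, value, c.deps)
	c.deps = []Dep{{key, v}}
	return v
}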
COPS calls these relationships, in which a put needs to come after certain previous values, dependencies, and it writes them like this: supposing our put produces Z version 3, there are really two dependencies here, one saying that X version 2 comes before Z version 3, and the other saying that Y version 4 comes before Z version 3. That's just the definition, or notation, the paper uses to talk about the individual pieces of ordering information that COPS needs to enforce.

All right, so what does this dependency information, passed to the local shard server, actually cause COPS to do? When a COPS shard server receives a put from a local client, it first assigns the new version number, then stores the new value (it stores, for Z, the new value along with the version number it allocated), and then sends the whole thing to each of the other data centers. At least in non-GT COPS, the local shard server only remembers the key, the value, and the latest version number; it doesn't actually remember the dependencies, it only forwards them across the network to the other data centers.

So now the position we're in is this: say a client produced a put of Z and some value, it was assigned version number 3, and it had these dependencies, X version 2 and Y version 4. This is sent from data center 1, let's say, to the other data centers, so data center 2 and data center 3 both receive it. In fact this information is sent by the shard server for Z; there are lots of shard servers, but only the shard for Z is involved. At data center 3, the shard server for Z receives this put, forwarded by the shard server the client talked to, with the dependency information that X version 2 and Y version 4 come before Z version 3. What that really means, operationally, is that this new version of Z can't be revealed to clients until its dependencies, those versions of X and Y, have already been revealed to clients in data center 3. That means the shard server for Z must delay applying this write to Z until it knows that the two dependencies are visible in the local data center. So the Z server has to go off to the local shard server for X and the shard server for Y and send each a message asking what its current version number for X, or for Y, is, and wait for the result. If both of those shard servers reply with a version number that's 2 or higher for X, and 4 or higher for Y, then the Z server can go ahead and apply the put to its local table of data.
However, maybe those two shard servers haven't yet received the updates that correspond to X version 2 or Y version 4. In that case the shard server for Z has to hold on to this update until the indicated versions of X and Y have actually arrived and been installed on those two shard servers, so there may be some delays. Only after the dependencies are visible at data center 3 can the shard server for Z go ahead and update its table so that Z has version 3. And what that means, of course, is that if a client at data center 3 does a read of Z and sees version 3, then, because the server already waited, if that client then reads X or Y it's guaranteed to see at least version 2 of X and at least version 4 of Y, because the server didn't reveal Z until it was sure the dependencies would be visible.

Okay, a question: what if X and Y never get their values, perhaps due to a network partition? Does the Z shard block forever? Yes, the semantics require the Z shard to block forever; that's absolutely true. There's certainly an assumption here that things turn out okay in one of two ways. One is that somebody repairs the network, or repairs whatever was broken, and X and Y do eventually get their updates, so Z will finally be able to apply the update, though it might have to wait a long time. The other possibility is that the data center is entirely destroyed, the building burns down, and then we don't have to worry about this at all. But it does point out a problem that's a real criticism of causal consistency: these delays can actually be quite nasty. You can imagine that Z is waiting for the correct value of X to arrive, and even if there are no failures and nothing burns down, mere slowness can be irritating: Z may have to wait for X to show up. And it could be that X has already showed up and arrived at its shard server, but it itself had dependencies, maybe on a key A, so that shard server can't install it until the update for A arrives, and Z still has to wait for that, because what Z is waiting for is for this version of X to be visible to clients, which means installed. So if the update for X has arrived but is itself waiting for some other dependency, we can get these cascading dependency waits. In real life these probably would happen, and it's one of the problems people bring up against causal consistency when you try to persuade them it's a good idea, this problem of cascading delays. That's too bad, although on that note it is true that the authors of the COPS paper have a couple of interesting follow-on papers, and one of them has some mitigations for this cascading-wait problem.
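Here's a minimal sketch of the remote side of this, assuming a hypothetical depCheck helper that blocks until a given key is visible locally at a given version; that blocking call is exactly where the cascading waits just described would show up.

// A remote shard server applying a replicated put: wait until every
// dependency is visible in this data center, then install the new value
// (last-writer-wins on the version number).
package copsserver

type Version uint64

type Dep struct {
	Key     string
	Version Version
}

type ReplicatedPut struct {
	Key     string
	Value   string
	Version Version
	Deps    []Dep // e.g. {X, v2} and {Y, v4} for the Z example above
}

type stored struct {
	value   string
	version Version
}

type Shard struct {
	store    map[string]stored
	depCheck func(key string, v Version) // blocks until key is locally visible at >= v
}

// ApplyReplicated delays the write until its dependencies are visible,
// then applies it if it is newer than what is already stored.
func (s *Shard) ApplyReplicated(p ReplicatedPut) {
	for _, d := range p.Deps {
		s.depCheck(d.Key, d.Version) // may wait for updates that haven't arrived yet
	}
	if cur, ok := s.store[p.Key]; !ok || cur.version < p.Version {
		s.store[p.Key] = stored{p.Value, p.Version}
	}
}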
Okay, so for our photo example, this COPS scheme will actually solve it, and the reason is that the put we're talking about is the put for the photo list, and its dependency list is going to contain the insert of the photo. That means that when the put for the photo list arrives at the remote site, the remote shard server is essentially going to wait for the photo to be inserted and visible before it updates the photo list. So any client at a remote site that is able to see the updated photo list is guaranteed to be able to see the photo as well. This COPS scheme fixes the photo and photo-list example.

The scheme COPS is implementing is usually called causal consistency. There's a question: is it up to the programmer to specify the dependencies? No; it turns out that the context information accumulated here can be accumulated automatically by the COPS client library. The program only does gets and puts, and may not even need to see the version numbers: a simple program just does gets and puts, and internally the COPS library maintains these contexts and adds the extra information to the put RPCs, so the system automatically tracks the dependency information. That's very convenient.

Just to pop up a level for a moment: we've now built a system whose semantics are powerful enough to make the photo example code work out correctly, to have the expected result instead of anomalous results, and at least arguably it's reasonably efficient, because the client never has to wait for writes to complete, there's none of this sync business, and the communication is mostly independent, with no central log server. So arguably this has both reasonably high performance and reasonably good semantics, reasonably good consistency.

The consistency that this design produces is usually called causal consistency, and it's actually a much older idea than this paper; there were a bunch of causal consistency schemes before this paper, and indeed a bunch of follow-on work, so it's an enduring idea that people like a lot. As for what causal consistency means: here I'm putting up a copy of, I think, figure 2 from the paper. What the definition says is that the clients' actions induce dependencies, and there are two ways dependencies are induced. One is that if a given client does a put or a get and then a put, we say the later put depends on the previous put or get; so in this case the put of Y equal to 2 depends on the put of X equal to 1. That's one form of dependency. The other form is that if one client reads a value out of the storage system, we say that the get this second client issued depends on the corresponding put that actually inserted that value, from a previous client. Furthermore, we say the dependency relationship is transitive: this put depends on that get, this get by client 2 depends on the put by client 1,
and by transitivity we can conclude that client 2's get depends on client 1's earlier get as well. So that means the last put, by client 3 for example, depends on all of these previous operations. That's the definition of causal dependency.

A causally consistent system then says that if, by the definition of dependency I just outlined, B depends on A, and a client reads B, then the client must subsequently also be able to see A, the dependency. If a client ever sees the second of two operations ordered by dependency, the client is also, after that, guaranteed to be able to see everything that that operation depended on. That's the definition, and in a sense it's directly derived from what the system actually does.

This is very nice when updates are causally related. These clients are, in some sense, talking to each other indirectly through the storage system, and the clients are kind of aware of that: if somebody reads this value, sees 5, and inspects the code, they can conclude there's a sense in which this earlier put definitely must have come before the last put, and so if you see the last put you really deserve to see the first put too. In that sense causal consistency gives programmers a well-behaved view; it allows them to see well-behaved values coming out of the storage system.

Another thing that's good about causal consistency is that when two updates in the system are not causally related, the causally consistent system, the COPS storage system, has no obligation to maintain order between them, and that's good for performance. For example, if client 1 does a put of X and then a put of Z, and around the same time client 2 does a put of Y, there's no causal relationship between the put of Y and any of the actions of client 1, so COPS is allowed to do all the work associated with the put of Y completely independently of client 1's puts. The way that plays out is that the put of Y entirely involves the servers for the shard Y is in, while the other two puts only involve servers for the shards that X and Z are in. There may be some interaction, because the remote servers for Z may have to wait for the X put to arrive, but they never have to talk to the servers in charge of Y. So that's a sense in which causal consistency allows parallelism and good performance. And this is potentially different from linearizable systems: in a linearizable system, the fact that the put of Y came after the put of X in real time actually imposes some requirements on the storage system, but
So you might be able to build a causally consistent system that's faster than a linearizable system.

Okay, there's a question: would COPS gain any more information by keeping the gets in the client context after a put? This may be a reference to today's lecture question, so let me explain the answer. Suppose a client does a get of x, then a put to y, and then a put to z, and let's look at its context. Initially the context is {x: version something}, and when the client sends the put of y to the server it includes that context along with it. But in the actual system there's an optimization: after a put, the context is replaced by simply the version number returned for that put, and anything previously in the context — namely this information about x — is erased from the client's context. So after the put, the context is just the version number the put returned; say the server returns version 7 of y.

The reason this is correct and doesn't lose any information, for non-transactional COPS, is that when the put of y is sent out to the remote sites, it's accompanied by "x at version whatever" in its dependency list, so that put won't be applied at any data center until that version of x has also been applied there. Then, when the client does the next put, what gets sent to the other data centers is a put of z whose dependency list is just y version 7. The other data centers, before applying z, will check that y version 7 has been applied at their data center — and we know y version 7 won't be applied there until x at version whatever has been applied there. So there's a sort of cascade of delays: telling the other data centers to wait for y version 7 to be installed implies that they must also already be waiting for whatever y version 7 depended on. Because of that, we don't need to also include the version of x in z's dependency list; those data centers will already be waiting for that version of x. So the answer to the question is no: non-transactional COPS doesn't need to remember the gets in the context after it's done a put.
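Here is that optimization as a tiny worked trace, with made-up version numbers: after each put the context collapses to just that put's version, and transitivity at the remote sites makes up for the entries that get dropped.

    # Hypothetical trace of the context-collapse optimization.
    context = {}

    context["x"] = 3              # get(x) returned x at version 3

    deps_for_y = dict(context)    # put(y, ...) is sent with deps {"x": 3}
    context = {"y": 7}            # server returned y version 7; x is dropped,
                                  # because no site will expose y v7 before x v3

    deps_for_z = dict(context)    # put(z, ...) is sent with deps {"y": 7} only
    context = {"z": 9}            # x v3 never needs to be mentioned again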
All right, a final thing to note about this scheme is that COPS only sees certain relationships — it's only aware of certain causal relationships. COPS is aware that if a single client thread does a put and then another put, the second put depends on the first. Furthermore, COPS is aware that when a client reads a certain value, it's depending on the put that created that value, and therefore on anything that put depended on. So COPS is directly aware of those dependencies. However, it is often the case that causality in the larger sense is conveyed through channels that COPS is not aware of. For example, suppose client 1 does a put of x, and then the human controlling client 1 calls up client 2 on the telephone — or sends email, or something — and says, "Look, I just updated the database with some new information, why don't you go look at it," and then client 2 does a get of x. In the larger sense, causality would suggest that client 2 really ought to see the updated x, because client 2 knew from the telephone call that x had been updated. If COPS had known about the telephone call — if the call had itself been a put here and a get of that "telephone call" value there, and that get had seen that put — COPS would know enough to arrange that the get of x would see the put of x. But because COPS was totally unaware of the telephone call, there's no reason to expect that this get will actually yield the new value. So COPS is enforcing causal consistency, but only for the kinds of causation that COPS is directly aware of. That means the sense in which COPS's causal consistency eliminates anomalous behavior holds only if you restrict your notion of causality to what COPS can see. In the larger sense you're still going to see odd behavior: you're definitely going to see situations where someone believes a value has been updated and yet doesn't see the updated value, because their belief was caused by something COPS wasn't aware of.

All right, another potential problem — which I'm not going to talk about in depth — goes back to the photo and photo-list example. Remember that there was a particular order for adding the photo and adding it to the list, and a particular, different, order for reading the list and then the photo, that made the system work with causal consistency. We were definitely relying on the reader reading the photo list and then reading the photo, in that order, so that the fact that a photo is referred to in the photo list means the read of the photo will succeed. It is, however, the case that there are situations where no single order of reading, or combination of orders of reading and writing, will produce the behavior we want. This is leading into transactions, which I'm not going to have enough time to explain, but I at least want to mention the problem the paper sets up. Suppose our photo list is protected by an access control list, which is basically a list of usernames that are allowed to look at the photos on my list. That means the software that implements these photo lists with access control lists needs to read the photo list and also read the access control list, and check whether the user trying to do the read is in the access control list. However, neither order of getting the access control list and the list of photos works out.
If the client code first gets the access control list and then gets the list of photos, that order doesn't always work so well. Suppose my client reads the access control list and sees that I'm on it, but then, right at this point, the owner of the photo list deletes me from the access control list and inserts into the list a new photograph that I'm not supposed to see. So client 2 does a put of the access control list to delete me, and then a put of the photo list to add a photo I'm not allowed to see. When my client gets around to its second get, it may see the now-updated list that contains the photo I'm not allowed to see — but my client thinks, aha, I'm in the access control list (because it read an old one), and here's this photo, so I'm allowed to see it. In that case we're getting what we know to be an inconsistent combination of a new photo list and an old access control list, but causal consistency allows this: it only says that every time you do a get, you'll see data that's at least as new as the dependencies.

And indeed, as the paper points out, if you think it through, it's also not correct for the reading client to first read the list of photos and then read the access control list, because something can sneak in between: the list I read may have a photo I'm not allowed to see, and at that time maybe the access control list didn't include me, but before my second get the owner may delete the private photo and add me to the access control list, and then I'll see myself in the list. So in this order it's also not right, because we might get an old photo list and a new access control list.

So causal consistency, as I've described it so far, isn't powerful enough to deal with this situation. We need some notion of getting a mutually consistent photo list and access control list — either both from before some update or both from after it. COPS-GT actually provides a way of doing this, essentially by doing both gets together. COPS-GT sends the full set of dependencies back to the client when it does a get, and that means the client is in a position to check the dependencies of both returned values and notice that, say, the dependency list for the access control list mentions a version of the photo list that's farther ahead than the version of the list it actually got back — and in that case COPS-GT will re-fetch that data.
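Here is a rough sketch of that two-round get-transaction idea. The store interface used here (get_with_deps, get_by_version) and the return shapes are simplifications made up for this illustration, but the structure is the essential point: read every key once, then re-fetch any key that some other returned value depends on at a newer version than the one we got.

    def get_trans(store, keys):
        # Round 1: local reads, keeping each value's version and dependency list.
        results = {k: store.get_with_deps(k) for k in keys}  # k -> (value, version, deps)

        # For each requested key, find the newest version any other result depends on.
        needed = {}
        for _, _, deps in results.values():
            for dep_key, dep_version in deps.items():
                if dep_key in keys:
                    needed[dep_key] = max(needed.get(dep_key, 0), dep_version)

        # Round 2: re-fetch any key whose returned version is older than required.
        for k in keys:
            _, version, _ = results[k]
            if needed.get(k, 0) > version:
                results[k] = store.get_by_version(k, needed[k])

        return {k: v for k, (v, _, _) in results.items()}

In the access-control example, round 2 is exactly the re-fetch just described: whichever of the two returned values depends on a newer version of the other key than the one fetched, that other key is read again at the required version, so the client ends up with a mutually consistent pair.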
One last question: is it related to the thread of execution? Yes — causal consistency isn't really about wall-clock time; it has no notion of wall-clock time. The only form of ordering it obeys that's even a little bit related to wall-clock time is that if a single thread does one thing, and then another, and another, then causal consistency does consider those three operations to be in that order — but that's because one client thread did that sequence of things, not because there was a real-time relationship between them.

Just to wrap up and put this into a larger context: causal consistency has been a very promising research area for a long time, because it does seem like it might provide good-enough consistency while also giving you more opportunities than linearizability to get high performance. However, it hasn't actually gotten much traction in the real world. People use eventually consistent systems and they use strongly consistent systems, but it's very rare to see a deployed system that uses causal consistency, and there are a bunch of potential reasons for that — it's always hard to tell exactly why people do or don't use some technology in real-world systems. One reason is that it can be awkward to track per-client causality: in the real world, a user and their browser are likely to contact different web servers at different times, so it's not enough for a single web server to keep a user's context — you need some way to stitch together the context for a single user as they visit different web servers at the same website, and that's painful. Another is the problem that COPS only tracks the causal dependencies it knows about, so it doesn't provide ironclad causality, only certain kinds of causality, which limits how appealing it is. Another is that eventually consistent and causally consistent systems can provide only the most limited notion of transactions, and people, I think, increasingly wish their storage systems had transactions. Finally, the amount of overhead required to track, push around, and store all that dependency information can be quite significant. I wasn't able to really detect this in the performance section of the paper, but it is quite a lot of information that has to be stored and pushed around, and if you were hoping for the millions-of-operations-per-second level of performance that Facebook, at least, was getting out of memcached, the overhead you'd pay for causal consistency might be extremely significant. So those are reasons why causal consistency maybe hasn't caught on yet, although maybe someday it will.

Okay, that's all I have to say. Starting next lecture we'll be switching gears away from storage to a sequence of three lectures on blockchains. I'll see you on Thursday.