Today I want to do two things: finish the discussion of ZooKeeper, and then talk about CRAQ. The particular things I'm most interested in talking about with ZooKeeper are the design of its API, which allows ZooKeeper to be a general-purpose service that really bites off significant tasks that distributed systems need — so why is that a good API design — and then the more specific topic of mini-transactions, which turns out to be a worthwhile idea to know.

So, the API. Just to recall: ZooKeeper is built on a Raft-like replication layer, so we can think of it as being — and indeed it is — fault tolerant, and it does the right thing with respect to partitions. It has a performance enhancement by which reads can be processed at any replica, and therefore reads can be stale; we have to keep that in mind as we analyze various uses of the ZooKeeper interface. On the other hand, ZooKeeper does guarantee that every replica processes the stream of writes in order, one at a time, with all replicas executing the writes in the same order, so that the replicas' states advance in exactly the same way. It also guarantees that all of the operations, reads and writes, generated by a single client are processed by the system in order — the order the client issued them in — and that successive operations from a given client always see the same point or a later point in the write stream than the previous operation from that client.

Before I dive into what the API looks like and why it's useful, it's worth thinking about what kinds of problems ZooKeeper is aiming to solve, or could be expected to solve. For me, a totally central motivating example of why you would want to use ZooKeeper is as an implementation of the test-and-set service that VMware FT required in order for either server to take over when the other one failed. It was a bit of a mystery in the VMware paper what that test-and-set service is: how is it built, is it fault tolerant, does it itself tolerate partitions? ZooKeeper actually gives us the tools to write a fault-tolerant test-and-set service of exactly the kind VMware FT needed — one that is fault tolerant and does do the right thing under partitions. That's a central kind of thing ZooKeeper is doing.

There are also a bunch of other ways people turn out to use it; ZooKeeper was very successful, and people use it for a lot of stuff. One thing people use it for is just to publish configuration information for other servers to use — for example, the IP address of the current master for some set of workers. Another classic use of ZooKeeper is to elect a master: when the old master fails, we need everyone to agree on who the new master is, and to elect only one master even if there are partitions. You can elect a master using ZooKeeper primitives.
And — for small amounts of state anyway — if whatever master you elect needs to keep some state and keep it up to date, like information about who the primary is for a given chunk of data, as you'd want in GFS, the master can store that state in ZooKeeper. It knows ZooKeeper is not going to lose it: if the master crashes and we elect a new master to replace it, the new master can read the old master's state right out of ZooKeeper and rely on it actually being there.

Other things you might imagine: in MapReduce-like systems, workers could register themselves by creating little files in ZooKeeper. And again with systems like MapReduce, you can imagine the master telling the workers what to do by writing things in ZooKeeper — writing lists of work items — and then workers take those work items one by one out of ZooKeeper and delete them as they complete them. People use ZooKeeper for all of these things.

(Student question.) So the question is how people use ZooKeeper, and in general, yes: if you're running some big data center and you run all kinds of stuff in it — web servers, storage systems, MapReduce, who knows what — you might fire up one ZooKeeper cluster, because this general-purpose service can be used for lots of things. So you run five or seven ZooKeeper replicas, and then as you deploy various services you design them to store some of their critical state in your one ZooKeeper cluster.

All right, the API. ZooKeeper looks like a file system at some level. It's got a directory hierarchy: there's a root directory, and then maybe each application has its own subdirectory — application one keeps its files in this directory, app two keeps its files in that directory — and those directories have files and directories underneath them. One reason for this is just that ZooKeeper, as I just mentioned, is designed to be shared between many possibly unrelated activities, so we need a naming system to keep the information from those activities distinct, so they don't get confused and read each other's data by mistake. Within each application, it turns out that a lot of convenient ways of using ZooKeeper involve creating multiple files; we'll see a couple of examples of this in a few minutes.

So it looks like a file system, but that's not very deep: you can't really use it like a file system in the sense of mounting it and running ls and cat and all those things. It's just that internally it names objects with path names — here are a few different files, x, y, and z — and when you talk to ZooKeeper you send it an RPC saying "please read this data", naming the data you want, maybe /app2/x. It's just a hierarchical naming scheme.

These files and directories are called znodes, and it turns out there are three kinds you have to know about, which help ZooKeeper solve various problems for us.
There are regular znodes: if you create one, it's permanent until you delete it. There are ephemeral znodes: if a client creates an ephemeral znode, ZooKeeper will delete that znode if it believes the client has died. Ephemeral znodes are actually tied to client sessions, so clients have to send a little heartbeat into ZooKeeper every once in a while saying "I'm still alive, I'm still alive" so that ZooKeeper won't delete their ephemeral files. And the last characteristic a file may have is sequential: when you ask to create a file with a given name, what you actually end up creating is a file with that name but with a number appended to it. ZooKeeper guarantees never to repeat a number if multiple clients try to create sequential files at the same time, and always to use monotonically increasing sequence numbers in the names it appends them to. We'll see all of these things come up in examples.

At one level, the RPC interface that ZooKeeper exposes is sort of what you might expect for files. There's a create RPC, where you give it a name — really a full path name — some initial data, and some combination of these flags. An interesting part of create's semantics is that it's exclusive: when I send a create into ZooKeeper asking it to create a file, ZooKeeper responds with a yes or no. If that file didn't exist and I'm the first client who wants to create it, ZooKeeper says yes and creates the file; if the file already exists, ZooKeeper says no and returns an error. So creates are exclusive, and if multiple clients are trying to create the same file — which we'll see in the locking examples — the clients know which one of them actually managed to create it.

There's also delete. One thing I didn't mention is that every znode has a version number that advances as it's modified, and with delete, along with some other update operations, you can send in a version number saying "only do this operation if the file's current version number is the version I specified". That turns out to be helpful in situations where multiple clients might be trying to do the same operation at the same time — you can pass a version saying "only delete if the version still matches".

There's an exists call: does this path-named znode exist? An interesting extra argument is that you can ask to watch for changes to whatever path name you specified. You ask "does this path name exist?", and whether or not it exists now, if you pass in true for the watch flag, ZooKeeper guarantees to notify the client if anything changes about that path name — it's created, or deleted, or modified. Furthermore, the check for whether the file exists and the setting of the watch information inside ZooKeeper are atomic: nothing can happen between the point in the write stream at which ZooKeeper looks to see whether the path exists and the point in the write stream at which ZooKeeper inserts the watch into its table. That turns out to be very important for correctness.
There's also getData, which takes a path and again a watch flag, and now the watch applies to the contents of that file. And there's setData: again a path, the new data, and the conditional version — if you pass in a version, ZooKeeper only actually does the write if the current version number of the file is equal to the number you passed in.
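Before the examples, here's a minimal sketch of what these operations might look like as a Go interface. This is hypothetical shorthand for the calls just described — the names, types, and flag values are mine, not the real ZooKeeper client library's — but it captures the pieces the examples below lean on: exclusive create, ephemeral and sequential flags, versions, and watches.

```go
package zksketch

// Flags a client can pass to Create, mirroring the znode kinds described
// above (hypothetical names and values).
type Flag int32

const (
	FlagEphemeral  Flag = 1 // znode is deleted when the creating client's session ends
	FlagSequential Flag = 2 // ZooKeeper appends a monotonically increasing number to the name
)

// Client is a hypothetical, simplified view of the RPC interface described
// in the lecture; watches are surfaced as channels that fire once.
type Client interface {
	// Create is exclusive: it fails if the path already exists.
	Create(path string, data []byte, flags Flag) (createdPath string, err error)

	// Delete succeeds only if version matches the znode's current version
	// (a sentinel such as -1 could mean "unconditional").
	Delete(path string, version int32) error

	// Exists reports whether path exists; with watch=true, the returned
	// channel receives one event when the path is created, deleted, or modified.
	Exists(path string, watch bool) (exists bool, events <-chan struct{}, err error)

	// GetData returns the contents and current version; the optional watch
	// fires on changes to the contents.
	GetData(path string, watch bool) (data []byte, version int32, events <-chan struct{}, err error)

	// SetData writes only if version matches the znode's current version.
	SetData(path string, data []byte, version int32) error

	// List returns the names of the children of a directory-like znode
	// (this call shows up later, in the scalable lock example).
	List(path string) ([]string, error)
}
```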
OK, so let's see how we use this. The first, very simple example: suppose we have a file in ZooKeeper, we want to store a number in that file, and we want to be able to increment that number. Maybe we're keeping a statistics count, and whenever a client gets a request from a web user or something, it increments that count in ZooKeeper. More than one client may want to increment the count — that's the critical thing.

One thing to get out of the way first is whether we actually need some specialized interface in order to support client coordination, as opposed to just data. This looks like a file system — could we just provide the ordinary read/write kind of interface that typical storage systems provide? For example, some of you have started (and you'll all start soon) Lab 3, in which you build a key/value store where the only operations are Put(key, value) and Get(key). So one question is: can we do all the things we might want to do with ZooKeeper just with Lab 3's put/get interface? Suppose I want to implement this counter with Lab 3's key/value interface. I might increment the count by saying x = Get(k), for whatever key we're using, and then Put(k, x+1).

Why is this a bad answer? Yes — it's not atomic. That is absolutely the root of the problem; that's the abstract way of putting it. One way of looking at it is that if two clients both want to increment the counter at the same time, they're both going to use Get to read the old value — say 10 — they're both going to add one and get 11, and they're both going to call Put with 11. Now we've increased the counter by one, but two clients were incrementing it, so surely we should have ended up increasing it by two. That's why Lab 3 can't be used for even this simple example. Furthermore, in the ZooKeeper world, gets can return stale data. That's not true of Lab 3 — Lab 3's Gets are not allowed to return stale data — but in ZooKeeper reads can be stale, so if you read a stale version of the counter and add one to it, you're now writing the wrong value: if the real value is 11 but your Get returns a stale value of 10, you add 1 and Put 11, and that's a mistake, because we really should have been putting 12. So ZooKeeper has this additional problem we have to worry about: gets don't return the latest data.

OK, so how would you do this in ZooKeeper? Here's how I would do it. It turns out you need to wrap this code sequence in a loop, because it's not guaranteed to succeed the first time. So we say: while true, call getData to get the current value of the counter and the current version — x, v = getData(filename); I don't care what the file name is. Now we have a value and a version number, possibly not fresh, possibly stale, but maybe fresh. Then we use a conditional put — a conditional setData(filename, x+1, v) — and if setData returns true, meaning it actually did set the value, we break; otherwise we go back to the top of the loop.

What's going on here is that we read some value and some version number, maybe stale, maybe fresh, out of a replica. The setData we send actually goes to the ZooKeeper leader, because all writes go to the leader, and what it means is: only set the value to x+1 if the real version — the latest version — is still v. So if we read fresh data, and nothing else is going on in the system — no other clients are trying to increment this — then we'll read the latest value and version, add one to the latest value, specify the latest version, our setData will be accepted by the leader, we'll get back a positive reply after it's committed, and we'll break because we're done. If we got stale data here, or this was fresh data but by the time our setData got to the leader some other client trying to increment got its setData there before us, our version number will no longer be current. In either of those cases the setData will fail, we'll get an error response back, we won't break out of the loop, and we'll go back and try again, and hopefully we'll succeed this time.
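Here's that loop as a minimal Go sketch, written against a small slice of the hypothetical interface above (redeclared so this block stands alone); the error value and method shapes are assumptions, not the real client library's.

```go
package zksketch

import (
	"errors"
	"strconv"
)

// ErrBadVersion is the hypothetical error a conditional SetData returns
// when the znode's version no longer matches the one we supplied.
var ErrBadVersion = errors.New("zk: version mismatch")

// counterClient is the small slice of the hypothetical client this sketch needs.
type counterClient interface {
	GetData(path string, watch bool) (data []byte, version int32, events <-chan struct{}, err error)
	SetData(path string, data []byte, version int32) error
}

// IncrementCounter is the mini-transaction from the lecture: read a possibly
// stale value and version from whatever replica we're talking to, then ask
// the leader to write value+1 only if the version is still current, retrying
// until the conditional write succeeds.
func IncrementCounter(zk counterClient, path string) error {
	for {
		data, version, _, err := zk.GetData(path, false)
		if err != nil {
			return err
		}
		n, err := strconv.Atoi(string(data))
		if err != nil {
			return err
		}
		// Conditional write: only succeeds if nobody modified the znode
		// since the version we read.
		err = zk.SetData(path, []byte(strconv.Itoa(n+1)), version)
		if err == nil {
			return nil // our read-increment-write took effect atomically
		}
		if !errors.Is(err, ErrBadVersion) {
			return err // some unrelated failure; give up
		}
		// Stale read, or a concurrent increment beat us to the leader: retry.
	}
}
```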
(Student question.) So the question is: this is a while loop — are we guaranteed it will ever finish? No, we're not really guaranteed to finish. In practice, for example, if the replica we're reading from is cut off from the leader and permanently gives us stale data, then maybe this isn't going to work out. But in real life, the leader is pushing all the replicas toward having data identical to its own, so if we just got stale data here, probably when we go back — and maybe we should sleep for ten milliseconds or something at this point — eventually we're going to see the latest data.

The situation in which this might genuinely be pretty bad news is if there's a very high continuous load of increments from clients. If we have a thousand clients all trying to do increments, the risk is that maybe none of them succeeds — well, I think one of them will succeed, because the first one that gets its setData into the leader will succeed and the rest will all fail because their version numbers are all too low; then the next 999 will send their gets and setDatas in and one of them will succeed; and so on. So it has a sort of N-squared complexity to get through all of the clients, which is very damaging, but it will finish eventually. If you thought you were going to have a lot of clients you would use a different strategy here; this is good for low-load situations.

(Student question about storing a lot of data.) If it fits in memory, it's no problem; if it doesn't fit in memory, it's a disaster. When you're using ZooKeeper you have to keep in mind that it's great for 100 megabytes of stuff and probably terrible for 100 gigabytes of stuff. That's why people think of it as storing configuration information rather than the real data of your big website.

(Student question: could you use a watch in this sequence?) Yes, that could be. If we wanted to fix this to work under high load, you would certainly want to sleep at this point. The way I would fix this — my instinct — would be to insert a sleep here, and furthermore make it a randomized sleep whose span of randomness doubles each time we fail. That's a tried-and-true strategy: exponential backoff. It's actually similar to Raft leader election, and it's a reasonable strategy for adapting to an unknown number of concurrent clients.
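A tiny sketch of that randomized, doubling backoff, as one might bolt it onto the bottom of the retry loop above; the constants are arbitrary choices of mine, not something the lecture or the paper prescribes.

```go
package zksketch

import (
	"math/rand"
	"time"
)

// backoff sleeps for a random duration whose upper bound roughly doubles
// with each failed attempt, capped so retries never wait too long.
// One might call backoff(attempt) each time the conditional SetData fails.
func backoff(attempt int) {
	const base = 10 * time.Millisecond
	const limit = 2 * time.Second

	upper := base << attempt // 10ms, 20ms, 40ms, ...
	if upper <= 0 || upper > limit {
		upper = limit // also guards against shift overflow
	}
	time.Sleep(time.Duration(rand.Int63n(int64(upper))))
}
```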
OK, so tell me what's right — we do the getData with watch set to true? Yes: if somebody else modifies the data before you call setData, maybe you'll get a watch notification. The problem is that the timing doesn't work in your favor. The amount of time between when I receive the data and when I send off the setData to the leader is essentially zero — roughly no time passes there — while if some other client sent in an increment at about this time, it's actually quite a long time before that increment works its way through the leader, is sent out to the followers, is executed by the followers, and the followers look it up in their watch tables and send me a notification. And if you're going to read at a point that's after where the modification occurred — a modification that should fire the watch — you'll get the watch notification before you get the read response. But in any case, I think nothing like this could save us, because what's going to happen is that all thousand clients do the same thing, whatever it is: they all do a get and set a watch, and they don't get a notification, because none of them has done the write yet. So the worst case is that all the clients start at the same point: they all do a get, they all get version one, they all set a watch, they don't get a notification because no change has occurred, and they all send a setData RPC to the leader — all thousand of them. The first one changes the data, and now the other 999 get a notification when it's too late, because they've already sent their setData. So it's possible that a watch could help us here, but not the straightforward version of watch. We'll talk about this in a few minutes, but the second locking example — the herd-free one — absolutely solves this kind of problem, so we could adapt the second locking example from the paper to cause the increments to happen one at a time if there's a huge number of clients that want to do it. Other questions about this example?

OK, this is an example of what many people call a mini-transaction. It's transactional in the sense that — well, there's a lot of funny stuff happening here — the effect is that once it all succeeds, we have achieved an atomic read-modify-write of the counter. The difficulty is that the read and the modify and the write are not themselves atomic. The thing we have pulled off is that this sequence, once it finishes, is atomic: on the pass through the loop that succeeded, we managed to read, increment, and write without anything else intervening — we did those steps atomically. This isn't a full database transaction: real databases allow fully general transactions, where you can say "begin transaction", then read or write anything you like — maybe thousands of different data items, who knows what — then say "end transaction", and the database will cleverly commit the whole thing as one atomic transaction. Real transactions can be very complicated. ZooKeeper supports this extremely simplified version — atomic operations on one piece of data — but it's enough to get increment and some other things. For that reason — they're not general, but they do provide atomicity — these are often called mini-transactions.

It turns out this pattern can be made to work for various other things too. If we wanted the test-and-set that VMware FT requires, it can be implemented with very much this setup: read the old value; if it's zero, try to set it to one, but pass the version number. If nobody else intervened — if the version number hadn't changed when the leader got our request — then we were the one who actually managed to set it to one, and we win. If somebody else changed it to one after we read it, the leader will tell us that we lost. So you can do test-and-set with this pattern also, and you should remember this strategy.
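Here's a sketch of that test-and-set in Go, using the same hypothetical interface and the same mini-transaction pattern; the znode contents ("0"/"1") and the error value are my assumptions, not VMware FT's or ZooKeeper's actual format.

```go
package zksketch

import "errors"

// Hypothetical sentinel for a conditional write that lost the race.
var errVersionMismatch = errors.New("zk: version mismatch")

// tasClient is the slice of the hypothetical client used by this sketch.
type tasClient interface {
	GetData(path string, watch bool) (data []byte, version int32, events <-chan struct{}, err error)
	SetData(path string, data []byte, version int32) error
}

// TestAndSet tries to flip the flag stored at path from "0" to "1", the way
// a VMware-FT-style backup might claim the right to go live. It returns true
// if this caller is the one who flipped the flag, false if it was already
// set or another client flipped it first.
func TestAndSet(zk tasClient, path string) (bool, error) {
	for {
		data, version, _, err := zk.GetData(path, false)
		if err != nil {
			return false, err
		}
		if string(data) != "0" {
			return false, nil // already set: we lose
		}
		// Conditional write: only succeeds if nobody wrote since our read.
		err = zk.SetData(path, []byte("1"), version)
		if err == nil {
			return true, nil // the version hadn't changed: we win
		}
		if !errors.Is(err, errVersionMismatch) {
			return false, err
		}
		// Someone intervened between our read and our write: re-check.
	}
}
```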
OK, the next example I want to talk about is these locks. I'm talking about this because it's in the paper, not because I strongly believe this kind of lock is useful. They have an example in which acquire has a couple of steps. Step one: there's a lock file, and we try to create the lock file — some agreed-on name — with ephemeral set to true. If the create succeeds, we've acquired the lock. If it doesn't succeed, that means the lock file already exists and somebody else has acquired the lock, so we want to wait for them to release it, and they're going to release the lock by deleting this file. So step two: we call exists with watch set to true. If the file still exists — which we expect, because if it didn't exist, presumably the create would have succeeded and we'd have returned already — then step three: we wait for the watch notification. And step four: go to step one. So the usual deal is: we call create, and maybe we win; if it fails, we wait for whoever owns the lock to release it; we get the watch notification when the file is deleted; at that point the wait finishes, we go back to step one and try to create the file again, and hopefully we get it this time.
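A Go sketch of that acquire loop, against another hypothetical slice of a client; the lock is just an agreed-on ephemeral znode, and release is deleting it (or crashing, since the znode is ephemeral).

```go
package zksketch

// simpleLockClient is a slice of a hypothetical client; CreateEphemeral is
// exclusive, and the znode vanishes if our session dies.
type simpleLockClient interface {
	CreateEphemeral(path string, data []byte) error
	Exists(path string, watch bool) (exists bool, events <-chan struct{}, err error)
	Delete(path string) error
}

// AcquireSimpleLock is the paper's first (herd-prone) lock:
//  1. try to create the ephemeral lock file; success means we hold the lock
//  2. otherwise call exists with a watch
//  3. if the file still exists, wait for the watch notification
//  4. go back to step 1
func AcquireSimpleLock(zk simpleLockClient, path string) error {
	for {
		// Step 1. (A real client would distinguish "already exists" from
		// other errors; this sketch treats any error as "lock is held".)
		if err := zk.CreateEphemeral(path, nil); err == nil {
			return nil
		}
		// Step 2: someone else holds the lock; watch for the file to change.
		exists, events, err := zk.Exists(path, true)
		if err != nil {
			return err
		}
		if !exists {
			continue // released between our create and exists: retry now
		}
		<-events // step 3: wait for the delete, then step 4: retry the create
	}
}

// ReleaseSimpleLock deletes the lock file; if the holder crashes instead,
// ZooKeeper deletes the ephemeral znode on its behalf.
func ReleaseSimpleLock(zk simpleLockClient, path string) error {
	return zk.Delete(path)
}
```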
Now we should ask ourselves questions about possible interleavings of other clients' activities with our four steps. One thing we know for sure already: if another client calls create at the same time, the ZooKeeper leader is going to process those two create RPCs one at a time, in some order. Either my create is executed first or the other client's create is. If mine executes first, I get true back and acquire the lock, and the other client is guaranteed to get false; if theirs is processed first, they get true and I'm guaranteed to get false. In either case the file gets created, so we're okay with simultaneous executions of step one.

Another question: if create doesn't succeed for me and I'm about to call exists, what happens if the lock is released between the create and the exists? That's the reason I have an "if" around the exists: the lock might actually be released before I call exists, because it could have been acquired quite a long time ago by some other client. If the file doesn't exist at that point, the exists check fails and I just go directly back to step one and try again.

Similarly — and actually more interesting — what happens if whoever holds the lock releases it just as I call exists, or while the replica I'm talking to is in the middle of processing my exists request? The answer is that whatever replica I'm talking to, its log guarantees that writes occur in some order; the replica proceeds through that log, and my exists call — a read-only request — is guaranteed to be executed at a definite point between two entries in the write stream. The issue is that somebody's delete request is being processed at about this time, so somewhere in the log of the replica I'm talking to is going to be the delete request from the other client. My exists RPC is either processed completely before that delete — in which case the replica sees that the file still exists, inserts the watch information into its watch table at that point, and only then executes the delete, so when the delete comes in we're guaranteed my watch is in the replica's watch table and it will send me a notification — or my exists request is executed at a point after the delete happened, the file doesn't exist, and the call returns false; a watch table entry is still entered, but we don't care. So it's quite important that the writes are sequenced and that reads happen at definite points between writes.

(Student question.) Yes — that's the case where the exists is executed after the delete: the file doesn't exist at that point, exists returns false, we don't wait, we go to step one, we create the file, and we return. We did install a watch, and that watch will be triggered by our own create — it doesn't really matter, because we're not waiting for it. In the other version of that case: the file doesn't exist, we go to step one, but somebody else has created the file in the meantime; we try to create the file, that fails, and we install another watch, while the earlier watch is one we're no longer waiting for. It doesn't really matter in the moment — it's not harmful to come out of a wait early, it's just wasteful. Anyway, with all this history, this code leaves watches sitting around in the system, and I don't actually know whether my new watch on the same file overrides my old watch — I'm not actually sure.

OK, finally: this example and the previous example suffer from the herd effect. The herd effect is what we were talking about when we were worrying that if a thousand clients all try to increment at the same time, it's going to have N-squared complexity in how long it takes to get through all thousand clients. This lock scheme also suffers from the herd effect: if there are a thousand clients trying to get the lock, the amount of time required to grant the lock to each of the thousand clients is proportional to a thousand squared, because after every release, all of the remaining clients get triggered by the watch, all of them go back up and send in a create, and so the total number of create RPCs generated is basically a thousand squared. So this suffers from the herd: the whole herd of waiting clients is beating on ZooKeeper. Another name for this is that it's a non-scalable lock. This is a real problem, and we'll see it in other systems soon enough — a serious kind of problem. The paper actually talks about how to solve it using ZooKeeper, and the interesting thing is that ZooKeeper is actually expressive enough to build a more complex lock scheme that doesn't suffer from this herd effect.
Even if a thousand clients are waiting, the cost of one client giving up the lock and another acquiring it is order 1 instead of order N. Because it's a little bit complex, this is the pseudocode in the paper, in section 2.4 — it's on page 6 if you want to follow along.

This time there is not a single lock file. (Student question.) Yes — the lock name is just a name that allows us all to talk about the same lock. Once I've acquired the lock I can do whatever the lock was protecting: maybe only one of us at a time should be allowed to give a lecture in this lecture hall, and if you want to give a lecture here you first have to acquire the lock called 34-100. That name — yes, it's a znode in ZooKeeper, but nobody cares about its contents; we just need to be able to agree on a name for the lock. That's the sense in which this API looks like a file system but is really a naming system.

All right. Step one: we create a sequential file. We give it a prefix name, but what actually gets created — if this is the 27th sequential file created with prefix "f" — is maybe "f27" or something. In the sequence of writes that ZooKeeper is working through, successive creates get ascending — guaranteed ascending, never descending — sequence numbers when you create a sequential file.

There was an operation I left off the earlier list: you can get a list of files. You give the name of a znode that's actually a directory with files in it, and you get a list of all the files currently in that directory. So step two: we list the files, say list f*, and get some list back. We created a file, and the system allocated us a number; we can look at that number, and — step three — if there's no lower-numbered file in the list, we win, we've acquired the lock, and we can return. If there is a lower-numbered file, then what's going on is that these sequentially numbered files are setting up the order in which the lock is going to be granted to the different clients. If we're not the winner, what we need to do is wait for the client who created the previous-numbered file to acquire and then release the lock. The convention for releasing the lock in this system is to remove your sequential file, so we want to wait for the previous-numbered sequential file to be deleted, and then it's our turn and we get the lock. So step four: we call exists on the next-lower-numbered file, with watch set to true — mostly to set a watch point. Step five: if that file still exists, we wait for the watch notification. And step six: we go back — not to creating the file, because our file already exists — we go back to step two, listing the files. Release is just delete: if I acquired the lock, when I'm done I delete the file I created, complete with my number.
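Here's a Go sketch of that acquire/release protocol, again against a hypothetical client (the "f27"-style name format and the method shapes are my assumptions); the key property is that each waiter watches only its immediate predecessor.

```go
package zksketch

import (
	"strconv"
	"strings"
)

// seqLockClient is a slice of a hypothetical client. CreateSeqEphemeral
// creates an ephemeral znode named prefix plus a monotonically increasing
// sequence number and returns the child name it chose (e.g. "f27").
type seqLockClient interface {
	CreateSeqEphemeral(dir, prefix string) (name string, err error)
	List(dir string) ([]string, error)
	Exists(path string, watch bool) (exists bool, events <-chan struct{}, err error)
	Delete(path string) error
}

// seqNum extracts the numeric suffix of a name like "f27" (assumed format).
func seqNum(name, prefix string) int {
	n, _ := strconv.Atoi(strings.TrimPrefix(name, prefix))
	return n
}

// AcquireSeqLock is the herd-free lock of section 2.4: create a sequential
// ephemeral file, then wait only on the next-lower-numbered file.
func AcquireSeqLock(zk seqLockClient, dir, prefix string) (myFile string, err error) {
	myFile, err = zk.CreateSeqEphemeral(dir, prefix) // step 1
	if err != nil {
		return "", err
	}
	mine := seqNum(myFile, prefix)
	for {
		names, err := zk.List(dir) // step 2 (and step 6 after waking up)
		if err != nil {
			return "", err
		}
		// Find our immediate predecessor: the largest number below ours.
		prev, prevNum := "", -1
		for _, n := range names {
			if !strings.HasPrefix(n, prefix) {
				continue
			}
			if num := seqNum(n, prefix); num < mine && num > prevNum {
				prev, prevNum = n, num
			}
		}
		if prev == "" {
			return myFile, nil // step 3: no lower-numbered file, lock acquired
		}
		// Steps 4-5: watch only our predecessor; its file disappears either
		// when that client releases the lock or when it dies (ephemeral).
		exists, events, err := zk.Exists(dir+"/"+prev, true)
		if err != nil {
			return "", err
		}
		if exists {
			<-events
		}
		// Back to step 2: our predecessor may have died without ever holding
		// the lock, so we must re-list rather than assume it's our turn.
	}
}

// ReleaseSeqLock releases the lock by deleting our own sequential file.
func ReleaseSeqLock(zk seqLockClient, dir, myFile string) error {
	return zk.Delete(dir + "/" + myFile)
}
```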
(Student question.) Why do you need to list the files again? That's a good question. We got the list of files, we know the next-lower-numbered file, and one guarantee of sequential file creation is that once file 27 is created, no file with a lower number will ever subsequently be created, so nothing else can sneak in below us. So why list again — why not just go back to waiting for that same lower-numbered file? Can anyone guess the answer? The way this code works, the answer is that whoever was next-lower might have either acquired and released the lock before we noticed, or have died. These are ephemeral files: even if we're 27th in line, number 26 may have died before getting the lock, and if number 26 dies, the system automatically deletes its ephemeral file. If that happened, we now need to wait for number 25 — the next one down — and not for 26, which has gone away. That's why we have to go back and re-list the files: in case our predecessor in the list of waiting clients turned out to die. (Student question.) Yes — if there's no lower-numbered file, then you have acquired the lock, absolutely.

(Student question.) How does this not suffer from the herd effect? Suppose we have a thousand clients waiting, the lock has made its way through the first five hundred of them, and client 500 currently holds the lock. Every waiting client is sitting there waiting for an event, but only the client that created file 501 is waiting for the deletion of file 500. Everybody is waiting for the next lower number — 500 was waiting for 499, 499 for 498 — so everybody is waiting for just one file. When I release the lock, there's only one other client — the next-higher-numbered one — waiting for my file, so one client gets a notification, one client goes back and lists the files, and that one client now has the lock. So no matter how many clients there are, the expense of each release-and-acquire is a constant number of RPCs, whereas the expense of a release-and-acquire in the earlier scheme is that every single waiting client is notified and every single one of them sends a write request — the create — into ZooKeeper.

(Student question: what does the client actually do while it waits?) Oh, you're free to go get a cup of coffee.
What the programming interface looks like is not really our business here, but there are two options for what this actually means in the program. One is that there's some thread in a synchronous wait: it made a function call saying "please acquire this lock", and the call doesn't return until the lock is finally acquired or the notification comes back. A much more sophisticated interface would be one in which you fire off requests to ZooKeeper and don't wait, and then separately there's some way of seeing whether ZooKeeper has said anything recently — maybe a goroutine whose whole job is to wait for the next thing from ZooKeeper, in the same sense that you might read the apply channel and all kinds of interesting stuff comes up on the apply channel. That's the more likely way to structure this. But either way — through threading or some sort of event-driven thing — you can do something else while you're waiting.

(Student question.) Yes — or if the client before me has neither died nor released: if the file before me exists, that means either that client is still alive and still waiting for the lock, or still alive and holds the lock; we don't really know which. (Follow-up.) It does, as long as that client — client 500 — is still alive. If the exists call fails, that means one of two things: either my predecessor held the lock and has released it and deleted its file, or my predecessor never held the lock, it exited, and ZooKeeper deleted its file because it was ephemeral. So there are two reasons to come out of this wait, or for the exists to return false, and that's why we check everything — you really don't know what the situation is after the exists completes. (Another comment.) That might — yeah, maybe that could be made to work, that sounds reasonable, and it preserves the scalable nature of this, in that each acquire and release only involves a few clients — two clients.

All right. I actually first saw this pattern in a totally different context: scalable locks for threading systems. For most of the world this is called a scalable lock. I find it one of the more interesting constructions I've ever seen, and I'm impressed that ZooKeeper is able to express it; it's a valuable construct. Having said that, I'm a little bit at sea about why the paper talks about locks at all, because these locks are not like threading locks in Go. In threading there's no notion of threads failing — at least if you don't want there to be, there's no notion of threads just randomly dying in Go — and so really the only thing you're getting out of a mutex in Go, if everybody uses mutexes correctly, is atomicity for the sequence of operations inside the mutex. If you take out a lock in Go, do 47 different reads and writes of a lot of variables, and then release the lock, and everybody follows that locking strategy, nobody is ever going to see some weird intermediate version of the data as of halfway through your update — it just makes things atomic, no argument. These locks aren't really like that, because if the client that holds the lock fails, the lock just gets released and somebody else can pick it up.
So this does not guarantee atomicity, because you can get partial failures in distributed systems in a way you don't really get partial failures in ordinary threaded code. If the current lock holder had the lock, needed to update a whole bunch of things protected by that lock before releasing it, got halfway through updating them, and then crashed, the lock will get released, you'll get the lock, and yet when you go to look at the data it's garbage — it's just whatever random state it was in the middle of updating. So these locks don't by themselves provide the same atomicity guarantee that threading locks do, and we're sort of left to imagine for ourselves why you would want to use them, and why this is one of the main examples in the paper.

I think if you use locks like this in a distributed system, you have two general options. One is that everybody who acquires the lock has to be prepared to clean up from some previous disaster. You acquire the lock, you look at the data, and you try to figure out: if the previous owner of the lock crashed, how can I tell while I'm looking at the data, and what do I do to fix the data up? You can play that game — especially if the convention is that you always update in a particular sequence, you may be able to detect where in that sequence the previous holder crashed, assuming it crashed — but it's a tricky game that requires thought of a kind you don't need for thread locking.

The other way these locks might make sense is as soft locks protecting something that doesn't really matter. For example, if you're running MapReduce jobs, with map tasks and reduce tasks, you could use this kind of lock to make sure only one worker executes each task: a worker that's going to run task 37 gets the lock for task 37, executes it, marks it as executed, and releases the lock. The way MapReduce works, it's actually proof against crashed workers anyway, so if you grab the lock and crash halfway through your MapReduce task — so what? Your lock is released when you crash, the next worker that gets the lock sees you didn't finish the task and just re-executes it, and that's just not a problem because of the way MapReduce is defined. So you could use these locks for that kind of soft-lock thing.

And maybe the other thing we should be thinking about is that some version of this could be used to do things like elect a master. If what we're really doing here is electing a master, we could use code much like this, and that would probably be a reasonable approach.
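As a sketch of that last idea, here's one way an election built from the same pieces might look in Go: whoever manages to create an agreed-on ephemeral znode is the master, and everyone else reads and watches it, retrying when the current master's session dies. This is my own illustration on the hypothetical client, not code from the paper, and the staleness caveats from the read discussion earlier still apply.

```go
package zksketch

// electClient is a slice of a hypothetical client used by this sketch.
type electClient interface {
	CreateEphemeral(path string, data []byte) error // exclusive; znode dies with our session
	GetData(path string, watch bool) (data []byte, events <-chan struct{}, err error)
}

// RunElection loops forever: whoever creates the agreed-on ephemeral znode
// is the master; everyone else reads the winner's address from the znode and
// watches it, re-running the election when the znode changes or disappears
// (for example because the master's session died).
func RunElection(zk electClient, path, myAddr string, becameMaster func(), newMaster func(addr string)) {
	for {
		if err := zk.CreateEphemeral(path, []byte(myAddr)); err == nil {
			becameMaster() // we won; act as master until we crash or resign
			return
		}
		// Someone else is master: learn its address and watch for a change.
		data, events, err := zk.GetData(path, true)
		if err != nil {
			continue // e.g. the master died between our create and our read
		}
		newMaster(string(data))
		<-events // master znode changed or was deleted: run the election again
	}
}
```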
(Student question about the "ready" file in the paper.) Yes — remember the text in the paper that says it's going to delete the ready file, then do a bunch of updates to files, and then re-create the ready file. That is a fantastic way of detecting and coping with the possibility that the previous lock holder, or the previous master or whoever it was, crashed halfway through — because, gosh, the ready file never got re-created.

(Student question.) In a Go program — yeah, sadly that is possible. So the question is nothing about ZooKeeper: if you're writing threaded code in Go and a thread acquires a lock, could it crash while holding the lock, halfway through whatever it's supposed to be doing? The answer is yes, there actually are ways for an individual thread to crash in Go — I forget exactly where they are, maybe divide by zero, certain panics — anyway, it can happen. My advice about how to think about that is that the program is now broken and you've got to kill it, because in threaded code the deal with locks is that while the lock is held, the invariants on the data don't hold. There's no safe way to proceed if the lock holder crashes, because all you know is that whatever invariants the lock was protecting no longer hold. If you do want to proceed, you have to leave the lock marked as held so that no one else will ever be able to acquire it, and unless you have some clever idea, that's pretty much the way you have to think about it in a threaded program, because that's the style in which people write threaded locking code. If you're super clever you could play the same kinds of tricks, like this ready-flag trick, but it's super hard in Go, because the memory model says there is nothing you can count on unless there's a happens-before relationship. If you play the game of changing some variables and then setting a done flag, that doesn't mean anything unless you release a lock and somebody else acquires the lock; only then can anything be said about the order in which — or even whether — the updates happened. So it's very, very hard in Go to recover from a crash of a thread that holds a lock; here, in ZooKeeper, it's maybe a little more possible.

OK, that's all I want to talk about with ZooKeeper. There are two high-level take-aways. One is the clever idea for high performance — reading from any replica — though it sacrifices a bit of consistency. The other interesting take-home is that they worked out an API that really does let ZooKeeper be a general-purpose coordination service, in a way that simpler schemes like put/get interfaces just can't: a set of functions that allows you to do things like write mini-transactions and build your own locks. It all works out, although it requires care.

OK, now I want to turn to today's paper, which is CRAQ. There are a couple of reasons we're reading the CRAQ paper.
for 57:41 fault tolerance and as we'll see the 57:43 properties you get out of crack or its 57:46 predecessor chain replication are very 57:49 different in interesting ways from the 57:52 properties you get out of a system like 57:54 raft and so I'm actually going to talk 57:58 about so crack is sort of an 58:00 optimization to an older scheme called 58:01 chain replication chain replications 58:08 actually fairly frequently used in the 58:11 real world there's a bunch of systems 58:12 that use it 58:14 crack is an optimization to it that 58:16 actually does a similar trick - 58:18 zookeeper where it's trying to increase 58:20 weed throughput by allowing reads to two 58:24 replicas to any replicas so that you get 58:26 you know number of replicas factor of 58:29 increase in the read performance the 58:32 interesting thing about crack is that it 58:34 does that while preserving 58:39 linearise ability 58:41 unlike zookeeper which you know it 58:43 seemed like in order to be able to read 58:44 from any replica they had to sacrifice 58:46 freshness and therefore snot 58:47 linearizable crack actually manages to 58:50 do these reads from any replica while 58:53 preserving strong consistency I'm just 58:56 pretty interesting okay so first I want 59:00 to talk about the older system chain 59:01 replication teen replication is a it's 59:10 just a scheme for you have multiple 59:11 copies you want to make sure they all 59:13 seen the same sequence of right so it's 59:14 like a very familiar basic idea but it's 59:17 a different topology then raft so the 59:21 idea is that there's a chain of servers 59:25 and chain replication and the first one 59:29 is called the head last one's called the 59:32 tail when a right comes in when a client 59:36 wants to write something say some client 59:39 it sends always Albright's get sent to 59:42 the head the head updates its or 59:46 replaces its current copy of the data 59:48 that the clients writing so you can 59:49 imagine be go put key value store so you 59:54 know if everybody started out with you 59:56 know version a of the data and under 59:58 chain replication when the head process 60:01 is the right and maybe we're writing 60:02 value B you know the head just replaces 60:04 its a with a B and passes the right down 60:07 the chain as each node sees the right it 60:11 replaces over writes its copy the data 60:13 the new data when the right gets the 60:17 tail the tail sends the reply back to 60:21 the client saying we completed your 60:23 right 60:25 that's how rights work reads if a client 60:30 wants to do a read it sends the read to 60:33 the tail the read request of the tail 60:35 and the tail just answers out of its 60:38 current state so if we ask for this 60:40 whatever this object was the tail which 60:42 is I hope current values be weeds are a 60:45 good deal simpler 60:52 okay so it should think for a moment 60:55 like why to chain chain replication so 60:59 this is not crack just to be clear this 61:01 is chain replication chain replication 61:03 is linearizable you know in the absence 61:08 of failures what's going on is that we 61:10 can essentially view it as really than 61:12 the purposes of thinking about 61:14 consistency it's just this one server 61:16 the server sees all the rights and it 61:19 sees all the reads and process them one 61:21 at a time and you know a read will just 61:24 see the latest value that's written and 61:25 that's pretty much all there is to it 61:27 from the point of view look if there's 61:29 no crashes what 
We should think for a moment about why chain replication — and to be clear, this is chain replication, not CRAQ. Chain replication is linearizable in the absence of failures. What's going on is that, for the purposes of thinking about consistency, we can essentially view the system as just this one server, the tail: the tail sees all the writes and all the reads, it processes them one at a time, a read just sees the latest value written, and that's pretty much all there is to it. From the point of view of consistency, if there are no crashes, it's pretty simple.

For failure recovery, a lot of the rationale behind chain replication is that the set of states you can see after a failure is relatively constrained, because of this very regular pattern in how the writes get propagated. At a high level, any committed write — any write that could have been acknowledged to the writing client, or exposed in a read — can only have been acknowledged or exposed if it reached the tail, and in order to reach the tail it had to pass through, and be processed by, every single node in the chain. So if we ever exposed a write, acknowledged it, or used it in a read, every single node in the chain must know about that write. We don't get situations like Figures 7 and 8 in the Raft paper, where there's hair-raising complexity in how the different replicas can differ after a crash. Either a write is committed, in which case it's known everywhere, or — if a write isn't committed — then before whatever crash disturbed the system, the write got to a certain point in the chain: it's everywhere before that point and nowhere after it, because writes always propagate down the chain in order. Those are really the only two situations.

At a high level, failure recovery is relatively simple too. If the head fails, then to a first approximation the next node can simply take over as head and nothing else needs to be done: any write that made it as far as the second node while the failed head was still head will keep on going and will commit. If a write made it to the head before the crash but the head didn't forward it, that write is definitely not committed, nobody knows about it, and we definitely didn't send an acknowledgment to the writing client, because the write never got down to the tail — so we're not obliged to do anything about a write that only reached a head that then failed. Maybe the client will re-send it, but that's not our problem here. If the tail fails, it's very similar: the next-to-last node can directly take over, because everything the tail knew, the node just before it also knows — the tail only hears things from the node just before it. It's a little more complex if an intermediate node fails, but basically what needs to be done is to drop it from the chain; there may be writes it had received that the next node hasn't received yet, so if we drop a node out of the chain, its predecessor may need to re-send recent writes to its new successor. That's the recovery in a nutshell.

Why this construction — why this instead of something else, like Raft, for example?
The performance reason is that in Raft, if you recall, we have a leader and some number of replicas, and the leader is not arranged in a chain: the replicas are all directly fed by the leader. If a client write comes in — or a client read, for that matter — the leader has to send it itself to each of the replicas, whereas in chain replication the head only has to send it once, and these sends on the network are actually reasonably expensive. That means the load on a Raft leader is going to be higher than the load on a chain-replication head, so as the number of client requests per second goes up, a Raft leader will hit a limit and stop being able to go faster sooner than a chain-replication head, because it's doing more work. Another interesting difference between chain replication and Raft is that in Raft the reads are also required to be processed by the leader — the leader sees every single request from clients — whereas here the head sees all the writes but only the tail sees the read requests, so there may be an extent to which the load is split between the head and the tail rather than concentrated in the leader. And, as I mentioned before, the analysis required to think about the different failure scenarios is a good deal simpler in chain replication than in Raft, and that's a big motivation, because it's hard to get this stuff correct.

(Student question.) Yeah — if the tail fails, but its predecessor had seen a write that the tail hadn't seen, then the failure of the tail basically commits that write: it's now committed because it has reached the new tail, so the new tail could respond to the client. It probably won't, because it wasn't the tail when it received the write, so the client may re-send the write, and that's too bad — we need duplicate suppression, probably at the head. Basically all the systems we're talking about require, in addition to everything else, suppression of duplicate client requests.

(Student question.) You want to know who makes the decisions about reconfiguration — that's an outstanding question. Let me rephrase it a bit: if there's a failure — suppose the second node stops being able to talk to the head — can the second node just take over? Can it decide for itself, "gosh, the head seems to have gone away, I'm going to take over as head and tell clients to talk to me instead of the old head"? What do you think? That's not a good plan: with the usual assumptions we make about how the network behaves, it's a recipe for split brain. If you do exactly what I said, what may really have happened is that the network failed: the head is totally alive and thinks its successor has died, while the successor — also actually alive — thinks the head has died. They both say "well, that other server seems to have died, I'm going to take over": the head says "I'll just be a sole replica and act as both the head and the tail, because the rest of the chain seems to have gone away", and the second node does the same thing.
Yes, another question: you want to know who makes the decisions about reconfiguration. That's an outstanding question. Let me rephrase it a bit: if there's a failure, say the second node stops being able to talk to the head, can that second node just take over? Can it decide for itself, "gosh, the head seems to have died, I'm going to take over as head and tell clients to talk to me instead of the old head"? What do you think? That's not a great plan. With the usual assumptions we make about how the network behaves, that's a recipe for split brain. If you do exactly what I said, then of course what really happened is that the network failed: the head is totally alive and thinks its successor has died, the successor is actually alive and thinks the head has died, and they both say, "well, gosh, that other server seems to have died, I'm going to take over." The head says, "I'll just be the sole replica and act as both head and tail, because the rest of the chain seems to have gone away," the second node does the same thing, and now we have two independent, split-brain versions of the data which will gradually get out of sync.

So this construction is not proof against network partitions and has no defense against split brain, and what that means in practice is that it cannot be used by itself. It's a helpful thing to have in our back pocket, but it's not a complete replication story. It is very commonly used, but in a stylized way in which there is always an external authority, not the chain itself, that makes the call on who's alive and who's dead and makes sure everybody agrees on a single story about what constitutes the chain. There's never any disagreement where some parties think the chain is one set of nodes and others think it's a different set.

That external authority is usually called a configuration manager. Its job is just to monitor the aliveness of all the servers, and every time the configuration manager thinks a server is dead, it sends out a new configuration in which the chain has a new definition: a new head, a new tail, whatever. The server that the configuration manager thinks is dead may or may not actually be dead, but we don't care, because everybody is required to follow the new configuration, so there can't be any disagreement; there's only one party making these decisions, and it's not going to disagree with itself. Of course, how do you make a service that's fault tolerant, doesn't disagree with itself, and doesn't suffer from split brain under network partitions? The answer is that the configuration manager usually uses Raft or Paxos, or, in the case of CRAQ, ZooKeeper, which itself is of course built on a Raft-like scheme.

So the usual complete setup in your data center is that you have a configuration manager, based on Raft or Paxos or whatever, so it's fault tolerant and does not suffer from split brain, and then you split your data up over a bunch of chains. If you have a room with a thousand servers in it, the configuration manager decides what the chains should look like: chain A is made of server 1, server 2, server 3; chain B is server 4, server 5, server 6; and so on. It tells everybody this whole list, so all the clients know it and all the servers know it, and the individual servers' opinions about whether other servers are alive or dead are totally neither here nor there. If a server in a chain really does die, the head is required to keep trying indefinitely until it gets a new configuration from the configuration manager; it's not allowed to make its own decisions about who's alive and who's dead.

What's that, what if the configuration manager itself fails? Oh boy, then you've got a serious problem. That's why you replicate it using Raft, make sure the different replicas are on different power supplies, the whole works. But the construction I've set up here is extremely common: it's how chain replication is intended to be used, and how CRAQ is intended to be used.
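A minimal sketch of that discipline, under the assumption that the manager stamps each configuration with an increasing epoch number (the names ChainConfig, Install, and Role are hypothetical): a server acts only on the newest configuration it has received, never on its own suspicions.

```go
package main

import "fmt"

// ChainConfig is a hypothetical configuration record a manager might hand
// out: an epoch number plus the chain members in head-to-tail order.
type ChainConfig struct {
	Epoch int64
	Chain []string // Chain[0] is the head, Chain[len-1] is the tail
}

// Server is a chain node that only ever follows the manager's configuration.
type Server struct {
	name string
	cfg  ChainConfig
}

// Install replaces the server's configuration if the epoch is newer; stale
// configurations are ignored.
func (s *Server) Install(c ChainConfig) {
	if c.Epoch > s.cfg.Epoch {
		s.cfg = c
	}
}

// Role reports what this server should be doing under its current config.
func (s *Server) Role() string {
	for i, n := range s.cfg.Chain {
		switch {
		case n != s.name:
			continue
		case i == 0:
			return "head"
		case i == len(s.cfg.Chain)-1:
			return "tail"
		default:
			return "middle"
		}
	}
	return "not in chain (stop serving)"
}

func main() {
	s := &Server{name: "S2"}
	s.Install(ChainConfig{Epoch: 1, Chain: []string{"S1", "S2", "S3"}})
	fmt.Println(s.Role()) // middle
	// The manager decides S1 is dead and pushes epoch 2; S2 must follow it
	// even if S2 itself still believes S1 is alive.
	s.Install(ChainConfig{Epoch: 2, Chain: []string{"S2", "S3"}})
	fmt.Println(s.Role()) // head
}
```

The epoch check is what lets every server safely ignore stale configurations, so even a server that was cut off for a while ends up following the same single story as everyone else.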
The logic of it is that chain replication, if you don't have to worry about partitions and split brain, lets you build very high-speed, efficient replication systems. We're sharding the data over many chains, and each individual chain can be built to be just the most efficient scheme for the particular kind of data you're replicating, read-heavy or write-heavy or whatever, without worrying too much about partitions, because all of that worry is concentrated in the reliable, non-split-brain configuration manager.

Okay, so the question is: why are we using chain replication here instead of Raft? It's a totally reasonable question. It doesn't really matter for this overall construction, because even if we used Raft inside each group, we would still need one party to make a decision, with which there can be no disagreement, about how the data is divided over our hundred different replication groups. In any kind of big system you're sharding, splitting up the data, and somebody needs to decide how the data is assigned to the different replication groups; this has to change over time as you get more or less hardware, more data, or whatever. So if nothing else, the configuration manager is saying, "keys starting with A or B go here, keys starting with C or D go there," even if you use Paxos within each group.

Then there's the smaller question of what to use for replication within each group: chain replication, Paxos, Raft, or whatever. People do different things. Some people do actually use Paxos-based replication; Spanner, which I think we're going to look at later in the semester, has this structure but uses Paxos to replicate writes to the data. The reason you might not want to use Paxos or Raft is that it's arguably more efficient to use the chain construction, because it reduces the load on the leader, and that may or may not be a critical issue for you.

A reason to favor Raft or Paxos is that they do not have to wait for a lagging replica. Chain replication has a performance problem here: if one of the replicas is slow, even for a moment, then, because every write has to go through every replica, even a single slow replica slows down all write operations, and that can be very damaging. If you have thousands of servers, probably at any given time seven of them are out to lunch, unreliable or slow because somebody is installing new software, who knows what, and it hurts to have every request limited by the slowest server. Whereas with Raft or Paxos, if one of the followers is slow it doesn't matter, because the leader only has to wait for a majority, not for all of them. Ultimately they all have to catch up, but Raft is much better at riding out transient slowdowns.
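As a toy illustration of that last point, with made-up numbers and a deliberately simplified model rather than anything from the paper: a chain write accumulates every node's delay on its way to the tail, while a majority scheme waits only for the fastest half of the followers.

```go
package main

import (
	"fmt"
	"sort"
)

// chainLatency: a write travels head -> ... -> tail one hop at a time, so its
// commit latency is roughly the sum of the per-node delays; one slow node
// stalls every write that has to pass through it.
func chainLatency(delays []float64) float64 {
	total := 0.0
	for _, d := range delays {
		total += d
	}
	return total
}

// majorityLatency: a Raft-style leader sends to all followers in parallel and
// commits once a majority of the cluster (leader included) has the entry, so
// it waits only for the fastest n/2 followers.
func majorityLatency(leaderDelay float64, followerDelays []float64) float64 {
	n := len(followerDelays) + 1 // cluster size
	need := n / 2                // follower acks needed beyond the leader itself
	if need == 0 {
		return leaderDelay
	}
	sorted := append([]float64{}, followerDelays...)
	sort.Float64s(sorted)
	return leaderDelay + sorted[need-1]
}

func main() {
	// Three replicas; the last one is briefly slow (say, a software install).
	fmt.Println(chainLatency([]float64{1, 1, 50}))    // 52: every write eats the 50ms
	fmt.Println(majorityLatency(1, []float64{1, 50})) // 2: the majority skips the slow one
}
```

The exact numbers don't matter; the point is that the chain's write latency tracks its slowest member, while the majority rule lets Raft ignore a transient straggler, at the cost of having to catch it up later.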
Some Paxos-based systems, although not really Raft, are also good at dealing with the possibility that the replicas are in different data centers, maybe far from each other: because you only need a majority, you don't necessarily have to wait for acknowledgments from a distant data center, and that can also lead people to use Paxos- or Raft-like majority schemes rather than chain replication. It depends very much on your workload and what you're trying to achieve, but this overall architecture, a configuration manager plus many replication groups, is, if not universal, extremely common.

Question from the audience about what failure patterns the network can actually produce. For a network that's not broken, the usual assumption is that all the computers can talk to each other through the network. For networks that are broken, because somebody stepped on a cable or some router is misconfigured, any crazy thing can happen. So yes, absolutely, due to misconfiguration you can get a situation where two nodes can both talk to the configuration manager, so the configuration manager thinks they're up, but they can't talk to each other. And that's a killer for this design: the configuration manager thinks they're up, they can't talk to each other, and it's just a disaster. If you need your system to be resistant to that, you need a more careful configuration manager, with logic that says, "I'm only going to form a chain out of servers that not only I can talk to, but that can also talk to each other," and explicitly checks that. I don't know if that's common, and I'm going to guess not, but if you were super careful you'd want to, because even though we talk about "network partition," that's an abstraction: in reality you can get any combination of who can talk to whom, and some combinations may be very damaging.

Okay, I'm going to wrap up. See you next week.