Transcript

Good evening, good night, wherever you are. Let's get started. Today I want to talk a little bit about chain replication; the paper assigned for today is from 2004.

Before diving into the paper, a couple of quick logistical things. One: I want to remind you that we have a quiz on Thursday. The instructions and the topics that are covered are on the schedule page, and we'll send out an announcement on Piazza with more details about exactly how we'll do the quiz. It's going to be on Gradescope, basically during class hours, 80 minutes, but more details to follow.

The second thing I want to remind people of is projects. If you would like to do a project instead of lab 4, you can do so, but you should submit a proposal for the project — just a couple of paragraphs — through the submission website, so that we can give you feedback and tell you whether the project is actually appropriate for a final project in 6.824. If you're just planning to do lab 4, there's absolutely nothing you have to do.

Any questions about these two logistical points?

Okay, then let me move on to one other point I wanted to bring up, which is a correction from a lecture a little while ago. We walked through the Go code for my Raft implementation of 2A and 2B, and we talked a little bit about the Go defer statement. I mentioned that you can have a defer statement inside of a block, and that is correct. I think it was Philippe who asked exactly when it executes, and I answered that question incorrectly. The deferred statement gets executed at the point where you return out of the function, not when you exit the enclosing block. I apologize if that caused any confusion.
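To make the corrected defer behavior concrete, here is a small self-contained Go example (my own illustration, not from the lab code): the deferred calls registered inside the loop do not run when each iteration's block ends; they all run, in last-in-first-out order, when the surrounding function returns.

```go
package main

import "fmt"

func example() {
	for i := 0; i < 3; i++ {
		// Each defer is registered here, but none of them runs when this
		// loop-body block ends -- they all run when example() returns.
		defer fmt.Println("deferred:", i)
	}
	fmt.Println("end of loop")
	// Output:
	//   end of loop
	//   deferred: 2
	//   deferred: 1
	//   deferred: 0
}

func main() {
	example()
}
```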
Any questions about that clarification?

Okay, good. Then the two technical topics I want to talk about today are the ZooKeeper locks, which I didn't get to finish last time, and then chain replication.

Both for chain replication and for ZooKeeper, we're still in the same context as before, namely replicated state machines. The usual diagram: we have a service that runs on top of some replication library, like ZAB or Raft. We have clients talking to the service; in the case of ZooKeeper, a client might send a create call. ZooKeeper internally holds some state — znodes hanging off one another, forming a tree — and when an operation comes in, ZooKeeper forwards that operation to the Raft/ZAB library. The library does some chatting back and forth to get a majority of the servers to accept the command, and at some point, once it's accepted, it comes back out; the server applies the operation and sends a response to the client. That's the standard replicated-state-machine story: if all the replicas start in the same state and apply the same operations in the same order, they end up in the same state, and so any of the machines can take over if necessary.

One of the things that was interesting about ZooKeeper is that read operations can be served from any one of the servers. This lets ZooKeeper get extremely high read performance, because you can scale the number of read operations with the number of servers. The flip side is that, in that scenario, ZooKeeper gave up on linearizability. We know from Raft, for example, that you can't arbitrarily serve a read from any server, because that server may not have seen the latest updates yet, and the same is true for ZooKeeper. So the operations that ZooKeeper defines don't provide a linearizable interface. Nevertheless, we saw that it provides a slightly different correctness guarantee than linearizability, and that guarantee is useful — useful enough to write real programs. The particular class of programs ZooKeeper focuses on is what they call configuration, or coordination, programs.

The way to think about it is that a lot of the systems we've looked at typically have some replication story, and then they have a coordinator or a master that coordinates the group. ZooKeeper is really intended as a service for that kind of master or coordinator role, and it provides a bunch of primitives to make that doable. We talked a little bit about atomic increment last week, along with some of the other services, and I want to finish off by talking about locks — one, because there were a lot of questions about them, and two, because they're actually quite interesting. There are two different lock implementations; let's start with the simple one. Let me write down the pseudocode and then we can talk about it in a little more detail.

The pseudocode for the lock is something like this. Acquire: in an infinite loop, try to create the lock file — call it lf — with the ephemeral flag set to true; we'll talk about why in a second.
If the create succeeds, then that client was the first one to create the file: it successfully acquired the lock, so it breaks out of the loop and returns. If the client was not able to create the file, it calls exists. The call to exists is not really to see whether the file exists — we already know it does — but to set a watch. The idea is that the watch will fire if the file disappears; when it disappears the client gets a notification, so all we do here is wait for that notification and then go around the loop again.

That's the acquire operation. The release is very simple: it does nothing more than send a delete operation to the ZooKeeper service for the lock file, lf. What does that do? When the ZooKeeper servers perform the delete, the file goes away, which fires the watch, and every client that is waiting gets a notification. They all retry; one of them succeeds in creating the lock file and proceeds, and the other ones go back into their call to exists and wait for the next notification. ZooKeeper's semantics — linearizability for write operations, plus the rules for when notifications fire — are strong enough that this implements a faithful lock: if many clients try to get the lock at the same time, only one gets it, and when the release happens, or the file is deleted, only one client in the next round gets it.

So that's cool, and it's interesting that you can build this kind of foundational primitive out of the primitives ZooKeeper offers. You can see the role of the watch above, and then there's the role of the ephemeral flag. The ephemeral flag is there because of what happens if a client fails or crashes before it calls release. The semantics of ephemeral files are that if the ZooKeeper service decides the client has crashed, it removes the file on behalf of the client. So even if the client fails or crashes, the server at some point decides the client is gone and removes the lock file lf, which causes notifications to be sent to the other clients that are waiting for it. It's a cool set of primitives with which you can build a powerful abstraction that's useful in applications.
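As a rough sketch of that acquire/release logic — written against a hypothetical ZooKeeper-like client interface, not the real client library API — it might look something like this in Go:

```go
package zlock

// ZK is a hypothetical ZooKeeper-like client interface used only for this
// sketch; the method names and signatures are assumptions, not the real API.
type ZK interface {
	// Create makes the znode and returns a non-nil error if it already exists.
	Create(path string, data []byte, ephemeral bool) error
	// ExistsWatch reports whether the znode exists and returns a channel
	// that fires when the znode is deleted.
	ExistsWatch(path string) (exists bool, watch <-chan struct{}, err error)
	Delete(path string) error
}

// Acquire blocks until this client has created the lock file lf.
func Acquire(zk ZK, lf string) error {
	for {
		err := zk.Create(lf, nil, true /* ephemeral */)
		if err == nil {
			return nil // we created the lock file: we hold the lock
		}
		// Someone else holds the lock: set a watch and wait for the file
		// to disappear, then retry.
		exists, watch, err := zk.ExistsWatch(lf)
		if err != nil {
			return err
		}
		if exists {
			<-watch // every waiter wakes up here (the herd effect below)
		}
	}
}

// Release deletes the lock file, which fires the watches of all waiters.
func Release(zk ZK, lf string) error {
	return zk.Delete(lf)
}
```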
One downside of this particular implementation is that it has what's called the herd effect. Say you have a thousand clients that all want to create the lock file and acquire the lock. One of them succeeds, and 999 call exists and wait for a notification. Then, when the first client deletes the file — releases the lock — all 999 try to acquire the lock again; of course only one succeeds, and 998 go back to waiting for a notification. Every round of this generates a huge amount of traffic, basically bombarding the ZooKeeper servers, because all but one of the attempts are going to fail. That's an undesirable property, this herd effect, and it's a real problem in practice, both on small-scale multi-core machines and certainly in a setting like this where network messages are not free.

So it's interesting that ZooKeeper provides enough primitives that you can do quite a bit better: you can build a lock that doesn't suffer from the herd effect — a better lock. Let me pull up the pseudocode for this, which is in the paper, so we can look at it and discuss why this lock is better. In particular, what we'll see is that this lock is better because there's no retry where all the clients that failed to get the lock retry at once; instead the clients form a line and get the lock one by one. The way you program that with ZooKeeper's primitives is this pseudocode, and there are a couple of differences compared to the previous one.

First, there's an additional flag passed to create, namely sequential, which means that each lock file that gets created is numbered: the first one is lock-0, the next one is lock-1, and so on. So if a thousand clients rush to the servers to acquire the lock, a thousand files get created, numbered from 0 to 999. Every client succeeds in creating a file, and the create returns the sequence number it got: the first client creates lock-0 and gets 0 back, the second gets 1 back, et cetera. Then the client asks for all the children of the directory under which these files are created — in this case maybe a thousand znodes — and checks whether its own n is the lowest znode in C. If it is, it holds the lock. That makes sense: the first client got 0 back, and all the other clients have higher numbers because they're sequentially numbered, so the first client gets the lock.

All the other clients find p, the number right in front of their own — for example, the client that got lock-10 back looks for lock-9 — and put a watch on that file. So every client has a watch on its predecessor, and the clients form a line. Each client then waits for its notification to fire. When client 0 releases the lock, it deletes its file, which fires the watch for the client right behind it; that client runs — and it's the only one that runs — and it succeeds.
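A rough sketch of this herd-free scheme, again against a hypothetical sequential-znode client (names and signatures are assumptions, not the real API), might look like this; it assumes the server zero-pads the sequence numbers so that lexicographic order matches numeric order:

```go
package zlock

import "sort"

// SeqZK is a hypothetical client interface with sequential znodes.
type SeqZK interface {
	// CreateSequential creates dir + "/lock-<n>" with a server-assigned,
	// monotonically increasing n, and returns the full path created.
	CreateSequential(dir string, data []byte, ephemeral bool) (path string, err error)
	Children(dir string) ([]string, error)
	// ExistsWatch reports whether the znode exists and returns a channel
	// that fires when it is deleted.
	ExistsWatch(path string) (exists bool, watch <-chan struct{}, err error)
	Delete(path string) error
}

// AcquireTicket is the herd-free lock: each client watches only the znode
// immediately before its own, so one release wakes exactly one waiter.
func AcquireTicket(zk SeqZK, dir string) (myPath string, err error) {
	myPath, err = zk.CreateSequential(dir, nil, true)
	if err != nil {
		return "", err
	}
	for {
		children, err := zk.Children(dir)
		if err != nil {
			return "", err
		}
		sort.Strings(children) // zero-padded sequence numbers sort in order
		if dir+"/"+children[0] == myPath {
			return myPath, nil // lowest znode: we hold the lock
		}
		// Find p, the znode just before ours, and wait for it to go away.
		var prev string
		for _, c := range children {
			full := dir + "/" + c
			if full >= myPath {
				break
			}
			prev = full
		}
		exists, watch, err := zk.ExistsWatch(prev)
		if err != nil {
			return "", err
		}
		if exists {
			<-watch
		}
		// Re-check the children rather than assuming we now hold the lock:
		// our predecessor may have crashed rather than released.
	}
}

func ReleaseTicket(zk SeqZK, myPath string) error {
	return zk.Delete(myPath)
}
```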
These are sometimes called ticket locks in multi-core programming, if you're familiar with them; it's the same idea as a ticket lock, except here it's built out of ZooKeeper's primitives. And again, what's interesting is that these primitives are powerful enough that you can build these kinds of locks. Any questions about this?

Okay, I want to make one more comment about these locks before moving on to chain replication.

TA: We have a question in the chat.

Instructor: Okay, what's the question in the chat?

TA: What is the watch on line 4 for?

Instructor: Good — going back to this: my comment about the watch actually belongs with line 5; there's no watch on line 4. Line 4 just finds p, the number right before your n. If that doesn't answer the question, please come back to it later, that's fine.

Student: I actually have another question. This is going back a few slides, but how does ZooKeeper determine that the client has failed, and thus release the ephemeral lock? What if it's just partitioned for a moment?

Instructor: Yes, that could be happening. The client has a session with the ZooKeeper service, and the client and ZooKeeper basically send heartbeats to each other. If the ZooKeeper service doesn't hear from the client for a little while, it just decides the client is down and closes the session. The client can still try to send messages on the session, but the session is closed — it's gone — and any ephemeral files that were created during that session are deleted.
So if the network partition heals, the client will try to send messages over that session, and the ZooKeeper servers will say: that session doesn't exist anymore; you have to start a new session.

Student: Got it, thank you.

Instructor: Okay, good. There's one important point about these — what I'll call z-locks, ZooKeeper locks — which is that they do not have the same semantics as the locks you've been using, the Go locks or mutexes. It's an important point to realize: even though they're different, as we'll see in a second they're still useful, but they're not as strong as the Go locks. The interesting case is when the lock holder fails. If the lock holder fails — ZooKeeper decides the lock holder has failed, as we just discussed — then it is possible that we're going to see some intermediate state. Remember, the whole role of a lock is to protect a critical section: some invariant is true when you enter, the invariant may not hold while you're inside the critical section, and at the end you re-establish the invariant. Here, a client acquires the lock, does some steps, and then maybe ZooKeeper decides the client has crashed and revokes the lock — but the system might be left in some intermediate state where the invariant doesn't hold. So it's not the case that these locks guarantee atomicity of a critical section.

So what are they useful for? There are two primary use cases. One is leader election: if we have a set of clients that need to select a leader among themselves, they can all try to create the lock file; the one that succeeds becomes the leader. That leader can clean up any intermediate state if necessary, or do its updates atomically using the ready trick, where you do a bunch of writes to some file but expose the file only at the very end, making a set of writes more transactional. That's one use case for these kinds of locks.

The second use case is what I'll call soft locks. The way to think about it: say we have a worker, MapReduce style, and we want to arrange that each map task gets executed by only one worker. One way to do that is to take out a lock for that particular input file, run the computation, and release the lock once the mapper is done. In the common case, this causes only one mapper to execute a particular task, which is exactly what we want.
But of course, if the mapper fails, the lock gets released, and the task might execute a second time, because somebody else will acquire the lock. For MapReduce, that's perfectly fine: it's okay if the task gets executed twice. In some ways it's really more of a performance optimization: in the usual case the task executes only once, but if there's a failure a map task might execute twice, and MapReduce is usually set up in such a way that that's okay. In those cases these sorts of locks are really useful too.

Any questions about this perspective on locks — that the ZooKeeper locks are not exactly like the Go locks? It's an important thing to keep in mind. All right, go ahead, Alexander.

Student: Yeah, I had a question. You said that one of the differences is that with z-locks, if the server holding the lock dies, the lock can be revoked. But that only happens with the ephemeral flag, right? So can we just emulate the Go locks by not passing ephemeral?

Instructor: Okay, good — what would happen then? You've created a persistent file, the client dies, and you've got a deadlock: the lock will keep on existing and nobody will release it, because the one client that could release it is dead or crashed. In fact, that's exactly why the ephemeral flag is there.

Student: Is it actually the only one that could release it? Anyone can delete that file, right — you could have some background process do it.

Instructor: But that would break things: maybe the other client is still running and still thinks it holds the lock.

Student: That's true.

Instructor: Now you're basically rediscovering the consensus problem all over again. This is a clean way to get most of it, but not all of it, if you will. And if you want to make a set of writes atomic, you use the ready trick, where you do a bunch of writes and then expose them all at the same time.

Student: Could you explain the soft locks again?

Instructor: Okay. Soft locks means that an operation can happen twice. In the common case, if there are no crashes, it happens once, because the client takes the lock, does the operation, and releases. But if the client failed halfway through, for example, then the lock would be automatically released by ZooKeeper, and maybe a second client will execute the same map task.
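A sketch of that soft-lock pattern, reusing the hypothetical ZK interface and Acquire/Release from the earlier lock sketch (the path layout here is made up, and the task itself must tolerate being run more than once):

```go
package zlock

import "fmt"

// runMapTask illustrates the "soft lock" usage pattern: the lock only makes
// duplicate execution unlikely in the common case. If this worker crashes
// mid-task, the ephemeral lock file disappears and another worker may re-run
// the task, so correctness must not depend on exactly-once execution.
func runMapTask(zk ZK, task int, doMap func(task int) error) error {
	lf := fmt.Sprintf("/mr/task-%d-lock", task) // hypothetical path layout
	if err := Acquire(zk, lf); err != nil {
		return err
	}
	defer Release(zk, lf)
	// doMap must be safe to execute more than once, e.g. by writing its
	// output to a temporary file and renaming it atomically at the end.
	return doMap(task)
}
```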
execute the same 24:36 map task 24:41 so in the case of later election i uh 24:44 what's the intermediate state that could 24:45 get exposed here 24:47 it seems that uh the first yeah okay in 24:49 the pure leader election there would be 24:51 no intermediate state but typically the 24:52 leader will create a configuration file 24:54 right as we saw in zookeeper where you 24:56 know uh 24:57 using the ready trick i see and so you 25:00 just write the whole file and then 25:02 convert it atomically as we 25:03 name it okay thank you um sorry could 25:07 you 25:08 explain what the ready trick is 25:12 i was hoping not to because i think we 25:14 talked about it last time sorry 25:16 all right so uh maybe we 25:19 you can hold that question and well i'm 25:20 happy to do it at the end of the lecture 25:21 again 25:26 because otherwise i have little time to 25:27 actually talk about uh chain replication 25:35 any other last-minute questions 25:40 okay let me set the chain uh the states 25:42 were chain replication a little bit 25:44 and that will also come back to 25:45 zookeeper in some sense 25:47 uh and basically it turns out there's 25:50 sort of two 25:50 common approaches to build replicated 25:53 state machines 25:54 and we really haven't called out these 25:55 two approaches you know we've seen them 25:57 but i've really talked 25:58 explicitly about them and i want to do 26:00 this at this time explicitly 26:01 so 26:05 because there are some interesting 26:06 observations to be made 26:08 approaches to building replicated state 26:10 machines 26:14 and the first one is 26:18 the one we basically have seen in the 26:19 labs which is you run 26:21 all operations 26:25 you know through raft 26:30 graft which you keep a raft or you know 26:32 access whatever 26:33 you know consensus you know distributed 26:36 consensus algorithm that you're using 26:38 and so this is sort of like the key 26:39 value store right in 26:41 lab 3 where you know you do put a get 26:43 operation you run all the put in get 26:45 operations through raft 26:47 and you know the surfaces basically 26:49 update you know the key 26:51 to our state as the operations are 26:53 coming in 26:54 on the applied channel and uh 26:57 you know and basically we have our 26:59 replicated state machine 27:00 so this is sort of like how lab3 works 27:06 it turns out that style where basically 27:09 raft is used to also 27:10 uh run all the operations it's actually 27:12 not that common 27:13 uh we'll see some other designs later in 27:15 semester to do that too like 27:16 spanner does it but it's not actually 27:19 completely 27:20 the standard approach or the more common 27:22 approach actually is to 27:24 have a configuration server like 27:26 zookeeper 27:33 service and the configuration service 27:35 itself internally you know might use 27:37 paxos raft or uh 27:41 uh or zap or whatever and uh and 27:44 really the configuration services really 27:46 plays the role of the 27:48 coordinator or the master like the gfs 27:50 master 27:52 and in addition to basically having 27:54 configuration services actually 27:56 implemented 27:57 uh using you know one of these uh rav 27:59 texas algorithms 28:01 you actually run a primary backup 28:04 application 28:12 and so think about gfs you know or that 28:15 we saw early in the semester 28:17 has that sort of structure right in the 28:18 gfs that was a master 28:20 and that basically determined you know 28:22 which set of servers hold the particular 28:24 chunk 28:24 and so basically 
Then that replica group executed primary/backup replication: one of the chunk servers was the primary and the others were backups, and they had a protocol they used for primary/backup replication. You can think about VMware FT in a similar style: the configuration server is basically the test-and-set server, which records who the primary is, and then the primary and backup have their own protocol — sending operations down the logging channel — so that the primary and backup stay roughly in sync and implement a replicated state machine. This second approach tends to be the more common one, although approach one also happens.

One way to think about why: if the Raft state — for example, our key/value store in the lab — were gigantic, with a huge amount of state, terabytes of key/value data, would Raft really be a good match for that kind of application? What's the risk, or the potential problem?

Student: We flush the log very often, so maybe that could be problematic.

Instructor: That could be problematic, yes. And what's the size of a checkpoint if our key/value server is really big?

Student: It's linear in the size of the key/value data.

Instructor: Yes — so the checkpoints could also be gigantic. Any time a checkpoint has to be sent, it's going to be a big checkpoint, and Raft isn't really set up for that: the leader is going to communicate these snapshots, as you did in lab 2D, to the other servers, and they're going to be big. So you often want a somewhat more clever plan to re-synchronize new servers. That's one reason these systems are often split into two different pieces: a configuration service that is small in terms of state, and a primary/backup scheme that may replicate a huge amount of data. That's one reason you see both approaches.

Does that make sense? I'll come back to this at the end of the lecture, but it's important to keep in mind. So what benefit does approach one give over approach two? You don't have to have two components: in approach one you have Raft, and you run both the operations and the configuration through it, so everything is in a single component; in approach two we have two components — a configuration service that uses Raft, and a primary/backup scheme. Maybe this will become clearer as I talk about chain replication.

Student: Yeah, I had a really quick question. For approach two, what would the advantage be — that consensus is reached through the leader and the leader never fails?
Instructor: The advantage of two, as we'll see in a second with chain replication, is that there's a separate service that takes care of the configuration part, and you just don't have to worry about it in your primary/backup replication scheme. That service decides — like the master in GFS — which set of servers forms a particular replica group, and the primary/backup protocol doesn't have to think about it.

Student: Thanks.

Instructor: And this is a good introduction to chain replication, because chain replication is exactly a primary/backup replication scheme for approach two. That is to say, chain replication assumes there is a configuration service — I think it's called the master process in the paper. Chain replication itself has a couple of cool properties. One: read operations — they call them query operations — involve only one server, namely the tail, as we'll see in a second. Another nice property is that it has a very simple recovery plan — presumably something you've started to appreciate, given how complicated recovery can be in Raft — and we'll talk about all of this in more detail in a moment. It also provides strong guarantees, namely linearizability, for the put and get operations. And finally, since a lot of people ask this: it's a reasonably influential design, used by quite a number of systems; it has been used in practice. I'm going to talk about each of these pieces in a little more detail, and then we'll come back to the approach-one-versus-approach-two comparison.

So, as an overview of the lay of the land: there is a master process, or configuration service, and it keeps track of which servers belong to a particular chain — say S1, S2, S3 — basically a record of what the chain is, who the head is, and who the tail is. That's the configuration server, and then we have our servers, S1, S2, S3. One of them is the head, typically the one with the smallest number, and one is the tail.

If we have a client, the client may talk to the configuration server to learn who is part of the chain, and then it sends its write request to the head. That's the protocol in chain replication: a write request always goes to the head. The head applies the operation to its state — maybe it has a disk associated with it where it stores the key/value data — and then it sends the update, the result of the operation, down the chain, in FIFO order and reliably. So S1 sends the update to S2.
S2 applies the operation — the state change — to its own state (and its disk, if it has one), and once it has applied it, it forwards it to the next node in the chain, which here is the last one, because there are only three nodes in this particular chain; you could have longer chains if you want more availability. When the last node gets the message — the state change — it applies it to its state, and then it is the one in charge of sending the acknowledgement back to the client. So it's the tail that sends the acknowledgement back.

One way to think about this is that when the tail — in this case S3 — applies the state change, that is the commit point. The reason it's the commit point is that subsequent reads always come from the tail: if any other client does a read operation, it always goes to the tail, and the tail responds immediately. So reads go to the tail: here client 1 writes, and client 2 does a read; the read goes to the tail, the tail responds, and that's it.

There are a couple of things I want to point out. One interesting point is that read operations involve only one server. If you remember from lab 3 — or if you're starting lab 3 — read operations in our implementation go through the Raft log and all that. The Raft paper discusses an optimization, but even then the read operation always goes to the leader, and the leader first has to contact a majority of the servers before it can execute the operation locally. What you see here is that reads go to a completely different server from writes, so the read and write workload is spread over at least two servers. Furthermore, a read involves only one server: the tail never has to talk to any other server, it can just respond immediately — and we'll see a bit later what further optimizations that allows. And the commit point is really the point at which the write happens at the tail, because that is when the write becomes visible to readers, and not at any earlier point.

This also gives us linearizability. It's pretty easy to see that, in the case of no crashes, this scheme guarantees linearizability: the writes are all applied in some total order at the head, and when the tail receives an update — the commit point — it responds to the client; if that same client immediately does a read, the read goes to the tail and observes the last change.
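A minimal sketch of this write path — a single-process toy, assuming reliable in-order delivery and modeling the tail's acknowledgement as a channel, so not the paper's actual protocol code — might look like this in Go:

```go
package chain

import "sync"

// Server is a toy chain node. Real chain replication would use reliable
// FIFO network channels (e.g. TCP) between servers plus a configuration
// service; here "next" is just a pointer to the successor, nil at the tail.
type Server struct {
	mu   sync.Mutex
	kv   map[string]string
	next *Server
}

type Update struct {
	Key, Val string
	Ack      chan struct{} // closed by the tail: the commit point
}

// Write is called on the head by clients; it returns only once the tail
// has applied the update (i.e. once the write is committed).
func (s *Server) Write(key, val string) {
	u := &Update{Key: key, Val: val, Ack: make(chan struct{})}
	s.apply(u)
	<-u.Ack
}

// apply applies the update locally, then forwards it down the chain in
// order; the tail acknowledges instead of forwarding.
func (s *Server) apply(u *Update) {
	s.mu.Lock()
	if s.kv == nil {
		s.kv = make(map[string]string)
	}
	s.kv[u.Key] = u.Val
	s.mu.Unlock()
	if s.next != nil {
		s.next.apply(u)
	} else {
		close(u.Ack) // commit point: the write is now visible to readers
	}
}

// Read is served only by the tail, from its local state, with no
// communication with the rest of the chain.
func (tail *Server) Read(key string) string {
	tail.mu.Lock()
	defer tail.mu.Unlock()
	return tail.kv[key]
}
```

The key design point the sketch tries to show is that the acknowledgement originates at the tail, which is exactly what makes the tail's apply the commit point.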
So certainly within a single client, all operations are totally ordered, and it's pretty easy to see across clients too: if client 2 starts a read after client 1's write has finished — and it's finished when the tail has responded — then any read that starts after a write will observe the result of the most recent write. So it's easy to get the intuition that this provides linearizability.

Okay. What I'd like to do now is take a quick breakout-room section, where I'd like you to discuss the question that was posted for this lecture: what could go wrong — would it break linearizability — if, instead of having the tail respond to the client, the head responded to the client immediately after it received the write request? Maybe that's a good topic to debate a little bit, and if you want to go in any other direction and talk about chain replication more broadly, you're welcome to, but that's something to start with. Let's take a five-minute breakout room. Jose, are you going to set it up? Do I have to do anything to enable that?

TA: I don't think it's necessary; I think Zoom changed, so it should be possible now.

Instructor: Yep, that's right.

[Breakout rooms, about five minutes.]

Instructor: Okay, are we coming back?

TA: Yeah, whenever you're ready.

Instructor: Okay, I think I can close the rooms. Good. So, just very quickly, to summarize why that would break linearizability: the protocol change that was contemplated was to keep propagating to S1, S2, and S3, but as soon as S1 is done with its part, it responds back to the client. Clearly that would break linearizability: say the client does a write and gets the acknowledgement back from S1 while the write is still in progress at S2 and S3. Maybe before S2 even contacts S3, the client sends a read operation to S3, and of course it will return the value from before the write. So the client doesn't even observe its own write, and that clearly breaks linearizability. So it's very important, as I said earlier, that the tail sends the acknowledgement back to the client, because the point where the tail has processed the write is really what the commit point is.

Any questions about that?

Okay. So that's normal operation, and now I want to talk a little bit about crashes — since this is 6.824, distributed systems, all the action is when failures happen. One of the things that's cool about chain replication is that the number of failure scenarios is actually quite limited.
There are basically three cases: the head fails, one of the intermediate servers fails, or the tail fails. Let's look at each of those cases. Here's the setup: we have a head, S1; say it has applied three updates, U1, U2, and U3. It talks to S2, and maybe S2 has applied U1 and U2. And we have S3, which is the tail, and it has only applied U1 so far. The client was talking to S1, and we now want to think about what needs to happen if one of these servers crashes.

Let's start with the case where the head crashes. What needs to be done? Is this an easy case or a hard case?

Student: I hope it's an easy case: you can just cut off the head and make S2 the head.

Instructor: Yes, we can just promote S2. What happens is that the configuration server discovers — decides — that S1 is gone, and then it promotes S2 to be the head for subsequent operations, and clients in the future talk to S2. And why is this correct? What operation have we lost?

Student: We lost U3.

Instructor: Is that a problem? Is it valid to lose operations?

Student: Yes, it's fair game to lose U3.

Instructor: Correct — U3 has not been committed, because only operations that have reached the tail are committed, so it's just as if the operation never happened; the client could not even have observed that U2 or U3 happened. So it's perfectly fine to do this.

Why is it important that the configuration server is involved here? Could S2 decide on its own to become the head — say S2 can't talk to S1 anymore and decides, whatever, I'll become the head. Would that be valid?

Student: Wouldn't that maybe create a split brain?

Instructor: Yes, that would create split brain, because S2 might just be partitioned from S1, and now both are heads, and maybe both are processing commands, so we violate this whole property of having a total order.

Student: Does S2 even know that S1 is the head?

Instructor: It got that from the configuration information the previous time around: when the configuration service decides on a new configuration, it tells all the servers, and the clients that care, here's the new configuration.

Student: Wait, does this only happen when the S1-to-S2 connection is severed? What causes the split brain again?
Instructor: Split brain would happen if S2, on its own, decided that S1 had failed and became the head. We're not allowed to have that happen, and the way it works out in practice is that there is a configuration server that decides what the current configuration actually is. If it decides that S1 is dead, it informs S2 and S3: you two are now the new chain, and S2 is the head. And when that change happens — S1 is dropped — nothing else has to be done, because the only update we lost is the one that was not committed anyway, so there's nothing further to repair. Going from three replicas to dropping the head is a pretty straightforward operation.

Student: I have a question. There's an assumption here that the commands that leave S1 arrive in order at S2. Is that a reasonable assumption?

Instructor: The paper basically says we need a reliable FIFO channel between S1 and S2, and from S2 to S3, and I think the way they implement that is probably with a TCP connection.

Student: Okay, thanks.

Instructor: Okay, let's look at the second case. We have S1, S2, S3 — there could be more servers in the chain, but three is enough to consider all the cases — and now we take the case where a middle server crashes. S2 crashes, and the configuration server at some point decides that S2 has crashed and informs S1 and S3 that they form the new chain. What else needs to happen? In the first case, when the head dropped, nothing had to be done other than updating the chain. Now we're updating the chain again, and the question is whether anything else needs to happen.

Student: S1 needs to send S3 the requests that it sent to S2 but that didn't make it to S3.

Instructor: Exactly right. S1 has U1, U2, and U3; S2 had seen U1 and U2; S3 has seen only U1, with U2 still in progress. So S1 has to bring S3 up to date with U2 and U3. So there's actually a little bit of work involved.

Let's consider the final case. Again we have the three servers S1, S2, S3, and now the tail crashes. At some point the configuration server notices and decides that the new chain is S1 and S2, and tells S1 and S2 that they're part of the new chain. What else needs to happen? Let's write it down: S1 has seen U1, U2, and U3; S2 has seen U1 and U2. Who becomes the new tail in this scenario?

Student: S2.

Instructor: Yes, S2 becomes the new tail. Anything else that needs to happen?

Student: I guess the client needs to be informed that S2 is the tail.

Instructor: Yes, and the client learns that from the configuration server. But nothing else has to happen, because no committed operations are lost — and U3 still needs to be propagated from S1 to S2, which will just happen through the normal protocol.
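Putting the three failure cases together, here is a sketch of what the configuration service's side might look like; the Config type, its hooks, and the helper are hypothetical names for illustration, not the paper's master protocol.

```go
package chain

// Config is a hypothetical stand-in for the configuration service; a real
// implementation would RPC the servers instead of calling function fields.
type Config struct {
	tellToResend func(pred, succ string) // pred re-sends missing updates to succ
	announce     func(chain []string)    // tell servers/clients the new chain
}

func index(chain []string, id string) int {
	for i, s := range chain {
		if s == id {
			return i
		}
	}
	return -1
}

// handleFailure removes the dead server from the chain and does whatever
// repair that position requires, then announces the new chain.
func (cfg *Config) handleFailure(chain []string, dead string) []string {
	i := index(chain, dead)
	newChain := append(append([]string{}, chain[:i]...), chain[i+1:]...)

	switch {
	case i == 0:
		// Head failed: just promote the next server. Any update the old
		// head had accepted but not yet propagated was never committed
		// (it never reached the tail), so losing it is fine.
	case i == len(chain)-1:
		// Tail failed: its predecessor becomes the tail. Nothing committed
		// is lost, because the predecessor has seen at least every update
		// the old tail had applied.
	default:
		// Middle server failed: its predecessor must re-send to its
		// successor the updates (U2 and U3 in the example above) that the
		// successor has not seen yet, then resume normal forwarding.
		cfg.tellToResend(chain[i-1], chain[i+1])
	}
	cfg.announce(newChain)
	return newChain
}
```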
Okay, so dropping the tail is also reasonably straightforward. Dropping the tail and the head are reasonably straightforward, and dropping the middle one is a little more complicated, but not much more. The key thing I want to emphasize here is: how does this compare to figures 7 and 8 in the Raft paper?

Student: ...entries that have been automatically committed?

Instructor: Sorry, I didn't hear you — it was a pretty noisy connection.

Student: I was just saying that if S2 becomes the new tail, don't we have to send acknowledgements back to the client, since some entries have been automatically committed?

Instructor: Yes, that might be the case. What will happen is that the client is probably going to retry, and we have to have a separate deduplication scheme anyway, like in lab 3. There are probably a couple of different ways to handle it; the paper is actually not particularly clear about which one it takes.

Student: Thank you. So in that case the paper just says that even if it doesn't respond, the operation could or could not have succeeded?

Instructor: Right. Okay — back to my original question: how does this picture on the whiteboard contrast with figures 7 and 8?

Student: It's simpler.

Instructor: Yes, and that's the key point I wanted to get across: there aren't that many cases to consider here, basically three, which is quite a bit simpler than the Raft paper, where there are many configurations to consider and the scenarios are quite complicated. Part of that is because it's a chain: things are pushed down the replication chain in a very straightforward manner. And part of that is, of course, that the configuration part is outsourced to the configuration manager. But for the primary/backup part, the recovery plan is reasonably straightforward: there are only three configurations to consider.

Now, one more point I want to make: how to add a replica. In any system that you're going to run for real, at some point you have to add new servers, because otherwise you keep losing them: you start with three, then you have two, then one, then zero, and then you're unavailable. So you have to be able to add new replicas. Let's consider the case where S1 is the head and S2 is the tail, and we want to bring up S3. It turns out, as the paper describes, that it's most convenient to do this at the tail end: make the new server the new tail.
The way that proceeds is: the client is talking to S2, because S2 is the current tail. S3 comes up, and the first thing that happens is that all the state is copied from S2 to S3. This may take tens of minutes, or maybe multiple hours, if we're copying gigabytes or terabytes of data from S2 to S3. While that's happening, S2 can keep serving requests; it does have to remember which updates came in after S3 started copying — it keeps a list of all the updates that have happened but have not been propagated to S3 yet. At some point S3 is done with the copying and tells S2: okay, I'm ready to become the tail, I have the whole state. It sends a message to S2 saying, I want to become the tail, and S2 responds: that's okay, but first apply all of these updates — and sends the pending updates along in response. S3 applies the updates and then becomes the tail, and clients that were talking to S2 can be told by S2: from now on I'm not the tail anymore, you should talk to S3. So they switch over. That's how you add a replica to a chain.

Student: A question on this: don't you run into an infinite-loop problem, where S2 sends updates to S3, and while S3 is applying them it's also serving more requests, and so it has more updates to send, and it goes back and forth?

Instructor: No. Once S2 has sent S3 the updates that S3 has not seen yet, from then on it's normal chain replication: whenever S2 gets an update from S1, it forwards it to S3.

Student: Right, but S3 can't become the tail until it has successfully processed all of those updates.

Instructor: Yes — once the TCP channel is set up, S2 can just say: once you have processed these, you can become the tail, because then you've seen everything; everything else can be pipelined after that on the same TCP channel.

Student: It could become the tail right away — even before it has processed the updates — as long as it doesn't serve requests?

Instructor: As long as it doesn't serve requests, exactly right. It just has to process all the updates that S2 received and S3 did not; once it has applied those, it becomes the tail and starts processing requests.

Student: I see, so it blocks requests for a moment while it processes the new updates. Got it.

Instructor: Exactly.
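A sketch of that catch-up step, with all the locking and networking omitted — the types and fields here are hypothetical, and Update is the type from the write-path sketch above:

```go
package chain

// currentTail stands in for the existing tail: it keeps serving while the
// bulk copy runs and remembers the updates it applies in the meantime.
type currentTail struct {
	snapshot func() map[string]string // bulk copy; may take a very long time
	pending  []Update                 // updates applied since the copy started
}

// joiningServer is the new server being added at the tail end of the chain.
type joiningServer struct {
	kv map[string]string
}

func (n *joiningServer) becomeTail(t *currentTail) {
	// 1. Long-running copy of the whole state; t keeps serving reads and
	//    keeps appending newly applied updates to t.pending.
	n.kv = t.snapshot()

	// 2. "I want to become the tail": t replies with the updates the new
	//    server has not seen, and the new server applies them before it
	//    serves anything.
	for _, u := range t.pending {
		n.kv[u.Key] = u.Val
	}

	// 3. Only now does n answer reads as the tail; t stops acting as the
	//    tail and clients are redirected (via the configuration service).
}
```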
Okay, so now I want to come back to a question a lot of people asked: how do the chain replication (CR) properties compare — what are the good properties — mostly in comparison to Raft? Of course, I have to say up front that chain replication implements only the primary/backup scheme, not the configuration service; we'll come back to that in a bit more detail. But a couple of things we can note if we just compare the way the Raft protocol works with the chain replication protocol.

First, a positive aspect of chain replication is that the client RPCs are split between the head and the tail. The load of serving client operations can be split between the two of them; they don't all have to run through the leader, as in Raft.

Furthermore, the head sends each update once. Unlike in Raft, where the leader sends the log entries to every peer, in this scheme the head sends basically one RPC, so there are fewer messages involved.

Reads — query operations — involve only the tail. In Raft, even if you implement the read-only optimization, which avoids having the read operation go through the log and be appended at all the peers, the leader still has to contact a majority of the peers to decide whether the read can be served.

And another positive aspect is the simple crash recovery we just talked about.

But a major downside, compared to the Raft scheme, is that one failure requires a reconfiguration. The reason a reconfiguration is required is that a write has to go through the whole chain, and the write cannot be acknowledged until every server in the chain has processed it. That's different from Raft, as you well know: as soon as a majority of the peers have accepted a particular write operation and appended it to their logs, the system can proceed, so there's no interruption at all if one server fails, as long as the remaining servers still form a majority. In chain replication, if one server fails, a reconfiguration has to happen, which means there's going to be a short period of, probably, downtime. Does that make sense?

Now I want to make one more point, again in contrast to the Raft replication scheme: because read operations involve only one server, there's a cool extension that gets really high read performance. The basic idea is as follows.
68:34 Now I want to make one more point, in contrast to the Raft replication scheme: because read operations involve only one server, there is a cool extension that gets really high read performance. The basic idea is as follows.

69:10 The basic idea is to split the objects, or volumes as they are called in the paper, across multiple chains. Instead of having one chain, as on the previous boards, we are going to have multiple chains. For example, we might have chain 1, where S1 is the head, S2 is the middle server, and S3 is the tail. In chain 2 we rotate things around: S2 is the head, S3 is the middle server, and S1 is the tail. And in chain 3, S3 is the head, S1 is the middle server, and S2 is the tail. Then we split the objects across these chains: the configuration server has a map saying that objects in shard 1 go to chain 1, objects in shard 2 go to chain 2, and objects in shard 3 go to chain 3.

70:27 The cool part is that we now have multiple tails: S3 is the tail for one chain, S1 is the tail for another, and S2 is the tail for a third. Read operations for these different chains can be executed completely in parallel, so if the reads hit the different shards roughly uniformly, read throughput increases linearly with the number of tails; in this case we have three tails, so we get three times the read performance.

71:01 So we get a bit of the same property that ZooKeeper had, where read performance can be excellent and scale with the number of servers. But we don't only get the scaling: we also maintain linearizability. In this scheme we don't have to give up on it. So we get both nice properties, namely good read performance that scales with the number of servers, at least for reads that go to different chains, and we keep linearizability.

71:43 Any questions about this?

71:49 Sorry, in this case, when the clients are deciding which chain to read from, can they decide that themselves, or do they need to contact the configuration server?

72:05 That is a great question. The paper isn't really explicit about it; they talk about maybe going through a proxy to the servers. What you will do in lab 4 is download the configuration, which includes the shard assignment, from the configuration server.
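As a rough sketch of the kind of configuration a client might download in a lab-4-style design: shards map to chains, the same three servers appear in rotated head/middle/tail roles, and a client sends writes to the head and reads to the tail of the chain that owns the shard. The type and field names (Config, Chain, ShardToChain, and so on) are invented here; neither the paper nor the lab prescribes this exact structure.

```go
// Sketch of a downloadable configuration: shards map to chains, and chains
// reuse the same three servers in rotated head/middle/tail roles.
// All names here are invented for illustration.
package main

import "fmt"

type Chain struct {
	Head, Middle, Tail string
}

type Config struct {
	Num          int           // configuration / view number
	Chains       map[int]Chain // chain id -> server roles
	ShardToChain map[int]int   // shard id -> chain id
}

func (c Config) HeadFor(shard int) string { return c.Chains[c.ShardToChain[shard]].Head }
func (c Config) TailFor(shard int) string { return c.Chains[c.ShardToChain[shard]].Tail }

func main() {
	cfg := Config{
		Num: 1,
		Chains: map[int]Chain{
			1: {Head: "S1", Middle: "S2", Tail: "S3"},
			2: {Head: "S2", Middle: "S3", Tail: "S1"},
			3: {Head: "S3", Middle: "S1", Tail: "S2"},
		},
		ShardToChain: map[int]int{1: 1, 2: 2, 3: 3},
	}
	// Reads for different shards land on different tails and can proceed in parallel.
	for shard := 1; shard <= 3; shard++ {
		fmt.Printf("shard %d: writes -> %s, reads -> %s\n",
			shard, cfg.HeadFor(shard), cfg.TailFor(shard))
	}
}
```

Because each chain has a different tail, reads for different shards land on different servers and can proceed in parallel, which is where the roughly linear read scaling comes from.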
72:28 You would need to be careful about how you order the servers in each of the chains, to prevent a particular chain, or a particular link between two servers, from being oversaturated.

72:38 Yeah, this scheme doesn't really take that into account. You can imagine that the configuration manager has a sophisticated model of how the network is actually laid out and is very careful about how the chains are arranged, maybe even assigning more shards to one chain and fewer shards to another. All of that is possible in principle, because the configuration manager can compute any assignment it likes and simply announce, here is the assignment. It can even rebalance if it wants to.

73:08 Thank you.

73:18 Could you explain again how linearizability is kept under this extension?

73:21 Well, nothing has really changed. We are still doing primary-backup using a chain, so we carry over the linearizability from the single chain, and that's it.

73:42 This might be speculative, but how does this compare to, I guess maybe it's equivalent to, having a group of servers for each step in the chain, instead of reusing the same servers and entering from different points? So S1 would be three servers, S2 would be three servers, and so on.

74:06 What would be the advantage of the scheme you are imagining?

74:10 Just scalability, while also maintaining linearizability.

74:17 Well, the reason this scheme is attractive is that the tail may carry quite a bit of load while the middle server doesn't, and by using this arrangement we spread the load across all the servers.

74:30 I see, okay.

74:40 Okay, good. Maybe I will summarize here. We saw approach one, which we do in lab 3: we run all the operations through Raft, and the configuration and the replication are all built using Raft, with nothing else involved. Then there is approach two, which is the topic of this particular paper: a configuration server, perhaps built using Raft or Paxos or what have you, plus a primary-backup replication scheme, here primary-backup using chain replication.

75:39 Hopefully this lecture makes it clear that there are some attractive properties to approach two, in the sense that you can get scalable read performance on the primary-backup groups. Of course not on the configuration server, because it runs Raft just as you do in approach one, but you get, at least potentially, scalable read performance for the operations on the replicas, that is, on the primary-backup scheme, like the put and get operations. The other nice thing is that if your data is very large, you can use more specialized synchronization schemes to copy the state from one machine to another.
76:26 Chain replication, or really any primary-backup scheme that is separated from the configuration server, lets you do that easily. So it is quite common in practice that people adopt approach two, although it is also not impossible to use approach one for your replicated state machine, including servicing operations like put and get; that is in fact what you are doing in lab 3. We will see a paper later in the semester, Spanner, that uses Paxos to perform the operations as well.

77:03 Any further questions? If not, then I wish you all good luck on the midterm on Thursday, and I will see you in person, well, virtually in person, next week. If you have any questions, please feel free to hang around and I will do my best to answer them.

77:37 I have a question about something you mentioned about Raft. You mentioned that all of the reads have to go through a majority of servers, but I am not quite sure I understand why, because the leader has all of the committed entries, right?

77:52 There are two schemes. Either you run in the situation where all reads are served by the leader, or, in principle, you can serve a read operation from another peer, but then you have to contact at least a majority of the servers to make absolutely sure that you have the latest operations.

78:14 Got it. So that requirement is only if we want to spread the reads across every peer; then we have to be more sophisticated, and we cannot just answer on our own, because that would directly violate linearizability. But if everything goes to the leader, we're fine?

78:31 Then you are golden, correct, except that you have to do the trick where the leader reaches an empty agreement, a no-op, at the beginning of every new term, just to make sure that it actually is up to date.

78:45 Okay, thank you.
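As a hedged illustration of that answer, here is a small sketch of the two options for serving reads linearizably from a Raft-based service: a leader may answer from its own state once it has committed an entry (for example the start-of-term no-op) in its current term, and otherwise it, or any other peer, has to confirm with a majority before answering. The struct and method names are made up and the majority check is only a stub; this is not the lab's API, and it glosses over details of the real read-only optimization.

```go
// Sketch of the two ways just described for serving reads from a Raft-based
// service. All names are invented, and the majority round trip is only a stub.
package main

import "fmt"

type raftPeer struct {
	isLeader          bool
	currentTerm       int
	lastCommittedTerm int // term of the most recently committed log entry
}

// Scheme 1: the leader answers reads from its own state, but only after it has
// committed an entry in its current term (the start-of-term no-op), so it knows
// its state reflects every write acknowledged by earlier leaders.
func (p *raftPeer) canAnswerLocally() bool {
	return p.isLeader && p.lastCommittedTerm == p.currentTerm
}

// Scheme 2: otherwise the peer first confirms with a majority that it is not
// behind before answering. Here this is just a stand-in for that round trip.
func (p *raftPeer) confirmWithMajority() bool {
	return true // stand-in for a heartbeat / quorum exchange
}

func (p *raftPeer) read(key string, state map[string]string) (string, bool) {
	if p.canAnswerLocally() || p.confirmWithMajority() {
		return state[key], true
	}
	return "", false
}

func main() {
	leader := &raftPeer{isLeader: true, currentTerm: 7, lastCommittedTerm: 7}
	v, ok := leader.read("x", map[string]string{"x": "1"})
	fmt.Println(v, ok)
}
```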
78:49 Could you quickly go over again what happens when you are adding a new server at the tail? Just to make sure I understand: essentially it starts a process of copying all the data from S2 to S3, and if S3 receives requests for any of that data while the copy is still happening, then S3 asks S2 directly for anything S2 still has, gets it, and responds; and it keeps doing that until it gets to data that S2 no longer has outstanding, and then it just goes live, essentially?

79:21 Yep, although you could do it slightly differently: S3 could tell S2 that S3 is becoming the tail, and then simply not process any operations from clients yet until it has received the remaining operations from S2.

79:46 Oh, so in that case S2 is still the tail?

79:51 Yeah, until, you know, S3 gets everything.

79:54 Okay, thank you.

79:57 The paper basically describes one particular way of doing it; there are a couple of ways of doing it.

80:04 But if you do it that way, how long do you wait to get everything? I think I have the same confusion as someone else.

80:16 Well, you know in what order the switch happens. For example, say S2 has update operations through 100. You start the copy operation, much like with snapshots in Raft, and when the copy operation is done, S3 is up to date through 100. By then S2 may already have done ten more operations, so it has 101, 102, and so on up to 110. S3 then contacts S2 and says, give me your remaining operations, and S2 says, my remaining operations are 101 through 110; as a side effect, S3 also tells S2 to stop being the tail. S2 responds with those operations, S3 applies 101 through 110, and then it answers clients. In the meantime it is the tail, but it does not process any commands or read operations from clients until it has actually processed 101 through 110.

81:21 Okay, I see.

81:33 My question is a little similar to the extension he talked about. Could you do a tree instead of a chain?

81:46 I think there are other data structures possible. For example, a number of people proposed in their emails that you could have S1, then intermediate servers, say S2 through S5, that S1 talks to in parallel, and then the intermediate servers all talk to the tail. Is that what you mean by a tree?

82:17 I meant more that there would be a number of leaves, all at roughly the same height, like a balanced tree, and then the leaves would have a chain going through them. I think linearizability can be broken here if you think harder about it, but it would have the nice property that the propagation delay would be logarithmic instead of linear, as it is here, and you could read from all the leaves.

82:54 Reading from all the leaves is dangerous, correct? Because one client might have talked to another leaf earlier, and those leaves might not be in sync, so that sounds dangerous to me; but maybe your scheme is a little more sophisticated than what I am thinking of. The depth of the tree, or the depth of the chain, is really governed by the mean time between failures: if you typically run with three, or three to five, servers because that is good enough for your availability, since you can recover from failed servers before the whole chain is down, then that really governs the depth of the chain. And yes, that will introduce some latency, but chains will generally be short.

83:49 Right, okay, that makes sense. Thanks.
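Returning to the tail-addition walkthrough from a couple of questions back (copy a snapshot through operation 100, then fetch 101 through 110 from the old tail before serving clients), here is a minimal sketch of that catch-up. Everything lives in one process, with slices standing in for the snapshot transfer and the RPCs, and all names are invented; the paper describes this at a higher level and, as noted above, there are several reasonable variants.

```go
// Minimal sketch of bringing a new tail (S3) up to date, along the lines of
// the walkthrough above. Names and structure are invented for illustration.
package main

import "fmt"

type op struct {
	seq int
	cmd string
}

type node struct {
	name string
	ops  []op // applied update operations, in order
}

// snapshotUpTo hands over a copy of everything applied so far (ops 1..n).
func (n *node) snapshotUpTo() []op { return append([]op(nil), n.ops...) }

// remainingAfter returns the operations the old tail applied after the
// snapshot was taken (e.g. 101..110); this is also the point at which the
// old tail would be told to stop acting as the tail.
func (n *node) remainingAfter(seq int) []op {
	var rest []op
	for _, o := range n.ops {
		if o.seq > seq {
			rest = append(rest, o)
		}
	}
	return rest
}

func main() {
	s2 := &node{name: "S2"}
	for i := 1; i <= 100; i++ {
		s2.ops = append(s2.ops, op{seq: i, cmd: fmt.Sprintf("put-%d", i)})
	}

	// 1. Copy the snapshot to the new tail; S2 keeps serving in the meantime.
	s3 := &node{name: "S3", ops: s2.snapshotUpTo()}
	for i := 101; i <= 110; i++ { // S2 applies ten more updates during the copy
		s2.ops = append(s2.ops, op{seq: i, cmd: fmt.Sprintf("put-%d", i)})
	}

	// 2. S3 asks S2 for whatever it is still missing, applies it, and only then
	//    starts answering client reads as the new tail.
	s3.ops = append(s3.ops, s2.remainingAfter(len(s3.ops))...)
	fmt.Printf("%s is caught up through op %d\n", s3.name, s3.ops[len(s3.ops)-1].seq)
}
```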
83:58 Is this the only case where the entire chain would go down, if all of the servers in the chain went down?

84:04 Yep.

84:07 Thank you.

84:10 I was also curious how you maintain strong consistency when S1, S2, and S3 can all serve reads, as on this slide.

84:24 You get strong consistency per shard, or per object that is assigned to a chain. If you write object 1 and then read object 1, all of those operations go through the same chain, so you get strong consistency for that particular object.

84:40 Oh, got it. But that may not mean that across all the objects we have strong consistency?

84:46 Let me hesitate on that; I think it may require more machinery. What does it mean, across all the objects? Say you read and write object 1, then you read and write object 2, and then some client reads both object 1 and object 2: is that client guaranteed to see a total order, that is, linearizability across them? Serializability is slightly different; let's not talk about serializability, we will get to that in a couple of weeks. I don't want to make a commitment right now; I need to think about it a little.

85:45 Okay, that's totally fair.

85:48 So the question is: you have linearizability for a single object, even with multiple clients doing operations on that same object; that is guaranteed in this scheme. The question is whether you also have linearizability across objects.

86:14 But why is that important? I don't see where that would matter, because you can't group operations, right?

86:28 Right, but, I mean, you read object 1, you write object 1, then you read object 2, you write object 2, and for linearizability those operations need to be in a total order that preserves the real-time property. Since different chains are involved here, that might actually not hold; but I don't want to commit to a statement about it across chains. Within a chain it is absolutely guaranteed to be linearizable, even if you have different objects on that chain.

87:08 There's something I don't understand in the paper, which is the update propagation invariant, where, going along the order of the chain, the committed operations at one server are a prefix of its successor's commits. Is that guaranteed only after a full pass has gone through the chain?

87:34 Well, it's always true. If you go back to this picture here, I think they are making a very simple observation.
87:44 Let me see if I can find a good picture; I have probably scribbled over everything, so it may not be as clean. Basically what they are saying, if you look at this figure, is that S3 always has a prefix of S2's history, and S2 always has a prefix of S1's history. That is the only thing the invariant says.

88:09 Oh, so the successor has a prefix of the predecessor.

88:17 Yeah. And this is slightly confusing; I only realized it later, after somebody else asked this question. In the paper, i and j show up in two places, once in a definition and once in the invariant itself, and you have to be a little careful, because the roles of i and j are the other way around between the two.

88:42 Yeah, exactly. Thank you.

88:45 You're welcome.

88:49 Sorry, go ahead.

88:54 I was just going to ask: what happens when you have a network partition instead of a crash? If you go to the crash slide, what happens to the chain if there is a network partition? Say S2 is actually still alive, but there is a partition between the configuration manager and S2, and so now both S1 and S2 are pointing to S3.

89:23 Presumably, and I think the paper doesn't talk about this, but I presume that all configurations are numbered, like a view number, and S3 will not accept any commands from S2 if the view numbers don't match.

89:39 Got it, thank you. Related to that, one thing I couldn't figure out is, even with configuration numbers or something, how do you make sure, when you get rid of the tail as in the third scenario you have drawn, that all the clients that might issue a read are aware that this old server is no longer the tail?

89:58 I think the way you would do it is that when the clients download the configuration from the configuration manager, it also includes the view number, and every operation includes the view number, so S3 will see, hey, that's an old view number, I won't talk to you.

90:13 I guess, when does the client then talk to the configuration server to get the new view number?

90:22 For example, the server could just reply, retry, and the client would then go back to the configuration server and re-read the state.

90:32 I guess what I am worried about is this: S3 has been partitioned away from the coordinator, so the coordinator removes S3 as the tail and increases the version number, but some client out there doesn't find out that the version number increased, still thinks S3 is the tail, talks to S3, and does a read, while meanwhile other clients are doing writes to S1 and S2 that S3 hasn't seen.

90:56 Yeah, that is probably the reason why, in the paper, requests go through the proxy.

91:02 I see, okay.
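Here is a small sketch of the view-number check being assumed in that exchange. The paper itself does not spell this mechanism out; the sketch just assumes, as in the discussion above, that every configuration carries a number and that every client request is tagged with the number the client last downloaded. All names are invented.

```go
// Sketch of a view-number check: a server rejects any request tagged with a
// configuration number different from its own, so a client holding a stale
// configuration is told to re-fetch it rather than getting a stale answer.
package main

import (
	"errors"
	"fmt"
)

var ErrStaleConfig = errors.New("stale configuration: re-fetch from the configuration manager")

type server struct {
	configNum int // the configuration number this server is currently acting under
	isTail    bool
}

func (s *server) handleRead(clientConfigNum int, key string, read func(string) string) (string, error) {
	if clientConfigNum != s.configNum || !s.isTail {
		return "", ErrStaleConfig
	}
	return read(key), nil
}

func main() {
	s3 := &server{configNum: 2, isTail: false} // S3 was removed as tail in configuration 2
	if _, err := s3.handleRead(1, "x", func(string) string { return "old" }); err != nil {
		fmt.Println("client must retry:", err)
	}
}
```

As the question points out, this check alone does not help if both the client and the removed tail are still on the old configuration, which is one reason to route requests through a proxy that tracks the current configuration.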
91:07 I have a question going back to the earlier question about cross-object linearizability. Is that a whole other can of worms that we haven't really talked about? If you want to do, I don't know what the right term is, transactions across multiple pieces of state, like an operation where you set a to 1 and b to 2 and you should only see those together or not at all, the atomicity of that, have we talked about that in any of the things we've seen before?

91:43 No, and we will talk about it within a couple of weeks. That is going to be a big topic: basically, how to do transactions.

91:50 Okay, that's good, thanks.

91:54 Do you mind going back to the third slide, or maybe the fourth? I was a little bit confused when you mentioned that if the lock holder fails, there is intermediate state, and so on. What exactly on this slide applies to z-locks and what applies to Go locks?

92:22 These are almost all statements about z-locks.

92:29 So is the first statement that if the lock holder fails, the intermediate state is not cleaned up, or is cleaned up?

92:37 The intermediate state is visible, but then, for example, if you have a leader election, you could clean up that intermediate state; that was the point.

92:46 Oh, so with Go locks, is that not also the case? If there is a machine that is holding a Go lock and doing stuff, and then it all of a sudden dies, isn't the intermediate state still visible?

92:59 What I am talking about with Go locks is something that concerns multiple threads running on the same machine. If the Go lock disappears because the machine crashes, all the threads on that machine crash too.

93:20 Right, but when you say that the intermediate state is visible to other people, isn't that still true for Go locks?

93:30 Well, only if they had written persistent state to disk or into some shared file system; otherwise no, the machine is gone, the disk is gone, everything is gone.

93:40 Oh, got it. Okay, so it's saying that the intermediate state is persistent?

93:45 The ZooKeeper intermediate state might be visible, right: the lock holder might not have gotten around to deleting things, it might have created some more files, and those are visible now.

93:54 Okay, I see, thank you. So just to follow up on that: is the implication that if a goroutine ever dies while holding a lock, the entire Go program must have died too, and you can never have a goroutine die holding a lock while other parts of the program continue?

94:13 Right, the goroutine crashes and the application crashes.
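As a tiny illustration of that last point (this example is mine, not from the lecture): an unrecovered panic in any goroutine terminates the whole Go process, so a goroutine cannot die while holding a sync.Mutex and leave the rest of the program running to observe its half-finished in-memory state. Only state that was already written somewhere persistent or shared, as with the ZooKeeper znodes above, remains visible.

```go
// An unrecovered panic in any goroutine takes down the entire Go process, so a
// goroutine cannot "die while holding a sync.Mutex" and leave the rest of the
// program running to see half-finished in-memory state.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex

	go func() {
		mu.Lock()
		// Simulate a crash while holding the lock. There is no recover(), so the
		// runtime terminates the whole process, not just this goroutine.
		panic("worker died while holding the lock")
	}()

	time.Sleep(100 * time.Millisecond)
	mu.Lock() // never reached: the process has already exited
	fmt.Println("unreachable")
}
```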