Let's imagine three servers with logs that look like this, where the numbers I'm writing are the term numbers of the commands in those log entries; we don't really care what the actual commands are. I've also numbered the log slots. Presumably the next term is term 6 — you can't actually tell that from the evidence on the board, but it must be at least 6. Let's imagine that server S3 is chosen as the leader for term 6, and at some point S3, the new leader, is going to want to send out a new log entry — say its first log entry for term 6. So we're thinking about the AppendEntries RPCs that the leader is going to send out to carry the first log entry for term 6, which really should go in slot 13.

The rules in Figure 2 say that an AppendEntries RPC carries, as well as the command that the client sent to the leader and that we want to replicate into the followers' logs, a prevLogIndex field and a prevLogTerm field. When the leader sends out an AppendEntries, it's supposed to include information about the previous slot — the slot just before the new information it's sending out. In this case the index of the previous entry is 12, and the term of the command in the leader's log at that previous entry is 5. So the leader sends that information out to the followers.

Before the followers accept an AppendEntries, they're supposed to check it: they know they've received an AppendEntries for some log entries that start here, and the first thing each receiving follower does is check that its own previous log entry matches the previous-entry information that the leader sent. For server S2 it doesn't match: S2 has an entry at slot 12, all right, but it's an entry from term 4, not from term 5, so S2 is going to reject this AppendEntries and send a false reply back to the leader. S1 doesn't even have anything at slot 12, so S1 is also going to reject the AppendEntries. So far so good: the terrible thing that has been averted at this point — the thing we absolutely don't want to see — is S2 actually sticking the new log entry in at slot 13, which would break the inductive argument that the Figure 2 scheme relies on and hide the fact that S2 actually had a different log. So instead of accepting the log entry, S2 rejects the RPC.

The leader sees the two rejections. The leader maintains a nextIndex field, one for each follower, so it has a nextIndex for S2 and a nextIndex for S1. I should have said this before: if the leader is sending out information about slot 13, that must mean its nextIndex for both of the other servers started out as 13, and that would be the case if this leader had just restarted, because the Figure 2 rules say that nextIndex starts out at the end of the new leader's log. In response to the errors, the leader is supposed to decrement its nextIndex, so it does that for both followers — it got errors from both, decrements both — and resends. This time it sends out AppendEntries with prevLogIndex = 11 and prevLogTerm = 3. This new AppendEntries has a different prevLogIndex, and the entries it carries this time include all of the leader's entries after that new prevLogIndex. S2 now looks at prevLogIndex 11 in its own log and sees: aha, the term there is 3, the same as what the leader is sending me. So S2 is actually going to accept this AppendEntries, and the Figure 2 rules say that if you accept an AppendEntries you're supposed to delete everything in your log after the point where the AppendEntries starts and replace it with whatever's in the AppendEntries. S2 does that, so its log now ends with the 5 and the 6.

S1 still has a problem, because it has nothing at slot 11, so it will return another error. The leader will now back up its nextIndex for S1 to 11, and it'll send out its log starting there, with the previous index and term now referring to slot 10. This one is actually acceptable to S1, so it will accept the new log entries and send a positive response back to the leader, and now they're all caught up. And presumably, when the leader sees that a follower accepted an AppendEntries that carried a certain number of log entries, it increments that follower's nextIndex — here, to 14.

So the net effect of all this backing up is that the leader has used the backup mechanism to detect the latest point at which each follower's log was still equal to the leader's, and then sent each follower, starting from that point, the complete remainder of the leader's log after that last point at which they were equal. Any questions? All right.
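To make the mechanics concrete, here is a rough Go sketch of the follower-side consistency check and the leader-side one-entry-at-a-time nextIndex backup described above. The field names follow the Figure 2 vocabulary, but the trimmed-down Raft struct is an assumption for illustration — term checks, locking, and most other bookkeeping from the lab are omitted.

```go
package raft

// Minimal types for the sketch; the real lab code has many more fields.
type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Raft struct {
	log        []LogEntry // log[0] is a dummy entry, so slot i is log[i]
	nextIndex  []int
	matchIndex []int
}

// Follower side: the Figure 2 consistency check. Reject unless our log has
// an entry at PrevLogIndex whose term equals PrevLogTerm.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false // force the leader to back up nextIndex
		return
	}
	// Accepted: discard everything after PrevLogIndex and take the leader's entries.
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)
	reply.Success = true
}

// Leader side: on success advance nextIndex/matchIndex; on rejection, back up
// nextIndex by one and retry (the slow scheme; see fast backup later).
func (rf *Raft) handleAppendEntriesReply(server int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	if reply.Success {
		rf.matchIndex[server] = args.PrevLogIndex + len(args.Entries)
		rf.nextIndex[server] = rf.matchIndex[server] + 1
	} else if rf.nextIndex[server] > 1 {
		rf.nextIndex[server]--
	}
}
```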
Just to repeat a discussion we've had before, and will probably have again: you'll notice that we erased some log entries here, which I've now erased from the board — I forget exactly what they were. Mostly, remember, we erased this log entry here, which used to say term 4, on server S2. The question is: why was it OK for the system to forget about this client command? The thing we erased corresponds to some client command which we're now throwing away. I talked about this yesterday — what's the rationale? Yeah: it's not on a majority of the servers, and therefore whatever previous leader sent it out couldn't have gotten acknowledgments from a majority of servers. Therefore that previous leader couldn't have decided it was committed, couldn't have executed it and applied it to the application state, and could never have sent a positive reply back to the client. Because this entry isn't on a majority of servers, we know the client who sent it in has no reason to believe it was executed — it couldn't have gotten a reply, because one of the rules is that the leader only sends a reply to a client after it commits and executes. So the client has no reason to believe it was even received by any server, and the rules of Figure 2 basically say that if the client gets no response after a while, it's supposed to resend the request. So whatever request this was that we threw away, it was never executed, never included in any state, and the client is going to resend it by and by.

Yes? Well, it's always deleting a suffix of the follower's log. In the end, the backstop answer to this is that the leader has a complete log, so if all else fails it can just send its complete log to the follower. And indeed, if you've just started up the system and something very strange happened right at the beginning — maybe in some of the tests for Lab 2 — you may end up backing up to the very first entry and having the leader essentially send the whole log. But because the leader has the whole log, it has all the information required to fill in everybody's logs if it needs to.

OK. So in this example, which I guess I've now erased, we elected S3 as the leader, and the question is: who are we allowed to elect as leader? If you read the paper, you know the answer is not just anyone. It turns out it matters a lot for the correctness of the system that we don't allow just anyone to be the leader — for example, the first node whose election timer goes off may in fact not be an acceptable leader — so Raft has some rules it applies about whether or not you can be leader. To see why this is true, let's set up a straw-man proposal: maybe Raft should use the server with the longest log as the leader. In some alternate universe that could be true, and it actually is true in systems with different designs — just not in Raft. So the question we're investigating is: why not use the server with the longest log as leader? This would involve changing the voting rules in Raft to have voters only vote for nodes that have longer logs.

Here's an example that's going to be convenient for showing why this is a bad idea. Let's imagine we have three servers again, and now the logs are set up so that server S1 has entries from terms 5, 6, and 7, server S2 has entries from terms 5 and 8, and server S3 also has entries from terms 5 and 8. The first question, of course — to avoid spending our time scratching our heads about utter nonsense — is to convince ourselves that this configuration could actually arise, because if it couldn't possibly arise, it may be a waste of time to figure out what would happen if it did. So does anybody want to propose a sequence of events whereby this set of logs could have arisen? Or how about an argument that it couldn't have arisen?
Well, maybe we'll come back to that. All right: server S1 wins the election at this point, and it's in term 6. It receives a client request and sends out the first AppendEntries — and that's fine, actually everything's fine so far, nothing's wrong. A good bet for all these scenarios is that then it crashes: it receives the client request in term 6, appends the client request to its own log (which it does first), and it's about to send out AppendEntries, but it crashes, so it never sends out any AppendEntries. Then it restarts very quickly, there's a new election, and gosh, S1 is elected again as the new leader, now in term 7. It receives a client request, appends it to its log, and then it crashes again. After that crash we have a new election; maybe S2 gets elected this time, with S1 down now, so off the table. If S2 is elected at this point, and suppose S1 is still dead, what term does S2 use?

Yeah, 8 is the right answer. So why 8 and not — remember, this is now gone from the board — why 8 and not 6? That's absolutely right: it's not written on the board, but in order for S1 to have been elected it must have gotten votes from a majority of nodes, which includes at least one of S2 and S3. If you look at the RequestVote handling code in Figure 2, if you vote for somebody you're supposed to record the term in persistent storage. That means that S2 or S3 — or both — knew about term 6, and in fact about term 7. Therefore, when S1 dies and they need to elect a new leader, at least one of them knows that the current term is already 7. If only one of them knows about term 7, only that one could win an election, because it has the higher term number; if they both know about term 7, then either one of them could try to become leader, and either way the attempt happens in term 8. So the fact that the next term must be term 8 is ensured by the property that majorities must overlap, together with the fact that currentTerm is updated by RequestVote, is persistent, and is guaranteed not to be lost even if there were some crashes here.

So the next term is going to be 8, and S2 or S3 will win the leadership election. Let's just imagine that whichever one it is sends out AppendEntries for a new client request, and the other one receives it, so now we have this configuration. That was a bit of a detour; we're back to our original question: in this configuration, suppose S1 revives and we have an election. Would it be OK to use S1 — would it be OK to have the rule be that the longest log wins, that the longest log gets to be the leader?
Yeah — obviously not, right? Because S1 as leader is going to force its log onto the two followers by the AppendEntries machinery we just talked about a few minutes ago. If we let S1 be the leader, it's going to send out AppendEntries, back up as needed, and overwrite these 8s — tell the followers to erase their log entries from term 8 and overwrite them with its term 6 and term 7 entries — and then proceed with logs identical to S1's. So why are we upset about this? Yeah, exactly: that term-8 entry was already committed. It's on a majority of servers, so it has quite possibly been committed, probably executed, quite possibly with a reply already sent to a client. We're not entitled to delete it, and therefore S1 cannot be allowed to become leader and force its log onto S2 and S3. Everybody see why that would be a bad idea for Raft? Because of that, longest-log can't possibly be the rule for elections. Of course, shortest log wouldn't work too well either.

In fact, if you read forward to section 5.4.1, Raft has a slightly more sophisticated election restriction that the RequestVote RPC handling code is supposed to check before it votes yes for another peer. The rule is: you vote yes for a candidate that sent you a RequestVote only if the candidate has a higher term in its last log entry than you do, or the same term in its last log entry and a log whose length is greater than or equal to that of the server that received the vote request.

If we apply this here: if S2 gets a vote request from S1, S1 will send a RequestVote with a last-entry term of 7, and S2's last-entry term is 8. So the first clause isn't satisfied — S2 didn't get a request from somebody with a higher term in its last entry — and the last-entry terms aren't the same either, so the second clause doesn't apply. So neither S2 nor S3 is going to vote for S1, and even if S1 sends out its vote requests first because it happens to have a shorter election timeout, nobody is going to vote for it except itself; that's one vote, not a majority. If either S2 or S3 becomes a candidate, then each will accept the other, because they have the same last-entry term and their logs are each greater than or equal in length to the other's, so either of them will vote for the other. Will S1 vote for either of them? Yes, because either S2 or S3 has a higher term number in its last entry than S1 does.

What this rule is doing is preferring candidates that have log entries from higher terms — that is, it prefers candidates that are more likely to have been receiving log entries from the most recent leader. And the second clause says: if we were all listening to the previous leader, then we vote for the server that saw more requests from that very last leader. Any questions about the election restriction?
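Here is a minimal Go sketch of that "at least as up-to-date" check as the RequestVote handler might apply it, reusing the Raft and LogEntry types from the earlier sketch. Term comparisons, the votedFor bookkeeping, and locking are deliberately left out; the RequestVoteArgs fields follow Figure 2, but the helper name is made up.

```go
// RequestVoteArgs carries the candidate's last log entry, per Figure 2.
type RequestVoteArgs struct {
	Term         int
	CandidateId  int
	LastLogIndex int
	LastLogTerm  int
}

// candidateUpToDate implements the election restriction from section 5.4.1:
// grant a vote only if the candidate's log is at least as up-to-date as ours.
func (rf *Raft) candidateUpToDate(args *RequestVoteArgs) bool {
	myLastIndex := len(rf.log) - 1
	myLastTerm := rf.log[myLastIndex].Term
	if args.LastLogTerm != myLastTerm {
		return args.LastLogTerm > myLastTerm // higher term in the last entry wins
	}
	return args.LastLogIndex >= myLastIndex // same last term: longer (or equal) log wins
}
```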
OK, a final thing about sending out log entries: this rollback scheme, at least as I described it and as it's described in Figure 2, rolls back one log entry at a time, and probably a lot of the time that's OK. But there are situations — maybe in the real world, and definitely in the lab tests — where backing up one entry at a time is going to take a long, long time. A real-world situation where that might be true is if a follower has been down for a long time and missed a lot of AppendEntries, and then the leader restarts. If you follow the pseudocode in Figure 2, when a leader restarts it's supposed to set its nextIndex to the end of its own log, so if the follower has been down and missed the last thousand log entries and the leader reboots, the leader is going to have to walk back, one entry at a time, one RPC at a time, over all thousand of those log entries that the follower missed. There's no particular reason why this would never happen in real life; it could easily happen. A somewhat more contrived situation, which the tests definitely explore, is this: say we have five servers and there's a leader, but the leader gets trapped with one follower in a network partition. The leader doesn't know it's not leader anymore, and it keeps sending AppendEntries to its one follower, none of which are committed, while in the other, majority partition the system continues as usual. The ex-leader and the follower in that minority partition could end up putting essentially unlimited numbers of log entries for a stale term into their logs — entries that will never be committed and will need to be deleted and overwritten eventually when they rejoin the main group. That's maybe a little less likely in the real world, but you'll see it happen in the test setup.

So, in order to be able to back up faster, the paper has a somewhat vague description of a faster scheme towards the end of section 5.3. It's a little bit hard to interpret, so I'm going to try to explain its idea about how to back up faster a little bit better. The general idea is to have the follower send enough information to the leader that the leader can jump back over an entire term's worth of entries that have to be deleted per AppendEntries, so the leader may only have to send one AppendEntries per term in which the leader and follower disagree, instead of one per entry.

There are three cases I think are important — and you can probably think of many different log-backup acceleration strategies; here's one. I'm going to divide the kinds of situations you might see into three cases, and I'm only going to talk about one follower and the leader, not the other nodes: we have two servers, S1, which is the follower, and S2, which is the leader. Case 1 is where we need to back up over a term that is entirely missing from the leader's log. Case 2 is where we need to back up over some entries, but they're entries from a term that the leader actually knows about — apparently this follower saw a couple of the very last AppendEntries sent out by a leader that was about to crash, but the new leader didn't see them, and we still need to back up over them. Case 3 is where the follower and the leader agree, but the follower is entirely missing the end of the leader's log.

I believe you can take care of all three of these with three pieces of extra information in the reply that a follower sends back to the leader when it rejects an AppendEntries because the logs don't agree. I'll call them XTerm, XIndex, and XLen. XTerm is the term of the conflicting entry: remember, the leader sent prevLogTerm, and if the follower rejects because it has something at that slot but the term is wrong, it puts the follower's term for the conflicting entry in XTerm — or -1 or something if it doesn't have anything in the log there. The follower also sends back XIndex, the index of the first entry in its log with that conflicting term. And finally, if there wasn't any log entry there at all, the follower sends back XLen, the length of the follower's log.
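Here's a rough Go sketch of the follower side of that scheme — filling in XTerm, XIndex, and XLen when it rejects. It reuses the Raft, LogEntry, and AppendEntriesArgs types from the earlier sketch and replaces the earlier reply type with an extended one; this is just one possible encoding of the idea, not the paper's or the lab's exact interface.

```go
// Extended AppendEntries reply carrying fast-backup hints.
type AppendEntriesReply struct {
	Term    int
	Success bool
	XTerm   int // term of the conflicting entry, or -1 if no entry at PrevLogIndex
	XIndex  int // index of the first entry in the follower's log with term XTerm
	XLen    int // length of the follower's log (including the dummy entry at 0)
}

// rejectWithBackupInfo is a hypothetical helper the follower would call
// instead of just setting Success=false.
func (rf *Raft) rejectWithBackupInfo(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Success = false
	reply.XLen = len(rf.log)
	if args.PrevLogIndex >= len(rf.log) {
		// Case 3: the follower's log is too short; there's no entry at PrevLogIndex.
		reply.XTerm = -1
		reply.XIndex = -1
		return
	}
	// Cases 1 and 2: there is an entry at PrevLogIndex, but its term conflicts.
	reply.XTerm = rf.log[args.PrevLogIndex].Term
	i := args.PrevLogIndex
	for i > 1 && rf.log[i-1].Term == reply.XTerm {
		i-- // walk back to the first entry of the conflicting term
	}
	reply.XIndex = i
}
```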
For case 1, the way this helps is that the leader sees it doesn't have any entry with term XTerm at all in its log — that's the case where the leader didn't have term 5 — and then the leader can simply back the follower up to the beginning of the follower's run of entries with XTerm. That is, the leader sets its nextIndex for that follower to XIndex, the first entry in the follower's run of term-5 entries. So if the leader doesn't have XTerm at all, it should back the follower up to XIndex. The second case the leader can detect is when XTerm is valid and the leader actually does have log entries from term XTerm — that's the case where the disagreement is here, but the leader has some entries from that term. In that case the leader should back up to the last entry it has with the follower's conflicting term — the last entry the leader has for term 4, in this case. And if neither of those two cases holds — that is, the follower indicates, maybe by setting XTerm to -1, that it didn't have anything whatsoever at the conflicting index because its log is too short — then the leader should back up its nextIndex to the end of the follower's log, the last entry the follower has at all, and start sending from there.

I'm telling you this because it'll be useful for doing the lab, and if you missed some of my description, it's in the lecture notes. Any questions about this backing-up business?

Yeah, I think that's true — maybe binary search. I'm not ruling out other solutions. After reading the paper's thin description of how to do it, I cooked this up, and there are probably other ways — probably better and faster ways. I'm sure that if you're willing to send back more information, or use a more sophisticated strategy like binary search, you can do a better job.

Do you need it for the labs? Well, you almost certainly need to do something; experience suggests that in order to pass the tests you'll need to do something like this. Although that's not quite true — one of the solutions I've written over the years actually does the slow thing and still passes the tests. But one of the unfortunate, inevitable things about the tests we give you is that they have a bit of a real-time requirement: the tests are not willing to wait forever for your solution to produce an answer. So it is possible to have a solution that's technically correct but takes so long that the tester gives up, and unfortunately the tester will fail you if your solution doesn't finish within whatever the time limit is. Therefore you do actually have to pay some attention to performance: your solution has to be both correct and fast enough to finish before the tester gets bored and times out on you, which is something like ten minutes. And unfortunately this stuff is complex enough that it's not that hard to write a correct solution that isn't fast enough.

How does the leader tell the cases apart? The follower is supposed to send back the term number it sees in the conflicting entry. We have case 1 if the leader does not have that term in its log. So here the follower will set XTerm to 5, because this is going to be the conflicting entry; the leader observes, oh, I do not have term 5 in my log, therefore this is case 1, and it should back up to the beginning of the follower's run of term-5 entries — the leader has none of them, so it should get rid of all of them in the follower by backing up to XIndex. And does the follower actually delete them? Yeah — the leader is going to back up its nextIndex to here and then send an AppendEntries that starts here, and the rules in Figure 2 say the follower then has to replace its log, so it is going to get rid of the 5s. OK.
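And here's the matching leader side, as a Go sketch of the three cases, reusing the types from the sketches above; the helper name and the exact backup targets are one reasonable reading of the scheme described here, not a definitive implementation (for instance, backing up one slot less or more in case 2 also works, just with different efficiency).

```go
// backupNextIndex uses the fast-backup hints from a rejection to move
// nextIndex back a whole term (or to the end of a short follower log)
// in one step, instead of one entry per RPC.
func (rf *Raft) backupNextIndex(server int, reply *AppendEntriesReply) {
	if reply.XTerm == -1 {
		// Case 3: follower's log is too short; resume just past its last entry.
		rf.nextIndex[server] = reply.XLen
		return
	}
	// Look for the leader's last entry with the follower's conflicting term.
	last := -1
	for i := len(rf.log) - 1; i > 0; i-- {
		if rf.log[i].Term == reply.XTerm {
			last = i
			break
		}
	}
	if last == -1 {
		// Case 1: leader has no entries from XTerm at all; skip the follower's
		// whole run of XTerm entries by jumping to their first index.
		rf.nextIndex[server] = reply.XIndex
	} else {
		// Case 2: leader knows XTerm; back up to its last entry for that term.
		rf.nextIndex[server] = last
	}
}
```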
All right, the next thing I want to talk about is persistence. You'll notice in Figure 2 that the state in the upper left-hand corner is divided: some items are marked persistent and some are marked volatile. What's going on here is that the distinction between persistent and volatile only matters if a server reboots — crashes and restarts. What "persistent" means is that if you change one of the items marked persistent, the server is supposed to write it to disk or to some other non-volatile storage — an SSD, or battery-backed memory, or whatever — that will ensure that if the server restarts, it will be able to find that information and reload it into memory. That's what allows servers to pick up where they left off if they crash and restart.

Now, you might think it would be sufficient, and simpler, to say: if a server crashes, we just throw it away and replace it with a brand-new empty server and bring it up to speed. And of course it is vital to be able to do that, because if some server suffers a catastrophic failure — its disk melts or something — you absolutely need to be able to replace it, and you cannot count on getting anything useful off its disk if something bad happened to the disk. So we absolutely need to be able to completely replace servers with ones that have no state whatsoever. You might think that's sufficient to handle any difficulty, but it's actually not. It turns out that another common failure mode is power failure of the entire cluster, where all the servers stop executing at the same time. In that case we can't handle the failure by simply throwing away the servers and replacing them with new hardware we buy from Dell; we actually have to be able to get off the ground again — we need to be able to get a copy of the state back in order to keep executing, if we want our service to be fault tolerant. Therefore, at least in order to handle simultaneous power failure, we have to have a way for the servers to save their state somewhere where it will be available when the power returns. That's one way of viewing what's going on with persistence: it's the state required to get a server going again, after either a single server failure or a power failure of the entire cluster.

All right, so in Figure 2 only three items are persistent: the log (all the log entries), currentTerm, and votedFor. By the way, when a server reboots it actually has to make an explicit check that these data are valid on its disk before it rejoins the Raft cluster; it has to have some way of saying, "yes, I actually do have saved persistent state," as opposed to a bunch of zeros that aren't valid.

The reason the log has to be persisted is that, at least according to Figure 2, it's the only record of the application state. Figure 2 does not say that we have to persist the application state: if we're running a database, or a test-and-set service like the one for VMware FT, the actual database — or the actual value of the test-and-set flag — isn't persisted according to Figure 2; only the log entries are. So when the server restarts, the only information available to reconstruct the application state is the sequence of commands in the log, and so that has to be persisted. What about currentTerm — why does currentTerm have to be persistent?
Yeah — both currentTerm and votedFor are about ensuring that each term has at most one leader. For votedFor, the specific potentially damaging case is this: a server receives a vote request and votes for server 1, and then it crashes. If it didn't persist the identity of who it had voted for, it might crash, restart, get another vote request for the same term from server 2, and say, "gosh, I haven't voted for anybody, because my votedFor is blank — I'm going to vote for server 2." Now our server has voted for both server 1 and server 2 in the same term, and that might allow two leaders: since server 1 and server 2 each voted for themselves, they both may think they have a majority out of three, and they're both going to become leader. Now we have two simultaneous leaders for the same term. That's why votedFor has to be persistent.

currentTerm is a little more subtle, but we talked before about how, again, we don't want more than one leader per term, and if we don't know what the current term number is, it may be hard to ensure that there's only one leader per term. In this example: if server 1 was down and servers 2 and 3 were going to try to elect a new leader, they need evidence that the correct next term number is 8 and not 6. If they forgot about currentTerm, and it was just servers 2 and 3 voting for each other with only their logs to look at, they might think the next term should be term 6; they'd start producing entries for term 6, and now there's going to be a lot of confusion, because we have two different term sixes. So that's why currentTerm has to be persistent: to preserve evidence about term numbers that have already been used.

These have to be persisted pretty much every time you change them. Certainly the safe thing to do is: every time you add an entry to the log, or change currentTerm, or set votedFor, you persist it, and in a real Raft server that would mean writing it to disk — you'd have some set of files that record this stuff. You may be able to cut some corners if you observe that you don't need to persist these things until you communicate with the outside world, so there may be some opportunity for a little bit of batching: we don't have to persist anything until we're about to reply to an RPC or about to send out an RPC. That may allow you to avoid a few persist operations.
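As a concrete picture of what "persist the three Figure 2 items" could look like, here's a minimal Go sketch that serializes them with encoding/gob and writes them to a file. The file path, the saveState helper, the write-a-whole-temp-file-then-rename scheme, and the string commands are assumptions for illustration; the lab provides its own Persister object instead, and a real server would be more careful and more efficient.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"log"
	"os"
)

// The three items Figure 2 marks persistent.
type PersistentState struct {
	CurrentTerm int
	VotedFor    int
	Log         []LogEntry
}

// String commands for simplicity; the lab's entries hold interface{} commands.
type LogEntry struct {
	Term    int
	Command string
}

// saveState is a hypothetical helper: encode the state, write it to a temp
// file, fsync it, and atomically rename it over the previous state file.
func saveState(path string, st PersistentState) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(st); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(".", "raft-state-")
	if err != nil {
		return err
	}
	if _, err := tmp.Write(buf.Bytes()); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // don't return until the bytes are on the media
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	st := PersistentState{
		CurrentTerm: 8,
		VotedFor:    2,
		Log:         []LogEntry{{Term: 0}, {Term: 5, Command: "x=1"}},
	}
	if err := saveState("raft-state", st); err != nil {
		log.Fatal(err)
	}
}
```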
The reason that matters is that writing stuff to disk can be very expensive. If it's a mechanical hard drive we're talking about, and the way we're persisting is by writing files on the disk, then writing anything costs you about 10 milliseconds: you either have to wait for the point you want to write to spin under the head — and the disk only rotates about once every 10 milliseconds — or worse, you may actually have to seek, moving the arm to the right track. So these persists can be terribly expensive, and for any kind of straightforward design they're likely to be the limiting factor in performance, because they mean that doing anything whatsoever on these Raft servers takes 10 milliseconds a pop, and 10 milliseconds is far longer than it takes to, say, send an RPC or do almost anything else. Ten milliseconds per operation means that if you persist data to a mechanical drive, you can just never build a Raft service that serves more than about 100 requests per second, because that's what you get at 10 milliseconds per operation.

This is really all about the cost of synchronous disk updates, and it comes up in many systems. Think of file systems: the designers of the file systems running on your laptops spend a huge amount of time trying to navigate around the performance problems of synchronous disk writes, because in order to update the file system on your laptop's disk safely, the file system has to be careful about how it writes and sometimes has to wait for the disk to finish writing. So this is a cross-cutting issue in all kinds of systems, and it certainly comes up in Raft.

If you want to build a system that can serve more than a hundred requests per second, there are a bunch of options. One is to use a solid-state drive, or some kind of flash: solid-state drives can do a write to flash memory in maybe a tenth of a millisecond, so that's a factor of a hundred for you. Or, if you're even more sophisticated, maybe you can build yourself battery-backed DRAM, do the persistence into the battery-backed DRAM, and then if the server reboots, hope that the reboot took less time than the battery lasts and that the stuff you persisted is still in the RAM. If you have the money and the sophistication, the reason to favor that is that you can write DRAM millions of times per second, so it's probably not going to be a performance bottleneck at all. So this problem is why the marking of persistent versus volatile in Figure 2 has a lot of significance for performance, as well as for crash recovery and correctness. Any questions about persisting? Yeah?

So your question is basically: you're writing code — say, Go code for your Raft implementation — or you're trying to write a real Raft implementation, and you actually want to make sure that when you persist an update to the log or the current term or whatever, it will in fact be there after a crash and reboot. What's the recipe for making sure it's there? And your observation is right: on Unix or Linux or a Mac, if you call write — the write system call is how you write to a disk file — it is not the case that after write returns, the data is safe on disk and will survive a reboot; it almost certainly is not on disk yet.
The particular piece of magic you need, on Unix at any rate, is this: you write to some file you've opened that's going to contain the stuff you want to persist, and then you have to call fsync. On most systems the guarantee is that fsync doesn't return until all the data you've previously written to that file is safely on the media, in a place where it will still be there if there's a crash. fsync is an expensive call, and that's why it's separate — that's why write doesn't write the disk and only fsync does — because it's so expensive you would never want to do it unless you really wanted to persist some data.

OK, so you can use more expensive disk hardware. The other trick people play a lot is batching: if you have a lot of client requests coming in, maybe you should accept a lot of them and not reply to any of them for a little bit, let a lot of them accumulate, and then persist, say, a hundred log entries at a time from your hundred clients, and only then send out the AppendEntries. Because you do actually have to persist this stuff to disk: if you receive a client request, you have to persist the new entry to disk before you send the AppendEntries RPCs to the followers, because the leader is essentially promising to commit that request and can't forget about it. And indeed, the followers have to persist the new log entry to their disks before they reply to the AppendEntries, because their reply to the AppendEntries is also a promise to preserve and eventually commit that log entry, so they can't be allowed to forget about it if they crash. Other questions about persistence? All right.
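To make the write-then-fsync recipe above concrete, here's a tiny Go sketch; in Go, (*os.File).Sync is the call that issues fsync. The file name and its contents are made up for illustration.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Open (or create) the file that holds the persistent state.
	f, err := os.OpenFile("raft-state", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("currentTerm=8 votedFor=2 ...")); err != nil {
		log.Fatal(err)
	}

	// Write() alone does not mean the bytes are on the disk; they are likely
	// sitting in kernel buffers. Sync() (fsync) does not return until the data
	// has been pushed to the storage device, so it survives a crash or reboot.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```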
One final little detail about persistence is that some of the state in Figure 2 is not persistent, so it's worth scratching your head a little about why it's fair game for commitIndex, lastApplied, nextIndex, and matchIndex to simply be thrown away if the server crashes and restarts. Why wasn't, say, commitIndex or lastApplied persisted? Gosh, lastApplied is the record of how much we've executed; if we throw that away, aren't we going to execute log entries twice, and is that correct? Why is it safe to throw away lastApplied?

Yes — we're all about simplicity and safety here with Raft, so that's exactly correct. The reason those other fields can be volatile and thrown away is that the leader can reconstruct what's been committed by inspecting its own log and by looking at the results of the AppendEntries it sends out to the followers. Initially — if everybody restarts because they experienced a power failure — the leader does not know what's committed or what's been executed, but as it sends out AppendEntries it gathers back information from the followers about how much of their logs match the leader's, and therefore how much must have been committed before the crash.

Another thing about the Figure 2 world — which is not the real world — is that Figure 2 assumes the application state is destroyed and thrown away if there's a crash and a restart. The Figure 2 world assumes that while the log is persistent, the application state is absolutely not; it isn't required to be, because in Figure 2 the log is preserved — persisted — from the very beginning of the system. So what's going to happen, if you play out the various rules in Figure 2 after a leader restarts, is that Raft will eventually re-execute every single log entry: after a reboot, Raft hands the application every log entry starting from entry one, so after a restart the application completely reconstructs its state from scratch by a replay, from the beginning of time, of the entire log. Again, that's a straightforward, elegant plan, but obviously potentially very slow.
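As a toy picture of that replay-from-scratch plan, here's a small Go sketch that rebuilds a key/value table by re-applying every entry of the persisted log in order. It reuses the LogEntry type from the earlier sketches; the PutCommand type and the in-memory map are made up for illustration — the lab instead delivers entries to the service over an apply channel.

```go
// A toy command type for the sketch; the real service defines its own.
type PutCommand struct{ Key, Value string }

// replay rebuilds the application state from the whole log after a restart.
func replay(logEntries []LogEntry) map[string]string {
	kv := make(map[string]string)
	for _, e := range logEntries[1:] { // skip the dummy entry at index 0
		if put, ok := e.Command.(PutCommand); ok {
			kv[put.Key] = put.Value // re-apply every write, in log order
		}
	}
	return kv
}
```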
Which brings us to the next topic: log compaction and snapshots. This has a lot to do with Lab 3B — you'll see log compaction and snapshots in Lab 3B. The problem that log compaction and snapshotting solve in Raft is that, for a long-running system that's been going for weeks or months or years, if we just follow the Figure 2 rules the log keeps on growing; it may end up millions and millions of entries long. That requires a lot of memory to store; if you persist it, every persist of the log uses up a huge amount of space on disk; and if a server ever restarts, it has to reconstruct its state by replaying those millions of log entries from the very beginning, which could take hours. All of which is kind of wasted work, because before it crashed, it already had the application state.

In order to cope with this, Raft has the idea of snapshots. The idea behind snapshots is to ask the application to save a copy of its state as of a particular log entry. We've been mostly ignoring the application, but suppose we're building a key/value store on top of Raft: the log is going to contain a bunch of puts and gets, or read and write requests. Maybe the log contains a put where some client wants to set x to 1, then another one that sets x to 2, then y = 7, or whatever. If there are no crashes, as Raft executes along, there's this application above Raft — if it's a key/value store or a database, it's maintaining this table — and as Raft hands it one command after the next, the application updates its table: after the first command it sets x to 1 in the table, after the second command it updates the table again, and so on.

One interesting fact is that for most applications, the application state is likely to be much smaller than the corresponding log. At some level we know that the log and the state as of some point in the log are kind of interchangeable — they both imply the same thing about the state of the application — but the log may contain a lot of repeated assignments to x that use up a lot of space in the log yet effectively compact down to a single entry in the table, and that's pretty typical of these replicated applications. The point is that instead of storing the log, which may grow to be huge, we have the option of storing the table instead, which might be a lot smaller. That's what snapshots are doing.

So when Raft feels that its log has gotten too large — more than a megabyte, or ten megabytes, or whatever arbitrary limit — Raft asks the application to make a snapshot of the application state as of a certain point in the log. When Raft asks the application for a snapshot, it picks the point in the log that the snapshot refers to and requires the application to produce a snapshot as of exactly that point. This is extremely critical, because what we're about to do is throw away everything before that point: if there isn't a well-defined point that the snapshot corresponds to, then we can't safely throw away the log before that point. So Raft asks for the snapshot — and the snapshot is basically just the table, if this is a database server — and we also need to annotate the snapshot with the log index it corresponds to. If the entries are 1, 2, 3, this snapshot corresponds to the state just after applying log index 3.

With the snapshot in hand — once Raft persists it to disk — Raft never again needs that earlier part of the log, and it can simply throw it away. As long as it persists a snapshot as of a certain log index, plus the log after that index, we're never going to need the log before that point. So this is what Raft does: it asks the application for a snapshot, saves it to disk together with the log after that point, and just throws away the earlier log. So the persistence story now really operates on pairs: a snapshot, plus the log after the point associated with that snapshot. Does everyone see this? Yes?

No — you should still think of it as one log. There are these sort of phantom entries — one, two, three — that we can view as being there in principle; since we never need to look at them, because we have the snapshot, the fact that they happen not to be stored anywhere is neither here nor there. You should think of it as still the same log; we just threw away the early entries. That's maybe a little too glib an answer, because Figure 2 talks about the log in ways that, if you follow it literally, sometimes still need those earlier entries, so you'll have to reinterpret Figure 2 a little bit in light of the fact that it sometimes says "blah blah blah, a log entry" where the log entry no longer exists.

OK, so what happens on a restart? The restart story is a little more complicated than it used to be with just a log. What happens on a restart is that there needs to be a way for Raft to find the latest snapshot/log pair on its disk and hand the snapshot to the application, because we're no longer able to replay all the log entries — there must be some other way to initialize the application. So not only does the application have to be able to produce a snapshot of the application state, it also has to be able to absorb a previously made snapshot and reconstruct its table in memory from it. And even though Raft is managing this whole snapshotting business, the snapshot contents are really the property of the application: Raft doesn't even understand what's in there, only the application does, because it's full of application-specific information. So after a restart, the application has to be able to absorb the latest snapshot that Raft found.

If it were just this, it would be simple. Unfortunately, this snapshotting — in particular, the idea that the leader might throw away part of its log — introduces a major piece of complexity: if there's some follower out there whose log ends before the point at which the leader's log starts, then unless we invent something new — namely InstallSnapshot — that follower can never get up to date. If there's a follower whose log contains only the first two entries, we no longer have log entry 3, which is what we'd need to send that follower in an AppendEntries RPC to allow its log to catch up to the leader's.

Now, we could avoid this problem by having the leader never drop a part of its log if there's any follower that hasn't caught up to the point at which the leader is thinking about taking a snapshot. The leader knows — well, actually the leader doesn't really know, but it could know in principle, through nextIndex — how far each follower has gotten, and the leader could say: I'm just never going to drop the part of my log before the end of the follower with the shortest log.
That would be OK — it might actually just be a good idea, period. The reason it's maybe not such a great idea is that, of course, if a follower is shut down for a week, it's not going to be acknowledging log entries, and that means the leader can't reduce its memory use by snapshotting. So the way the Raft design chooses to go is that the leader is allowed to throw away parts of its log that would be needed by some follower, and we need some scheme other than AppendEntries to deal with the gap between the end of some follower's log and the beginning of the leader's log. That solution is the InstallSnapshot RPC. The deal is this: when there's some follower whose log is short — say it just powered on — the leader is going to send it AppendEntries, be forced to back up, and at some point the rejected AppendEntries calls will cause the leader to realize it has reached the beginning of the log it actually stores. At that point, instead of sending an AppendEntries, the leader will send its current snapshot to the follower, and then presumably immediately follow it with an AppendEntries that carries the leader's current log.

Questions? Yeah — the sad truth is that this adds significant complexity in Lab 3, partially because of the kind of cooperation required between Raft and the application. This is a little bit of a violation of modularity; it requires a good deal of cooperation. For example, when an InstallSnapshot comes in, it's delivered to Raft, but Raft really requires the application to absorb the snapshot, so they have to talk to each other more than they otherwise might.

Yes — the question is whether the way the snapshot is created depends on the application. It absolutely does: the snapshot creation function is part of the application, part of, say, the key/value server. Raft has to somehow call up to the application and say, "gee, I'd really like a snapshot right now," because only the application understands what its state is. And the inverse function, by which the application reconstructs its state from a snapshot file, is also totally application-dependent. There's intertwining there too, because of course every snapshot has to be labeled with the point in the log that it corresponds to.
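Here's a rough Go sketch of the bookkeeping this implies on the Raft side: remember the index and term of the last entry the snapshot covers, and trim the in-memory log up to that point. The lastIncludedIndex/lastIncludedTerm names follow the paper's InstallSnapshot description, but the Snapshot helper, the slicing scheme, and the trimmed-down struct are assumptions for illustration rather than the lab's actual interface.

```go
// A trimmed-down Raft struct for this sketch; reuses LogEntry from earlier.
// rf.log[0] is kept as a dummy entry holding the term of the last entry that
// the snapshot covers, so prevLogTerm lookups at the boundary still work.
type Raft struct {
	log               []LogEntry
	lastIncludedIndex int    // absolute index of the last entry covered by the snapshot
	lastIncludedTerm  int    // term of that entry
	snapshot          []byte // opaque application state; Raft never looks inside
}

// Snapshot is called (e.g. by the key/value server) once it has serialized its
// table as of absolute log index `index`; Raft can then discard entries <= index.
// Assumes index refers to an entry Raft still holds in memory.
func (rf *Raft) Snapshot(index int, snap []byte) {
	if index <= rf.lastIncludedIndex {
		return // already covered by an older snapshot
	}
	rel := index - rf.lastIncludedIndex // position within the in-memory slice
	rf.lastIncludedTerm = rf.log[rel].Term
	rf.log = append([]LogEntry{{Term: rf.lastIncludedTerm}}, rf.log[rel+1:]...)
	rf.lastIncludedIndex = index
	rf.snapshot = snap
	// A real implementation would now persist the snapshot together with the
	// remaining log suffix, as described above.
}
```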
You're talking about rule 6 in Figure 13? OK, so the question here — and you will be faced with this in Lab 3 — is that because the RPC system isn't perfectly reliable or perfectly sequenced, RPCs can arrive out of order, or not at all; or you may send an RPC, get no response, and think it was lost, when actually it was delivered and the reply was lost. All these things happen, including to InstallSnapshot RPCs, and the leader is almost certainly sending out many RPCs concurrently, both AppendEntries and InstallSnapshots. That means you can get things like InstallSnapshot RPCs from deep in the past, or almost anything else, and therefore the follower has to think carefully about an InstallSnapshot that arrives. I think the specific thing you're asking is: if a follower receives an InstallSnapshot that appears to be completely redundant — that is, it contains information older than the information the follower already has — what should the follower do? Rule 6 in Figure 13 says something, but I think an equally valid response is that the follower can simply ignore a snapshot that is clearly from the past. I don't really understand that rule 6.

OK, I want to move on to a somewhat more conceptual topic for a bit. So far we haven't really tried to nail down anything about what it means to be correct — what it means for a replicated service, or really any other kind of service, to be behaving correctly. For most of my life I've managed to get by without worrying too much about precise definitions of correctness, but the fact is that if you're trying to optimize something, or you're trying to think through some weird corner case, it's often handy to have a more or less formal way of deciding whether a behavior is correct or not. Here, what we're talking about is clients sending requests to our replicated service over RPC — maybe the service has crashed and is restarting and loading snapshots, or whatever — and the client sends in a request and gets a response. Is that response correct? How are we supposed to tell whether response A would be correct, or response B? We need a pretty formal notion for distinguishing "that's OK" from "that would be a wrong answer." For this lab, our notion of correctness is linearizability. I've mentioned strong consistency in connection with some of the papers; strong consistency is basically equivalent to linearizability. Linearizability is, more or less, a formalization of the behavior you would expect if there were just one server, it never crashed, it executed client requests one at a time, and nothing funny ever happened.

It has a definition; I'll write out the definition and then talk about it. An execution history — a sequence of client operations, maybe many requests from many clients — is linearizable (this is in the notes) if there exists a total order of the operations in the history that matches the real-time order of the requests: if one client sends out a request and gets a response, and then later in time another client sends out a request and gets a response, those two requests are ordered, because one of them started after the other one finished. So a history is linearizable if there exists an order of the operations in the history that matches real time for non-
64:12 Okay, I want to move on to a somewhat more conceptual topic for a bit. So far we haven't really tried to nail down anything about what it means to be correct — what it means for a replicated service, or really any other kind of service, to behave correctly. For most of my life I managed to get by without worrying too much about precise definitions of correctness, but the fact is that if you're trying to optimize something, or you're trying to think through some weird corner case, it's often handy to have a more or less formal way of deciding whether a given behavior is correct or not. 65:00 So here, what we're talking about is clients sending requests in to our replicated service by RPC; maybe the service has crashed and is restarting and loading snapshots, or whatever; the client sends in a request and gets a response. Is that response correct? How are we supposed to tell whether response A would be correct, or response B? We need a fairly formal notion for distinguishing "that's okay" from "that would be a wrong answer." 65:33 For this lab, our notion of correctness is linearizability. Some of the papers mention strong consistency; strong consistency is basically equivalent to linearizability. Linearizability is a formalization, more or less, of the behavior you would expect if there were just one server, it didn't crash, it executed client requests one at a time, and nothing funny ever happened. 66:09 It has a definition; I'll write out the definition and then talk about it. An execution history — a sequence of client requests, maybe many requests from many clients — is linearizable (this is in the notes) if there exists a total order of the operations in the history such that, first, it matches the real-time order of non-concurrent requests, that is, requests that didn't overlap in time: if one client sends a request and gets a response, and then later in time another client sends a request and gets a response, those two requests are ordered, because one of them started after the other finished; and second, each read sees the value from the most recent preceding write to the same piece of data in the order. 68:08 That's the definition.
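Because the definition only refers to observable invocation times, response times, and values, it can be checked mechanically. Here is a small, self-contained Go sketch (not anything from the labs): it represents each operation with its start and end times and brute-forces all orders of the history, accepting the history if some order respects real-time precedence and makes every read return the most recent preceding write to its key. It also assumes every value read was written by some operation in the history, and it is only practical for tiny histories like the ones on the board.

```go
package main

import "fmt"

// Op is one client operation as observed from outside: when it was invoked,
// when its response arrived, and what it did. Times are abstract integers.
type Op struct {
	Start, End int // invocation and response times
	IsWrite    bool
	Key        string
	Value      int // value written, or value the read returned
}

// validOrder reports whether a candidate total order (a permutation of the
// operation indices) satisfies the two requirements of the definition.
func validOrder(ops []Op, perm []int) bool {
	// 1. Real-time order: if a finished before b started, a must precede b.
	pos := make([]int, len(ops))
	for i, idx := range perm {
		pos[idx] = i
	}
	for a := range ops {
		for b := range ops {
			if ops[a].End < ops[b].Start && pos[a] > pos[b] {
				return false
			}
		}
	}
	// 2. Each read sees the value of the most recent preceding write to its key.
	last := map[string]int{} // latest written value per key, so far in the order
	for _, idx := range perm {
		op := ops[idx]
		if op.IsWrite {
			last[op.Key] = op.Value
		} else if v, ok := last[op.Key]; !ok || v != op.Value {
			return false
		}
	}
	return true
}

// linearizable tries every permutation of the history, looking for one
// satisfying order; its existence is exactly what the definition asks for.
func linearizable(ops []Op) bool {
	perm := make([]int, len(ops))
	for i := range perm {
		perm[i] = i
	}
	var try func(k int) bool
	try = func(k int) bool {
		if k == len(perm) {
			return validOrder(ops, perm)
		}
		for i := k; i < len(perm); i++ {
			perm[k], perm[i] = perm[i], perm[k]
			if try(k + 1) {
				return true
			}
			perm[k], perm[i] = perm[i], perm[k]
		}
		return false
	}
	return try(0)
}
```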
68:10 Let me illustrate what it means by running through an example. First of all, the history is a record of client operations, so this is a definition you can apply from outside: it doesn't appeal in any way to what happens inside the implementation or how the implementation works. If we see a system operating and we can watch the messages that go in and out, we can answer the question of whether the execution we observed was linearizable. 68:44 So let me write out a history and talk about why it is or isn't linearizable. 68:53 Here's an example. Linearizability talks about operations that start at one point and end at another, and that corresponds to the time at which a client sends a request and the later time at which it receives the reply. Let's suppose our history says that at some particular time a client sent a write request for the data item named x, asking for it to be set to 1; time passed, and at a second point that client got a reply. Then later in time that client — or some other client, it doesn't really matter — sends another write request, for item x with value 2, and gets a response to that write. Meanwhile some client sends a read of x and gets value 2 back, and there's another request we observed as part of the history: a read of x that got value 1 back. 70:12 When we have a history like this, the question to ask is: is this a linearizable history? That is, did the machinery, the service, the system that produced this history produce a linearizable history? If this history is not linearizable, then, say in lab 3, we know we have a problem: there must be some bug. 70:38 Okay, so we need to analyze this to figure out whether it's linearizable. Linearizability requires us to produce an order — a one-by-one order of the four operations in the history. So we're looking for an order, and there are two constraints on that order. One is that if one operation finished before another started, then the one that finished first has to come first in the order. The other is that if some read sees a particular written value, then the read must come after that write in the order. 71:20 So we want to produce an order with four entries — the two writes and the two reads — and I'm going to draw, with arrows, the constraints implied by those two rules; our order is going to have to obey those constraints. 71:36 One constraint is that the write of x=1 finished before the write of x=2 started, and therefore the write of x=1 must appear in the total order before the write of x=2. One read saw the value 2, so in the total order the write of x=2 must be the most recent write before it: the read must come after the write of x=2. That means in the total order we must see the write of x=2 and then, after it, the read of x that yields 2. And for the read of x that yields 1 — if we assume x didn't already have the value 1 — there must be the same kind of relationship: that read must come after the write of x=1, and it must also come before the write of x=2 (and maybe there are some other restrictions too). 72:35 Anyway, we can take this set of arrows and flatten it out into an order, and that actually works. The total order that demonstrates this history is linearizable is: first the write of x=1, then the read of x yielding 1, then the write of x=2, then the read of x yielding 2. 73:03 So the fact that there is an order obeying the ordering constraints shows that this history is linearizable; if we're worried about whether the system that produced this history is linearizable, this particular example doesn't contradict the presumption that it is. Any questions about what I just did? 73:29 [Question.] Each read — say a read of x — must see the value written by the most recent preceding write in the order. In this case we're totally okay with this order, because for each read the value it saw is indeed the value written by the most recent write to x in the order. Informally, reads should not yield stale data: if I write something and read it back, gosh, I should see the value I wrote, and this is a formalization of that notion. 74:27 Oh yes — all right, let me write up an example that's not linearizable. Example two: let's suppose our history has a write of x with value 1 that finishes before a write of x with value 2 starts, a read of x that yields 2, and a read of x that yields 1 which starts only after the read yielding 2 has finished. 75:14 For this one we also want to write out the arrows, so we know the constraints on any total order we might find. The write of x=1, because it finished in real time before the write of x=2 started, must come before it in any satisfying order we produce. The write of x=2 has to come before the read of x that yields 2. The read of x that yields 2 finished before the read of x that yields 1 started, so we have that arrow. And the read of x that yields 1, because it saw value 1, has to come after the write of x=1 and, more crucially, before the write of x=2 — we can't have this read yielding 1 if it's immediately preceded by the write of x=2 — so we also have an arrow like that. 76:18 And because there's a cycle in these constraints, there's no order that can obey all of them, and therefore this history is not linearizable, and so the system that produced it is not a linearizable system.
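Continuing the sketch above, the two histories from the board can be written down as data and fed to the checker. The concrete start/end times below are illustrative placements consistent with the overlaps and constraints described, not values from the lecture.

```go
func main() {
	// Example 1: W(x,1), W(x,2), a read that returned 2, a read that returned 1,
	// with the reads overlapping the writes; expected result: linearizable.
	ex1 := []Op{
		{Start: 0, End: 10, IsWrite: true, Key: "x", Value: 1},
		{Start: 20, End: 30, IsWrite: true, Key: "x", Value: 2},
		{Start: 15, End: 35, Key: "x", Value: 2}, // read that returned 2
		{Start: 5, End: 25, Key: "x", Value: 1},  // read that returned 1
	}
	// Example 2: same writes, but the read that returns 1 starts only after the
	// read that returned 2 has finished; expected result: not linearizable.
	ex2 := []Op{
		{Start: 0, End: 10, IsWrite: true, Key: "x", Value: 1},
		{Start: 20, End: 30, IsWrite: true, Key: "x", Value: 2},
		{Start: 15, End: 24, Key: "x", Value: 2}, // read that returned 2
		{Start: 25, End: 35, Key: "x", Value: 1}, // read that returned 1
	}
	fmt.Println(linearizable(ex1), linearizable(ex2)) // prints: true false
}
```

The second history fails for exactly the reason sketched with the arrows: any order must place the write of x=2 before the read that returned 2, that read before the read that returned 1, and that last read before the write of x=2, which is a cycle.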
76:42 It would have been linearizable if the history had been missing any one of those three constraints; that would break the cycle. 76:47 Yes? Maybe — I'm not sure, because I don't know how to incorporate very strange things like somebody reading 27: if there's no write of 27, a read of 27 doesn't fit in, at least the way I've written out the rules — well, there may be some sort of anti-dependency you could construct. 77:29 Okay, I will continue this discussion next week.