Alright, last time I started talking about linearizability, and I want to finish up this time. The reason we're talking about it again is that it's our standard definition for what strong consistency means in storage-style systems. For example, your lab 3 needs to be linearizable. Sometimes linearizability will come up because we're talking about a strongly consistent system and wondering whether a particular behavior is acceptable; other times it will come up because we're talking about a system that isn't linearizable, and we're wondering in what ways it might fall short of, or deviate from, linearizability.

So one thing you need to be able to do is look at a particular sequence of operations, a particular execution of some system that executes reads and writes (like your lab 3), and answer the question: was that sequence of operations I just saw linearizable or not? We're going to continue practicing that a little bit now, plus I'll try to establish some interesting facts that will be helpful for us about the consequences for the systems we build and look at.

Linearizability is defined on a particular operation history. The thing we're always talking about is: we observed some sequence of requests by clients, they got responses at various times, they asked to read various data and got various answers back. Is the history we saw linearizable?

Okay, so here's an example of a history that might or might not be linearizable. Suppose at some point in time (time is going to move to the right) some client sends a request; this vertical bar marks the time at which the client sent it. I'm going to use this notation to mean that the request is a write that asks to set variable x (a key, or whatever) to value 0; a key and a value, so this would correspond to a Put of key x and value 0 in lab 3. We're watching what the client sends: the client sent this request to our service, and at some point the service responded and said, yes, your write has completed. We're assuming the service is of a nature that actually tells you when the write completes; otherwise the definition isn't very useful.

So we have this request by somebody to write, and then I'm imagining in this example that there's another request. Because I'm putting this mark here, the second request started after the first request finished. The reason that's important is the rule that a linearizable history must match real time. What that really means is that if one request is known in real time to have started after some other request finished, then the second request has to occur after the first request in whatever order we work out as the proof that the history is linearizable.
Okay, so in this example I'm imagining there's another request that asks to write x to value 1, and then a concurrent request, maybe started a little bit later, that asks to set x to 2. So now we have maybe two different clients that issued requests at about the same time to set x to two different values, and of course we're wondering which one is going to be the real value.

Then we also have some reads. If all you have is writes, it's hard to say much about linearizability, because you don't have any proof that the system actually did anything or revealed any values. We really need reads. So let's imagine we have some reads; you'll see R in the history. A client asked to read at this time, and by this second time it got an answer: it read key x and got value 2, so presumably it actually saw that value. And then there was another request, maybe by the same client or a different client, but known to have started after the first read finished, and this read of x got value 1.

The question in front of us is: is this history linearizable? There are two strategies we can take. One is to cook up a sequence: if we can come up with a total order of these five operations that obeys real time, and in which each read sees the value written by the most recently preceding write in the order, then that order is a proof that the history is linearizable. The other strategy is to observe that each of these rules may imply certain "this comes before that" edges in a graph; if we can find a cycle in that graph (this operation must come before that operation, and so on around in a circle), that's proof that the history isn't linearizable. And for small histories we may actually be able to enumerate every single order and use that to show a history isn't linearizable. Any thoughts about whether this might or might not be linearizable?

Okay, so the observation is that it's a little bit troubling that we saw the read with value 2 and then the read with value 1, and maybe that contradicts the writes: there were two writes, one with value 1 and one with value 2. Certainly, if we had seen a read with value 3, that would obviously be something gone terribly wrong. But we had writes of 1 and 2, and reads of 1 and 2, so the question is whether this order of reads can possibly be reconciled with the way these two writes show up in the history.

Okay, the game we're playing is that we have maybe two or three clients, and they're talking to some service, maybe a Raft-based service, and what we're seeing is requests and responses. So this interval means we saw a request from a client to write x to 1 (a Put request for x and 1), and we saw the response here; what we know is that somewhere during this interval of time the service presumably internally changed the value of x to 1. And this other interval means that somewhere within it the service presumably changed its internal idea of the value of x to 2. But it's just somewhere in that time; it doesn't mean it happened at the start, or at the end. Does that answer your question?
Okay, so the next observation is that this history is linearizable, and it's been accompanied by an actual proof of linearizability, namely a demonstration of an order that shows it. Yes, it's linearizable, and the order is: first, the write of x with value 0. The server got the two concurrent writes at roughly the same time, but it still had to choose the order itself; let's say it executed the write of x to value 2 first. Then it executed the first read of x, which at that point would yield 2. Then the next operation it executed was the write of x to 1, and the last operation in the history is the read of x yielding 1.

This is proof that the history is linearizable, because here is a total order of the operations, and it matches real time. Let's just go through it. The write of x to 0 comes first; that's totally intuitive, since it actually finished before any other operation started. The write of x to 2 comes second; I'll mark with an X the real time at which we imagine each operation took effect, to demonstrate that the order matches real time. That's the second operation. Then we're imagining that the next operation is the read of x yielding 2. There's no real-time problem, because that read was actually issued concurrently with the write of x to 2. It's not as if the read finished and only then did the write start; they really are concurrent. So we'll just imagine that the point in time at which this operation took effect is right there: first operation, second, third. Now we have the write of x to 1; let's say it happens here in real time. It just has to happen after the operations that come before it in the order, so that's the fourth operation. And now we have the read of x yielding 1, which can happen at pretty much any time; let's say it happens here.

So we have the order, and this is the demonstration that the order is consistent with real time: we can pick a time for each operation, within its start and end time, such that this total order matches the real-time order. The final question is whether each read sees the value written by the most closely preceding write of the same variable. There are two reads: this read is most closely preceded by a write with the correct value, so that's good, and this read is also most closely preceded by a write of the same value. Okay. So this is a demonstration that this history is linearizable.
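To make that concrete, here's a minimal sketch in Go of the check we just did by hand. The Op representation and the helper names are my own invention, not anything from the labs; it verifies that a candidate total order obeys the real-time rule, and that every read returns the value of the most recently preceding write (keys are assumed to start at value 0).

```go
package main

import "fmt"

// Op is one operation from a history: a write or read of a key, with the
// real-time interval [Start, End] during which the request was outstanding.
// For a read, Val is the value the client actually got back.
type Op struct {
	Kind       byte // 'W' or 'R'
	Key        string
	Val        int
	Start, End int
}

// checkOrder reports whether a proposed total order is a valid
// linearization: an operation that finished before another started must
// come first, and each read must see the most recent preceding write.
func checkOrder(order []Op) bool {
	for i := range order {
		for j := i + 1; j < len(order); j++ {
			// order[j] is placed after order[i], so it must not have
			// finished in real time before order[i] started.
			if order[j].End < order[i].Start {
				return false // violates the real-time rule
			}
		}
	}
	last := map[string]int{} // latest written value per key; zero initially
	for _, op := range order {
		if op.Kind == 'W' {
			last[op.Key] = op.Val
		} else if op.Val != last[op.Key] {
			return false // a read didn't see the closest preceding write
		}
	}
	return true
}

func main() {
	// The history from the board: Wx0 finishes first; Wx1 and Wx2 are
	// concurrent; the read that saw 2 finishes before the read that saw 1.
	wx0 := Op{'W', "x", 0, 0, 10}
	wx1 := Op{'W', "x", 1, 20, 60}
	wx2 := Op{'W', "x", 2, 25, 65}
	rx2 := Op{'R', "x", 2, 30, 40}
	rx1 := Op{'R', "x", 1, 50, 70}

	// The order we worked out in class: Wx0, Wx2, Rx2, Wx1, Rx1.
	fmt.Println(checkOrder([]Op{wx0, wx2, rx2, wx1, rx1})) // true
}
```

For a small history like this you could also call checkOrder on all 120 permutations of the five operations; if none passes, that's exactly the brute-force proof that a history isn't linearizable.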
Now, it depends on what you thought when you first saw the history, but it's not always immediately clear with a setup this complicated. It's easy to be tricked when looking at these histories: you might think, oh, the write of x to 1 started first, so the first value written must be 1, but that's actually not required here. Any questions about this?

Question: what if these two operations were moved, like this? Okay, so if the write with value 2 was only issued by the client after the read of x returning 2 had completed, that wouldn't be linearizable, because any order we come up with has to obey the real-time ordering. Any such order would have to have the read of x yielding 2 precede the write of x to 2, and since there's no other write of x to 2 in sight, a read at that point could only see 0 or 1, because those are the only other writes that could possibly come before it. So shifting those two by that much would make the example not linearizable.

Question: am I saying that the first vertical line is the moment the client sends the request, and the second vertical line is the moment the client receives the response? Yes. This is a very client-centric kind of definition. It says clients should see the following behavior, and whatever happens after you send a request (maybe there are a lot of replicas, maybe a complicated network, who knows what) is almost none of our business. The definition is only about what clients see. There are some gray areas, which we'll come to in a moment, like what happens if the client needs to retransmit a request; that's something we'll have to think about.

Okay, so that one was linearizable. Here's another example. I'm going to start out with it almost identical to the first example: again we have a write of x with 0, we have the same two concurrent writes, and we have the same two reads. Those are so far identical to the previous example, so we know that this much alone must be linearizable. But I'm going to add to it. Let's imagine client 1 issued those two reads; the definition doesn't really care about clients, but for our own sanity we'll assume client 1 read x and saw 2, and then later read x and saw 1. That's okay so far. Now I'll say there's another client, client 2, and client 2 does a read of x and sees 1, and then a second read of x and sees 2. So: is this linearizable? We either have to come up with an order, or find a "comes before" graph that has a cycle in it.

The thing this is getting at, the puzzle, is that there are only two writes here, so in any order either one write comes first or the other does. And intuitively, client 1 observed that the write with value 2 came first and then the write with value 1.
These two reads mean it has to be the case that, in any legal order, the write of 2 comes before the write of 1, in order for client 1 to have seen what it saw; that's the same order we saw in the first example. But symmetrically, client 2's experience clearly shows the opposite: client 2 saw the write of 1 first and then the write with value 2. And one of the rules here is that there is just one total order of operations. You're not allowed to have different clients see different histories, different progressions or evolutions of the values stored in the system. There can only be one total order, and all clients have to experience operations consistent with that one order. Client 1's reads clearly imply that the order is the write of 2 and then the write of 1, so we should not be able to have any other client observe proof that the order was anything else, which is what we have here. So that's a somewhat intuitive explanation of what's going wrong.

By the way, the reason this could come up in the systems we build and look at is that we're building replicated systems, Raft replicas or maybe systems with caching in them, systems that have many copies of the data. So there may be many servers with copies of x, possibly with different values at different times (if they haven't gotten the commits yet, say): some replicas may have one value, some the other. But in spite of that, if our system is linearizable, or strongly consistent, it must behave as if there were only one copy of the data and one linear sequence of operations applied to that data. That's why this is an interesting example: it could come up in a buggy system that had two copies of the data, where one copy executed these writes in one order and the other replica executed them in the other order. Linearizability says no: we're not allowed to see that in a correct system.

The cycle in the "comes before" graph, the slightly more proof-like demonstration that this is not linearizable, goes as follows. The write of 2 has to come before client 1's read of 2, since that read saw 2; there's one arrow. Client 1's read of 2 has to come before the write of x with value 1: client 1 read 2 and then, strictly later, read 1, so a write of 1 must fall between those two reads in the order, and there's only one write with value 1 available, so that write must slip in after the read that saw 2. (You could imagine the write of 1 happening very early in the order, but in that case client 1's second read wouldn't see 1; it would see 2, since we know the first read saw 2.) Next, the write of x to 1 must come before any read of x that sees value 1, including client 2's read of 1. And in order for client 2's first read to see 1 and its second read to see 2, the write of x to 2 must come between those two reads in the order; so client 2's read of 1 must come before the write of x to 2. And that's a cycle. So there's no linear order that can obey all of these time and value rules, and there isn't one precisely because there's a cycle in the "comes before" graph.
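Here's the same argument as a tiny sketch of the second strategy. The edges are exactly the four arrows derived above, encoded by hand (the node names are mine); only the cycle detection itself is mechanical.

```go
package main

import "fmt"

// findCycle does a depth-first search over a "comes before" graph and
// reports whether any back edge (and therefore a cycle) exists.
func findCycle(graph map[string][]string) bool {
	const (
		unvisited = iota
		inStack
		done
	)
	state := map[string]int{}
	var visit func(n string) bool
	visit = func(n string) bool {
		state[n] = inStack
		for _, m := range graph[n] {
			if state[m] == inStack {
				return true // back edge: we found a cycle
			}
			if state[m] == unvisited && visit(m) {
				return true
			}
		}
		state[n] = done
		return false
	}
	for n := range graph {
		if state[n] == unvisited && visit(n) {
			return true
		}
	}
	return false
}

func main() {
	// The four arrows from the second example on the board.
	graph := map[string][]string{
		"Wx2":    {"C1:Rx2"}, // a write precedes any read that saw its value
		"C1:Rx2": {"Wx1"},    // the only Wx1 must land between client 1's reads
		"Wx1":    {"C2:Rx1"}, // a write precedes any read that saw its value
		"C2:Rx1": {"Wx2"},    // Wx2 must land between client 2's reads
	}
	fmt.Println(findCycle(graph)) // true: no legal total order exists
}
```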
Yes? That's a good question. This definition is a definition about histories, not about systems. It's not saying that a system design is linearizable because of something about the design; it's really only history by history. If we don't get to know how the system operates internally, and the only thing we can do is watch it while it executes, then before we've seen anything we just don't know; we'll assume it's linearizable. Then we see more and more sequences of operations, and if they're all consistent with linearizability, all following these rules, we come to believe the system is probably linearizable. And if we ever see one that isn't, then we realize it's not linearizable. So it's not a definition on the system design; it's a definition on what we observe the system to do. In that sense it's maybe a little bit unsatisfying if you're trying to design something, since it's not a recipe for how you design. Except in a trivial sense: with a single server, one copy of the data, not threaded or multi-core or anything, it's a little bit hard to build a system that violates this in such a simple setup. But it's super easy to violate in any kind of distributed system.

Okay, so the lesson from this is that there can only be one order in which the system is observed to execute the writes: all clients have to see values consistent with the system executing the writes in the same one order.

Here's another, very simple, example history. Suppose we write x with value 1, and then, definitely subsequently in time, maybe from another client, a write of x with value 2 is launched, and that client sees a response back from the service saying, yes, I did the write. And then a third client does a read of x and got value 1. This is a very easy example: it's clearly not linearizable, because the time rule means the only possible order is the write of x to 1, then the write of x to 2, then the read of x yielding 1. That has to be the order, and that order clearly violates the second rule, about values: the value written by the most recent write in the only possible order is not 1, it's 2. So this is clearly not linearizable.

The reason I'm bringing it up is that this is the argument that a linearizable system, a strongly consistent system, cannot serve up stale data.
And the reason this might come up is, again, that maybe you have lots of replicas, each of which maybe hasn't seen all the writes, or all the committed writes. Maybe all the replicas have seen the first write, but only some replicas have seen the second, and if you ask a replica that's lagging behind a little bit, it's still going to have value 1 for x. But nevertheless, clients should never be able to see that old value in a linearizable system: no stale data allowed, no stale reads.

Question about the overlapping writes: yeah, if there's overlap in the intervals, then the system could legally execute either of them at any real time within its interval, and that's the sense in which the system could execute them in either order. If it weren't for the two reads, the system would have total freedom to execute the writes in either order; but because we saw the two reads, we know the only legal order is 2 and then 1. If the two reads had overlapped too, then either order would do, and the reads could have seen either result: until the system committed to the values for the reads, it still had freedom to return them in either order.

Question: am I using "linearizable" and "strongly consistent" as synonyms? Yes, I'm using them as synonyms. For most people (although possibly not today's paper) linearizability is well defined, and people's definitions don't really deviate very much from this. Strong consistency, though: I think there's less consensus about exactly what that definition might be. It's usually meant in ways that are quite close to this, for example that the system behaves the same way a system with only one copy of the data would behave, which is quite close to what we're getting at with this definition. But it's reasonable to assume that strong consistency is the same thing as linearizability.

Okay, so that example is not linearizable, and the lesson is that reads are not allowed to return stale data, only fresh data: you can only return the result of the most recently completed write.

Okay, I have a final little example. We have two clients. One of them submits a write of x with value 3, and then a write of x with value 4. And we have another client, and at this point in time that client issues a read of x, but (and this gets at a question you asked) the client doesn't get a response. Who knows why. In the actual implementation, maybe the leader crashed at some point; maybe client 2 sent in the read request and the leader didn't get it because the request was dropped; maybe the leader got the request and executed it, but the network dropped the response; or maybe the leader got it and started to process it and crashed before finishing, or processed it and crashed before sending the response. From the client's point of view, it sent a request and never got a response. So in the interior machinery of the client, for most of the systems we're talking about, the client is going to resend the request, maybe to a different leader, maybe to the same one, who knows. So it sent the first request here; it times out with no response; it sends a second request at this point; and then it finally gets a response.
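Here's roughly what that client-side machinery looks like, as a sketch; the Clerk structure, the server list, and the call function are hypothetical stand-ins rather than lab 3's actual code. The important detail is that every resend carries the same (ClientID, Seq) pair, so the servers can recognize it as a duplicate.

```go
// A minimal sketch of the client retry loop described above, with
// invented types; not the real lab code.
type GetArgs struct {
	Key      string
	ClientID int64
	Seq      int64 // per-client request number; identical across resends
}

type GetReply struct {
	Value string
	OK    bool
}

type Clerk struct {
	servers []string
	leader  int // index of the server we currently believe is the leader
	id      int64
	seq     int64
	call    func(server, method string, args, reply interface{}) bool
}

func (ck *Clerk) Get(key string) string {
	ck.seq++
	args := GetArgs{Key: key, ClientID: ck.id, Seq: ck.seq}
	for {
		var reply GetReply
		// On timeout or failure, try the next server, but resend the
		// very same request, not a new one.
		if ck.call(ck.servers[ck.leader], "KVServer.Get", &args, &reply) && reply.OK {
			return reply.Value
		}
		ck.leader = (ck.leader + 1) % len(ck.servers)
	}
}
```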
It turns out (and you're going to implement this in lab 3) that a reasonable way for servers to deal with repeated requests is to keep a table, indexed by some kind of unique request number or identifier from the clients, in which the servers remember: I already saw that request, I executed it, and this was the response I sent back. You don't want to execute a request twice; if it's a write request, for example, you really don't want to execute the write twice. So the servers have to be able to filter out duplicate requests, and they have to be able to repeat the reply they originally sent to that request, since the original reply was perhaps dropped by the network: the servers remember the original reply and repeat it in response to the resend.

If you do that, which you will in lab 3, then, since the leader could have seen value 3 when it executed the original read request from client 2, it could return value 3 to the repeated request that was sent at this later time and completed at this still later time. And we have to make a call on whether that is legal. You could argue: gosh, the client re-sent the request here, and that was after the write of x to 4 completed, so surely what you should return at this point is 4 instead of 3.

This is a little bit up to the designer. But if what you view as going on is that the retransmissions are a low-level concern, part of the RPC machinery or hidden in some library, and that from the client application's point of view all that happened is that it sent a request at this time and got a response at that time, then a value of 3 is totally legal here. This request took a long time; it's completely concurrent with the write, not ordered in real time with respect to it; and therefore either the 3 or the 4 is valid, as if the read request really executed here in real time, or here. So the larger lesson is: if you have client retransmissions, and you're defining linearizability from the application's point of view, then the real-time extent of a request runs from the very first transmission of the request to the final time at which the application actually got the response, maybe after many resends.
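And here's a sketch of the server's side of that table, reusing the GetArgs and GetReply types from the client sketch above. The structure and names are mine, and real lab 3 code would funnel these operations through Raft before executing them; that layer is omitted here.

```go
import "sync"

// lastOp records, per client, the most recently executed request number
// and the reply that was originally sent back for it.
type lastOp struct {
	Seq   int64
	Reply GetReply
}

type KVServer struct {
	mu   sync.Mutex
	data map[string]string
	dup  map[int64]lastOp // client ID -> last executed request
}

func (kv *KVServer) Get(args *GetArgs, reply *GetReply) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	// A duplicate of an already-executed request: repeat the original
	// reply (which the network may have dropped) instead of re-executing.
	if rec, ok := kv.dup[args.ClientID]; ok && rec.Seq == args.Seq {
		*reply = rec.Reply
		return
	}
	// Execute the request, then remember the reply in case the client
	// times out and resends this same request later.
	reply.Value = kv.data[args.Key]
	reply.OK = true
	kv.dup[args.ClientID] = lastOp{Seq: args.Seq, Reply: *reply}
}
```

Note that this is exactly the design that can legally return 3 to the re-sent read: the reply saved when the request first executed is simply repeated.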
Yes? The concern is that you might rather get fresh data than stale data. Suppose the request is "what time is it?": it's a time server, and I send a request asking what time it is, and it sends me a response. If I send a request now and don't get the response until two minutes from now, due to some network issue, it may be that the application would prefer to see a time close to the time at which it actually got the response, rather than a time deep in the past when it originally sent the request. Now, the fact is that if you're using a system like this, you have to write applications that are tolerant of these rules. If you're using a linearizable system, these are the rules, and correct applications must be tolerant of them: if they send a request and get a response a while later, they are not allowed to be written as if "I got a response, so the value at the time I got the response was equal to 3". That is not okay for applications to assume. How that plays out for a given application depends on what the application is doing.

The reason I bring this up is that it's a common question in 6.824. You will implement the machinery by which servers detect duplicates and resend the previous answer that the server originally sent, and the question will come up: is it okay, if you originally saw the request here, to return at this later point in time the response you would have sent back here if the network hadn't dropped it? It's handy to have a way of reasoning about questions like that; one reason to have definitions like linearizability is to be able to reason about such questions. And using this scheme we can say: it actually is okay, by those rules.

Alright, that's all I want to say about linearizability, aside from any lingering questions. Yes?

Well, maybe I'm taking liberties here, but what's going on is that in real time we have a read of 2 and a read of 1, and the read of 1 really came after the read of 2 in real time, so it must come after it in the final order. That means there must be a write with value 1 somewhere in here: after the read of 2 in the final order, and before the read of 1. Between the read of 2 and the read of 1 in that order, there must be a write with value 1. There's only one write with value 1 available (if there were more than one, maybe we could play games, but there's only the one), so that write must slip in here in the final order, and that's why I felt able to draw this arrow. These arrows just capture, one by one, the implications of the rules for what the order must look like.

Another question, about whether one client's reads can be ordered against the other client's: well, we're not really able to say which of those two reads came first, so if we mean an arrow to constrain the ultimate order, we're not allowed to draw that one; those two reads could come in either order. It could be that there's actually a simpler cycle than the one I've drawn. Certainly the damage is in these four operations; I agree that these four operations are the main evidence that something is wrong.
Now, whether there's a cycle that involves just those four, I'm not sure; there could be. This is worth thinking about, because if I can't think of anything better, I'll certainly ask you a question about linearizable histories on the midterm.

Okay, so: today's paper, ZooKeeper. Part of the reason we're even reading the ZooKeeper paper is that it's a successful real-world system. It's an open-source service that a lot of people actually run, and it's been incorporated into a lot of real-world software, so there's a certain reality and success to it. That makes it attractive from the point of view of supporting the idea that ZooKeeper's design might actually be a reasonable design. But the reason we're interested in it, the reason I'm interested in it, comes down to two somewhat more precise technical points.

So, why are we looking at this paper? One reason: in contrast to Raft (the Raft you've written, and Raft as it's defined), which is really a library. You can use a Raft library as part of some larger replicated system, but Raft isn't a standalone service you can talk to; you really have to design your application to interact with the Raft library explicitly. So you might wonder, and it's an interesting question, whether some useful standalone, general-purpose system could be defined that would be helpful to people building distributed systems. Is there some service that can bite off a significant portion of what makes it painful to build distributed systems, and package it up in a standalone service that anybody can use? This is really the question of what an API would look like for a general-purpose (I'm not sure what the right name for things like ZooKeeper is) coordination service.

The other interesting aspect of ZooKeeper is about performance. When we build replicated systems (and ZooKeeper is a replicated system: among other things, it's a fault-tolerant, general-purpose coordination service, and it gets fault tolerance, like most systems, by replication), there are a bunch of servers, maybe three or five or seven or who knows what. It takes money to buy those servers; a seven-server ZooKeeper setup is seven times as expensive as a simple single server. So it's very tempting to ask: if you buy seven servers to run your replicated service, can you get seven times the performance out of them? And how could we possibly do that? The question is: we have N times as many servers; can that yield us N times the performance?

I'm going to talk about the second question first. From the point of view of this discussion about performance, I'm just going to view ZooKeeper as some service (we don't really care what the service is) replicated with a Raft-like replication system. ZooKeeper actually runs on top of a thing called Zab, which for our purposes we'll just treat as being almost identical to Raft.
And I'm just worried about the performance of the replication; I'm not really worried about what ZooKeeper specifically is up to. The general picture is that we have a bunch of clients, maybe hundreds of clients, and, just as in the labs, we have a leader. The leader has a ZooKeeper layer that clients talk to, and under the ZooKeeper layer is the Zab layer that manages replication. Just as with Raft, a lot of what Zab is doing is maintaining a log that contains the sequence of operations that clients have sent in; really very similar to Raft. We have a bunch of replicas, and each of them has a log to which it appends new requests. That's a familiar setup: the client sends in a request, the Zab layer sends a copy of that request to each of the replicas, and the replicas append it to their in-memory log, and probably persist it onto disk so they can get it back if they crash and restart.

So the question is: as we add more servers (we could have four servers, or five, or seven, whatever), does the system get faster as we add more CPUs, more horsepower? Do you think your labs will get faster as you add more replicas, assuming each replica is its own computer, so that you really do get more CPU cycles as you add more replicas?

Yeah, there's nothing about this setup that makes it faster as you add more servers; that's absolutely true. As we add more servers, the leader is almost certainly the bottleneck, because the leader has to process every request, and it sends a copy of every request to every other server. As you add more servers, you just add more work to this bottleneck node. You're not getting any performance benefit out of the added servers, because they're not really doing anything; they're all just happily doing whatever the leader tells them to do, they're not subtracting from the leader's work, and every single operation goes through the leader. So here the performance is inversely proportional to the number of servers you add: add more servers, and throughput almost certainly gets lower, because the leader just has more work. So in this system we have the problem that more servers makes the system slower. That's too bad; these servers cost a couple thousand bucks each, and you would hope you could use them to get better performance. Yes?

Okay, so the question is: what if the requests, maybe from different clients, or successive requests from the same client, apply to totally different parts of the state? In a key-value store, say, maybe one of them is a Put on x and the other is a Put on y: nothing to do with each other. Can we take advantage of that? The answer is: absolutely, but not in this framework; the sense in which we can take advantage of it here is very limited. At a high level, the requests all still go through the leader.
The leader still has to send every request out to all the replicas, and the more replicas there are, the more messages the leader has to send. So at a high level, this sort of commutativity of requests is not likely to help this situation. It's a fantastic thought to keep in mind, though, because it'll absolutely come up in other systems, and people are able to take advantage of it there.

Okay, so that's a little bit of a disappointing fact: more server hardware wasn't helping performance. A very obvious, maybe the simplest, way you might be able to harness these other servers is to build a system in which write requests all have to go through the leader, but reads don't. In the real world a huge number of workloads are read-heavy; there are many more reads than writes. When you look at web pages, it's all about reading data to produce the page, and there are relatively few writes, and that's true of a lot of systems. So maybe we'll send writes to the leader but send reads to just one of the replicas: pick one of the replicas, and if you have a read-only request, like a Get in lab 3, send it to one of the replicas and not to the leader. If we do that, we haven't helped writes much, although we've gotten a lot of read workload off the leader, so maybe that helps. But we've absolutely made tremendous progress with reads, because now the more servers we add, the more clients we can support: we're just splitting the read work across the different replicas. So the question is: if we have clients send reads directly to the replicas, are we going to be happy?

Yeah, "up to date" is the right word. In a Raft-like system, which ZooKeeper is, if a client sends a request to a random replica, sure, the replica has a copy of the log; it's been executing along with the leader, and for lab 3 it's got this key-value table. You do a Get of key x, it has some value for key x in its table, and it can reply to you. So, functionally, the replica has all the pieces it needs to respond to read requests from clients. The difficulty is that there's no reason to believe that any replica other than the leader is up to date. There are a bunch of reasons why replicas may not be up to date. One is that a replica may not have been in the majority that the leader was waiting for. Think about what Raft is doing: the leader is only obliged to wait for responses to its AppendEntries from a majority of the followers, and then it can commit the operation and go on to the next one. So if this replica wasn't in the majority, it may never have seen a write; maybe the network dropped it and the replica never got it. So the leader and a majority of the servers may have seen the first three requests, but this server only saw the first two; it's missing B, and a read of what should be there will simply get a stale value from this replica.
Even if the replica actually saw the new log entry, it might be missing the commit message. ZooKeeper's Zab is much the same as Raft in this respect: it first sends out a log entry, and then, when the leader gets a majority of positive replies, the leader sends out a notification saying, I'm committing that log entry. Our replica may not have gotten the commit. And the worst-case version of this (although it's equivalent to what I already said) is that, for all client 2 knows, this replica may be partitioned from the leader, or just absolutely not in contact with the leader at all; the follower doesn't really have a way of knowing that it was cut off from the leader a moment ago and just isn't receiving anything.

So, without some further cleverness, if we want to build a linearizable system, we can't play this game, attractive as it is for performance, of sending read requests to the replicas. And you shouldn't do it for lab 3 either, because lab 3 is also supposed to be linearizable. Any questions about why linearizability forbids us from having replicas serve clients? The proof is the simple write-1, write-2, read-1 example I put on the board earlier (I've lost it off the board now): a linearizable system is just not allowed to serve stale data.

Okay, so how does ZooKeeper deal with this? You can tell from Table 2 of the paper: ZooKeeper's read performance goes up dramatically as you add more servers. So clearly ZooKeeper is playing some game here, one that must be allowing it to serve read-only requests from the additional servers, the replicas. How does ZooKeeper make this safe?
and so let me talk about that so 51:22 first of all any questions about about 51:26 the basic problem zookeeper really does 51:28 allow client to send read-only requests 51:30 to any replica and the replica responds 51:33 out of its current state and that 51:35 replicate may be lagging it's log may 51:37 not have the very latest log entries and 51:39 so it may return stale data even though 51:42 there's a more recent committed value 51:46 okay so what are we left with 51:51 zookeeper does actually have some it 51:55 does have a set of consistency 51:57 guarantees so to help people who write 52:01 zookeeper based applications reason 52:02 about what their applications what's 52:04 actually going to happen when they run 52:05 them so 52:07 and these guarantees have to do with 52:09 ordering as indeed linearise ability 52:10 does so zookeeper does have two main 52:15 guarantees that they state and this is 52:17 section 2.3 one of them is it says that 52:22 rights rights or linearizable now you 52:33 know there are notion of linearizable 52:34 isn't not quite the same in mine maybe 52:37 because they're talking about rights no 52:40 beads what they really mean here is that 52:43 the system behaves as if even though 52:48 clients might submit rights concurrently 52:50 nevertheless the system behaves as if it 52:52 executes the rights one at a time in 52:55 some order and indeed obeys real-time 52:59 ordering of right so if one right has 53:01 seen to have completed before another 53:03 right has issued then do keeper will 53:05 indeed act as if it executed the second 53:07 right after the first right so it's 53:09 rights but not reads are linearizable 53:12 and zookeeper isn't a strict readwrite 53:17 system there are actually rights that 53:20 imply reads also and for those sort of 53:23 mixed rights those those you know any 53:26 any operation that modifies the state is 53:29 linearizable with respect to all other 53:31 operations that modify the state the 53:37 other guarantee of gives is that any 53:42 given client its operations executes in 53:47 the order specified by the client 53:49 they call that FIFO client order 53:56 and what this means is that if a 53:58 particular client issues a right and 54:00 then a read and then a read and a right 54:02 or whatever that first of all the rights 54:05 from that sequence fit in in the client 54:09 specified order in the overall order of 54:13 all clients rights so if a client says 54:15 do this right then that right and the 54:18 third right in the final order of rights 54:21 will see the clients rates occur in the 54:24 order of the client specified so for 54:26 rights this is our client specified 54:32 order and this is particularly you know 54:38 this is a issue with the system because 54:40 clients are allowed to launch 54:41 asynchronous right requests that is a 54:44 client can fire off a whole sequence of 54:46 rights to the leader to the zookeeper 54:49 leader without waiting for any of them 54:51 to complete and in order resume the 54:53 paper doesn't exactly say this but 54:55 presumably in order for the leader to 54:57 actually be able to execute the clients 54:59 rights in the client specified order 55:00 we're imagining I'm imagining that the 55:03 client actually stamps its write 55:04 requests with numbers and saying you 55:07 know I'll do this one first this one 55:08 second this one third and the zookeeper 55:11 leader obeys that ordering right so this 55:14 is particularly interesting due to these 55:15 asynchronous write 
For reads, this is a little more complicated. For the reasons I said before, reads don't go through the leader; the writes all go through the leader, but the reads just go to some replica, and so all a read sees is the stuff that happens to have made it into that replica's log. The way we're supposed to think about FIFO client order for reads is this: if a client issues a sequence of reads (the client reads one thing, then another thing, then a third thing), then, relative to the log on the replica the client is talking to, those reads each have to occur at some particular point in the log; each needs to observe the state as it existed at a particular point in the log. And furthermore, successive reads have to observe points that don't go backwards. That is, if a client issues one read and then another, and the first read executes at this point in the log, the second read is allowed to execute at the same or a later point in the log, but it is not allowed to see a previous state. If I issue one read and then another, the second read has to see a state that's at least as up to date as the first. That's a significant fact, and one we're going to harness when we reason about how to write correct ZooKeeper applications.

Where this gets especially interesting is failover. If the client has been talking to one replica for a while and has issued some reads (a read here, then a read there), and that replica fails, so the client needs to start sending its reads to another replica, this FIFO client order guarantee still holds across the switch. If, before the crash, the client did a read that saw the state as of this point in the log, then when the client switches to the new replica and issues another read, that read has to execute at this point or later, even though the client switched replicas.

The way this works is that each log entry is tagged by the leader with a zxid, which is basically just an entry number. Whenever a replica responds to a client read request, it executed that request at some particular point in the log, and it responds to the client with the zxid of the immediately preceding log entry. The client remembers the highest zxid it has ever seen, and when it sends a request to the same or a different replica, it accompanies the request with that highest zxid. That tells the other replica: aha, I need to respond to this request with data that's at least as recent as this point in the log. And that matters if the second replica is even less up to date; maybe it hasn't received any of these entries, but it receives a request from a client that says, the last read I did executed at this spot in the log on some other replica. Then this replica needs to wait until it has received the log up to that point before it's allowed to respond to the client.
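To make the zxid bookkeeping concrete, here's a sketch with invented types; this is not ZooKeeper's actual wire format. The client remembers the highest zxid it has seen and attaches it to every read, and a replica answers only once its own log has reached that point.

```go
// Invented types sketching the zxid bookkeeping. Writes would bump
// lastZxid the same way when their replies come back, which is what
// makes reading your own writes work, as discussed below.
type ReadReq struct {
	Path    string
	MinZxid int64 // highest zxid this client has ever seen
}

type ReadReply struct {
	Data []byte
	Zxid int64 // log position at which the read executed
}

type Client struct {
	replica  *Replica
	lastZxid int64
}

type Replica struct {
	lastApplied int64 // zxid of the last log entry applied locally
	state       map[string][]byte
	waitForLog  func() // blocks until more log arrives from the leader
}

func (c *Client) Read(path string) []byte {
	reply := c.replica.Read(ReadReq{Path: path, MinZxid: c.lastZxid})
	if reply.Zxid > c.lastZxid {
		c.lastZxid = reply.Zxid // reads never move backwards in the log
	}
	return reply.Data
}

func (r *Replica) Read(req ReadReq) ReadReply {
	// If this replica is lagging behind the point the client has already
	// seen, it must catch up (or reject the read) before answering.
	for r.lastApplied < req.MinZxid {
		r.waitForLog()
	}
	return ReadReply{Data: r.state[req.Path], Zxid: r.lastApplied}
}
```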
I'm not sure exactly how that last part works: either the replica just delays responding to the read, or maybe it rejects the read and says, look, I just don't have the information, talk to somebody else or talk to me later. Eventually this replica will catch up, if it's connected to the leader, and then it will be able to respond.

Okay, so reads are ordered: they only go forward in time, or rather only forward in log order. A further thing, which I believe is true about reads and writes together, is that FIFO client order applies to all of a single client's requests, reads and writes alike. So suppose I do a write from a client: I send the write to the leader, and it takes time before that write is sent out and committed; the leader may not have processed or committed it yet when I then send a read to a replica. The read may have to stall, in order to guarantee FIFO client order, until the replica has actually seen and executed that client's previous write operation. That's a consequence of FIFO client order: a client's reads and writes are in the same order. The most obvious way to see this is that if a client writes a particular piece of data (sends a write to the leader) and then immediately does a read of the same piece of data, sending that read to a replica, boy, it had better see its own written value. If I write something to have value 17, and then I do a read and it doesn't have value 17, that's just bizarre, and it's evidence that the system was not executing my requests in order, because in order it would have executed the write and only then the read. So there must be some machinery for the replica to stall: the client, when it sends the read, must say, look, the last write request I sent the leader had such-and-such a zxid, and this replica has to wait until it has seen that entry from the leader. Yes?

Oh, absolutely. I think what you're observing is that a read from a replica may not see the latest data. The leader may have sent out entry C to a majority of replicas and committed it, and the majority may have executed it; but if the replica we're talking to wasn't in that majority, it may not have the latest data, and that just is the way ZooKeeper works. It does not guarantee that reads see the latest data. There is a guarantee about read/write ordering, but it's only per client: if I send in a write and then I read that data, the system guarantees that my read observes my write. If you send in a write, and then I read the data that you wrote, the system does not guarantee that I see your write. And that's the foundation of how they get a speed-up for reads proportional to the number of replicas.
So I would say the system isn't linearizable; but it's not that it has no properties. The writes certainly are: all writes from all clients form some one-at-a-time sequence, so that's a sense in which all the writes are linearizable. And each individual client's operations: this probably means that each individual client's operations are linearizable too, though I'm not quite sure. I'm actually not sure how it works, but a reasonable supposition is that when I send in an asynchronous write, the system doesn't execute it yet, but it does reply to me right away, saying, yes, I got your write, and here's the zxid it will have if it's committed. That's a reasonable theory; I don't actually know how it does it. And then the client, when it does a read, needs to tell the replica: look, here's the last write I did.

If you send a read to a replica, the replica returns something to you. Notionally, what the client thinks it's doing is reading a row from a table: the client says, I want to read this row from this table, and the replica sends back its current value for that row, plus the zxid of the last operation that updated it. Actually, I'm not prepared to say exactly; there are two things that would make sense, and I think either of them would be okay. The replica could track, for every table row, the zxid of the last write operation that touched it; or it could just, for all read requests, return the zxid of the last committed operation in its log, regardless of whether that was the last operation to touch the row in question. Because all we need to do is make sure that a client's requests move forward in the order, we just need the replica to return something that's greater than or equal to the zxid of the write that last touched the data the client read.

Alright, so those are the guarantees. We're still left with the question of whether it's possible to do reasonable programming with this set of guarantees. The answer is: at a high level, this is not quite as good as linearizability; it's a little bit harder to reason about, and there are more gotchas, like reads returning stale data, which just can't happen in a linearizable system. But it's nevertheless good enough to make it pretty straightforward to reason about a lot of things you might want to do with ZooKeeper. So I'm going to try to construct an argument, maybe by example, for why this is not such a bad programming model.

One reason, by the way, is that there's an out: there's this operation called sync, which is essentially a write operation. Supposing I know that you (you being a different client) recently wrote something, and I want to read what you wrote, so I actually want fresh data. I can send in one of these sync operations, which makes its way through the system as if it were a write, eventually showing up in the logs of the replicas, or at least the replica that I'm talking to. And then I can come back and do a read, and I can tell the replica, basically: don't serve this read until you've seen my last sync. That actually falls out naturally from FIFO client order: if we count sync as a write, then FIFO client order says reads are required to see state at least as up to date as the last write from that client. So if I send in a sync and then do a read, the system is obliged to give me data that's at least as up to date as where my sync fell in the log order. Anyway: if I need to read up-to-date data, I send in a sync, then do a read, and the read is guaranteed to see data as of the time the sync was entered into the log; so, reasonably fresh.
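In the style of the Client sketch from earlier, the sync-then-read pattern might look like this; sync is ZooKeeper's real operation, but this signature and the zxid plumbing are my own guesses.

```go
// Sync-then-read, extending the Client sketch above. Assume Sync travels
// the write path through the leader and returns the zxid at which it
// landed in the replicated log (an invented signature; ZooKeeper's real
// sync call is asynchronous).
func (c *Client) FreshRead(path string) []byte {
	zxid := c.Sync() // enters the log like a write
	if zxid > c.lastZxid {
		c.lastZxid = zxid
	}
	// The replica won't answer until its log reaches lastZxid, so this
	// read sees state at least as fresh as the moment the sync was logged.
	return c.Read(path)
}
```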
65:01 So we're still left with the question of whether it's possible to do reasonable programming with this set of guarantees. And the answer is: at a high level this is not quite as good as linearizable. It's a little bit harder to reason about, and there are more gotchas; for example, reads can return stale data, which just can't happen in a linearizable system. But it's nevertheless good enough to make it pretty straightforward to reason about a lot of the things you might want to do with ZooKeeper. So I'm going to try to construct an argument, maybe by example, for why this is not such a bad programming model.

65:41 One reason, by the way, is that there's an out: there's this operation called sync, which is essentially a write operation. Suppose I know that you recently wrote something (you being a different client) and I want to read what you wrote; I actually want fresh data. I can send in one of these sync operations. The sync makes its way through the system as if it were a write, eventually showing up in the logs of the replicas, or at least the replica I'm talking to, and then I can come back 66:18 and do a read, and I can tell the replica, basically: don't serve this read until you've seen my last sync. That actually falls out naturally from FIFO client order. If we count the sync as a write, then FIFO client order says reads are required to see state at least as up to date as the last write from that client. So if I send in a sync and then do a read, the system is obliged to give me data at least as up to date as the point where my sync fell in the log order. 66:49 Anyway: if I need to read up-to-date data, I send in a sync, then do a read, and the read is guaranteed to see data as of the time the sync was entered into the log. So, reasonably fresh. That's one out, but it's an expensive one, because we've now converted a cheap read into a sync operation that burns up time on the leader; it's a no-no if you don't have to do it.

67:17 But here are a couple of example scenarios that the paper talks about where the reasoning is simplified, or at least reasonably simple, given these rules. First I want to talk about the trick in section 2.3 with the ready file. We assume there's some master maintaining a configuration in ZooKeeper: a bunch of files in ZooKeeper that describe something about our distributed system, like the IP addresses of all the workers, or who the master is. So we have the master updating this configuration, and maybe a bunch of readers that need to read the current configuration and need to see it every time it changes. 67:57 The question is whether we can construct something so that, even though the configuration is split across many files in ZooKeeper, we get the effect of an atomic update, so that workers looking at the configuration never see a partially updated configuration, only a completely updated one. That's a classic kind of configuration management that people use ZooKeeper for.

68:31 Copying what section 2.3 describes, here is the order in which the master does the writes that update the configuration. We assume there's a file named ready. If the ready file exists, we're allowed to read the configuration; if the ready file is missing, the configuration is being updated and we shouldn't look at it. So if the master is going to update the configuration, the very first thing it does is delete the ready file. Then it writes the various ZooKeeper files that hold the data for the configuration; that might be a lot of files. 69:17 And then, when it has completely updated all the files that make up the configuration, it creates the ready file again.
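A minimal sketch of that write sequence in Go, against a made-up ZooKeeper-like client interface (the ZK interface and the /config paths are invented for this illustration; real client libraries look different):

    package cfgmaster

    // ZK is a stand-in for a ZooKeeper-like client; invented for this sketch.
    type ZK interface {
        Delete(path string) error
        SetData(path string, data []byte) error
        Create(path string, data []byte) error
    }

    // updateConfig performs the section 2.3 write order: delete ready
    // first, rewrite the configuration files, and re-create ready only at
    // the very end. FIFO client order puts these writes into the
    // replicated log in exactly this order, so no replica's state ever
    // has ready present alongside a half-written configuration.
    func updateConfig(zk ZK, files map[string][]byte) error {
        if err := zk.Delete("/config/ready"); err != nil {
            return err
        }
        for path, data := range files { // write f1, f2, ...
            if err := zk.SetData(path, data); err != nil {
                return err
            }
        }
        return zk.Create("/config/ready", nil)
    }

The only thing that matters here is the order: ready disappears from the log before the first configuration write and reappears only after the last one.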
69:28 All right. So far the semantics are extremely straightforward. These are just writes, only writes, no reads, and writes are guaranteed to execute in a single total order. And now we appeal to 69:42 FIFO client order: the master issues these writes in this order, so the leader is obliged to enter them into the replicated log in that order, and the replicas will all dutifully execute them one at a time. They'll all delete the ready file, then apply this write and that write, and then create the ready file again. So for the writes, the order is straightforward.

70:05 For the reads, though, a little more thinking is required. Suppose we have some worker that needs to read the current configuration. We're going to assume this worker first checks whether the ready file exists; if it doesn't exist, the worker sleeps and tries again. So let's assume it does exist, and let's assume for the moment that the worker's exists check happens after the ready file has been re-created. Note that the master's operations are all write requests sent to the leader, while this exists is a read request sent to whichever replica the client happens to be talking to. 70:56 Then, if ready exists, the worker reads f1 and then f2 (this check-then-read protocol is sketched below).

71:07 The interesting thing that FIFO client order guarantees here is this: if the exists returned true, that is, if the replica the client was talking to said yes, that file exists, then, at least with this setup, that replica must actually have seen and executed the re-creation of the ready file. And because a client's successive read operations are required to march only forward in the log, never backward, every subsequent read from this client must observe the log at or after that create. 72:09 So if we saw ready exist, the replica executes the reads of f1 and f2 somewhere after the write that created ready, and that means the reads are guaranteed to observe the effects of the master's writes. We do actually get a real reasoning benefit here: even though the system is not fully linearizable, the writes are linearizable, and a client's reads move monotonically forward in the log.

72:49 Yeah, so that's a great question. The question is: in this scenario, where the create of ready entered the log and the read arrived at the replica after that replica executed the create, everything is straightforward; but there are other possibilities for how these operations could have been interleaved.
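For reference, here is the naive worker-side protocol just described, against the same invented interface (again a sketch, not real client code). This is the version that the troubling interleaving below breaks:

    package cfgread

    import "time"

    // ZK is the same invented ZooKeeper-like interface as before.
    type ZK interface {
        Exists(path string) (bool, error)
        GetData(path string) ([]byte, error)
    }

    // readConfig waits until ready exists, then reads the config files.
    // Nothing here notices a configuration update that begins after the
    // Exists check but before (or during) the reads of f1, f2, ...
    func readConfig(zk ZK, paths []string) (map[string][]byte, error) {
        for {
            ok, err := zk.Exists("/config/ready")
            if err != nil {
                return nil, err
            }
            if ok {
                break
            }
            time.Sleep(100 * time.Millisecond) // not ready: sleep and retry
        }
        out := map[string][]byte{}
        for _, p := range paths { // read f1, then f2, ...
            data, err := zk.GetData(p)
            if err != nil {
                return nil, err
            }
            out[p] = data
        }
        return out, nil
    }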
73:08 So let's look at a much more troubling scenario, the one you brought up, which I happen to be prepared to talk about. Way back in time, some previous master (or this same master) created the ready file after it finished updating the state. 73:46 The ready file existed for a while. Then some new master, or this same master, needs to change the configuration, so it deletes the ready file and then does its writes. What's really troubling is that the client that needs to read this configuration might have called exists, to check whether the ready file exists, back at that earlier time. 74:12 At that point in time, sure, the ready file exists. Then time passes, and the client issues its reads. Maybe the client reads the first file that makes up the configuration, and then it reads the second file, and maybe that second read comes entirely after the master has started changing the configuration. So now this reader has read a damaged mix: f1 from the old configuration and f2 from the new configuration. There's no reason to believe that combination contains anything other than broken information. 74:49 So the first scenario was fine, and this scenario is a disaster. Now we're starting to get into the kind of serious challenge that a carefully designed API for coordination between machines in a distributed system might actually help us solve. For lab 3, you're going to build a put/get system, and a simple lab-3-style put/get system would run into this problem too; it just does not have any tools to deal with it.

75:18 But the ZooKeeper API is actually more clever than this, and it can cope. The way you would actually use ZooKeeper is that when the client sends in the exists request, it doesn't just ask whether the file exists; it also sets a watch on that file. That means: if the file is ever deleted (or, if it doesn't exist yet, if it's ever created; in this case, if it's ever deleted), please send me a notification.

75:56 And furthermore, the notifications ZooKeeper sends are ordered carefully. Remember, the reader here is only talking to some replica; it's that replica doing all of these things for it. The replica guarantees to send the notification for a change to the ready file at the correct point relative to the responses to the client's reads. 76:32 The implication is this: if you ask for a watch on something and then issue a sequence of reads, and the replica you're talking to executes an operation that should trigger the watch during your sequence of reads, then the replica guarantees to deliver the watch notification before it responds to any read that saw the log after the point where the triggering operation executed. 77:15 So picture the log on the replica. FIFO client ordering says each client request must fit somewhere into that log, and the master's delete and rewrites fit in here. What we were worried about is the read of f2 landing in the log after the delete. But since we have a watch on the ready file, the delete of ready is going to generate a notification, and that notification is guaranteed to be delivered before the read result for f2 if that read of f2 was going to see the second write. 78:13 And that means that, before the reading client has finished the sequence in which it reads the configuration, it is guaranteed to see the watch notification before it sees the result of any write that happened after the delete that triggered the notification; so it knows to throw away what it has read and start over.
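Here is the earlier reader sketch reworked to use a watch. The channel-based notification delivery is invented for illustration; the point is the ordering guarantee just described: if a read's reply could reflect state after the delete of ready, the delete notification has already been delivered, so the client will find it waiting.

    package cfgwatch

    // Event and ZK are invented for this sketch; real client libraries
    // deliver watch notifications through their own mechanisms.
    type Event struct{ Path string }

    type ZK interface {
        // ExistsW answers the exists question and sets a one-shot watch
        // whose notification will arrive on the returned channel.
        ExistsW(path string) (bool, <-chan Event, error)
        GetData(path string) ([]byte, error)
    }

    // readConfig retries the whole sequence if the watch fires mid-read,
    // so it never returns a mix of old and new configuration files.
    func readConfig(zk ZK, paths []string) map[string][]byte {
    retry:
        for {
            ok, watch, err := zk.ExistsW("/config/ready")
            if err != nil || !ok {
                continue // ready missing: a real client would wait for the create notification
            }
            out := map[string][]byte{}
            for _, p := range paths {
                data, err := zk.GetData(p)
                if err != nil {
                    continue retry
                }
                // Ordering guarantee: if this reply reflects the log after
                // the delete of ready, the delete notification was
                // delivered first, so it is already in the channel.
                select {
                case <-watch:
                    continue retry // configuration changed under us: start over
                default:
                }
                out[p] = data
            }
            return out
        }
    }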
78:39 Who generates the notification? The replica. Say the client is talking to this replica and sends in the exists request; exists is a read-only request, so it goes to the replica. The replica has been maintaining, off to the side, a table of watches, recording that such-and-such a client asked for a watch on this file. 79:01 And furthermore, the watch is established at a particular zxid. That is, the client did a read, the replica executed that read at a particular point in the log and returned results relative to that point, so the watch is also relative to that point in the log. Then, for every operation the replica executes, it looks in this little table (maybe it's indexed by a hash of the file name or something), and if a delete comes in, it says: aha, there was a watch on that file.

79:37 Okay, so the question is: this replica has to have a watch table, so if the replica crashes and the client fails over to a different replica, what about the watch table? The client already established these watches. The answer is that, no, the new replica you switch to won't have the watch table. But the client gets a notification, at the appropriate point in the stream of responses coming back to it, saying: oops, the replica you were talking to crashed. The client then knows it has to completely re-establish everything. 80:16 So tucked away in the examples are event handlers that say, oh gosh, we need to go back and re-establish everything, because we got a notification that our replica crashed. All right, I'll continue.
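To make the replica-side bookkeeping concrete, here is a rough sketch of such a watch table in Go. All the names are invented, and a real ZooKeeper server keeps richer state; the point is just the shape of the data structure.

    package watchtable

    type Zxid int64

    // A Watcher is the channel a notification is delivered on; assume the
    // serving code gives each client a buffered channel so delivery here
    // does not block.
    type Watcher chan string

    type watch struct {
        w    Watcher
        zxid Zxid // the log point the establishing read was served at
    }

    // WatchTable lives on one replica only; if that replica crashes, the
    // table dies with it, and the client must re-establish its watches.
    type WatchTable struct {
        byPath map[string][]watch // indexed by file name
    }

    func New() *WatchTable {
        return &WatchTable{byPath: map[string][]watch{}}
    }

    // AddWatch is called while serving an exists/read that asked for a
    // watch, at the zxid that read was served at.
    func (t *WatchTable) AddWatch(path string, w Watcher, z Zxid) {
        t.byPath[path] = append(t.byPath[path], watch{w, z})
    }

    // Apply is called for every committed write the replica executes,
    // before it responds to any later reads; that ordering is what puts
    // the notification ahead of any read that saw the post-write log.
    func (t *WatchTable) Apply(path string) {
        for _, e := range t.byPath[path] {
            e.w <- path // deliver the one-shot notification
        }
        delete(t.byPath, path) // watches fire at most once
    }

Because this table is ordinary in-memory state on a single replica, it is exactly the thing that vanishes in the crash case above, which is why the client library has to re-register its watches after failing over.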