Transcript

All right, today I want to talk a bit more about fault tolerance and replication, and then look into the details of today's paper about VMware FT.

The topic is still fault tolerance to provide high availability: you want to build a service that keeps working even if some computer involved in the service crashes, and, to the extent we can, we'd like to provide the service even if there are network problems. The tool we're using, at least for this part of the course, is replication. So it's worth asking what kinds of failures replication can be expected to deal with, because it's not everything by any means.

Maybe the easiest way to characterize the kind of failures we're talking about is fail-stop failures of a single computer. What I mean by fail-stop — it's a sort of generic term in fault tolerance — is that if something goes wrong, the computer simply stops executing. It just stops if anything goes wrong, and in particular it doesn't compute incorrect results. So if somebody kicks the power cable out of your server, that's probably going to generate a fail-stop failure. Similarly if they unplug your server's network connection: even though the server is still running — this is a little bit funny — it's totally cut off from the network, so from the outside it looks like it just stopped. It's really these failures we can deal with using replication. This also covers some hardware problems: maybe the fan on your server breaks because it cost 50 cents, that causes the CPU to overheat, and the CPU shuts itself down cleanly and just stops executing.

What's not covered by the kind of replication systems we're talking about is things like bugs in software or design defects in hardware. Basically, not bugs: if we take some service — say you're a MapReduce master, for example — and we replicate it and run it on two computers, then if there's a bug in your MapReduce master, replication is not going to help us. We're going to compute the same incorrect result on both copies of the MapReduce master, and everything will look fine — they'll agree — it just happens to be the wrong answer. So we can't defend against bugs in the replicated software, and we can't defend against bugs in whatever scheme we're using to manage the replication. Similarly, as I mentioned before, we can't expect to deal with bugs in the hardware: if the hardware computes incorrectly, that's just the end for us, at least with this kind of technique.

That said, there are definitely hardware and software bugs that replication might, if you're lucky, be able to cope with. If there's some unrelated piece of software running on your server and it causes the server to crash — maybe it causes the kernel to panic and reboot, something that has nothing to do with the service you're replicating — then that kind of failure, for your service, may well look fail-stop: the kernel will panic, and the backup replica will take over. Similarly, some kinds of hardware errors can be turned into fail-stop errors. For example, if you send a packet over the network and the network corrupts it — just flips a bit in your packet — that will almost certainly be caught by the checksum on the packet. Same thing for a disk block: if you write some data to disk and read it back a month later, maybe the magnetic surface isn't perfect and a couple of bits in the block are wrong; disks actually have error-correcting codes that, up to a point, will fix errors in disk blocks. So you're turning random hardware errors into — either correcting them, if you're lucky, or at least detecting them — turning random corruption into a detected fault. The software then knows that something went wrong and can turn it into a fail-stop fault by stopping executing, or take some other remedial action. But in general, we can really only expect to handle fail-stop faults.
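As a concrete illustration of turning corruption into a detected fault, here's a minimal sketch using a CRC32 checksum over a message payload. The message layout and the choice of CRC32 are mine, just to show the pattern:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// A message as sent on the wire: payload plus a checksum computed by the sender.
type Message struct {
	Payload  []byte
	Checksum uint32
}

func send(payload []byte) Message {
	return Message{Payload: payload, Checksum: crc32.ChecksumIEEE(payload)}
}

// receive verifies the checksum, so corruption becomes a detected fault
// rather than silently wrong data.
func receive(m Message) ([]byte, error) {
	if crc32.ChecksumIEEE(m.Payload) != m.Checksum {
		return nil, fmt.Errorf("checksum mismatch: corrupt message")
	}
	return m.Payload, nil
}

func main() {
	m := send([]byte("hello"))
	m.Payload[0] ^= 1 // simulate the network flipping a bit
	if _, err := receive(m); err != nil {
		// The receiver can now treat this as fail-stop: drop the packet
		// (and let retransmission handle it) or stop and report the fault.
		fmt.Println(err)
	}
}
```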
There are other limits to replication, too. If we have a primary and a backup as our two replicas, or whatever the arrangement is, we're really assuming that failures in the two are independent. If they tend to have correlated failures, then replication is not going to help us. For example, if we're a big outfit and we buy computers in batches of thousands — identical computers from the same manufacturer — and we run our replicas on computers we bought at the same time from the same place, that's a bit of a risk, because presumably if one of them has a manufacturing defect, there's a good chance the others do too. If one of them is prone to overheating because the manufacturer didn't provide enough airflow, they probably all have that problem, and if one of them overheats and dies, there's a good chance the others will too. So that's one kind of correlated failure you just have to be careful of. Another one: if there's an earthquake in the city where our data center is, it's probably going to take out the whole data center. We can have all the replication we like inside that data center; it's not going to help us, because a failure caused by an earthquake, or a citywide power failure, or the building burning down is a correlated failure between our replicas if they're in that building. So if we care about dealing with earthquakes, then we need to put our replicas in different cities, or at least physically separate enough that they have separate power and are unlikely to be affected by the same natural disaster.

Okay, but that's all sort of hovering in the background for this discussion, where we're talking about the technology you might use. Another question about replication is whether it's worthwhile. You may ask yourself: gosh, these replication schemes literally use twice as much, or three times as much, computer resources.
GFS had three copies of everything, so we have to buy three times as much disk space. The paper for today replicates just once, but that means we have twice as many computers, CPUs, and RAM — it's all very expensive. Is it really worth that expense? That's not something we can answer technically; it's an economic question. It depends on the value of having an available service. If you're running a bank, and the consequence of the computer failing is that you can't serve your customers, you can't generate revenue, and your customers all hate you, then it may well be worth it to blow an extra ten or twenty thousand bucks on a second computer so you can have a replica. On the other hand, if you're me and you're running the 6.824 web server, I don't consider it worthwhile to have a hot backup of the 6.824 web server, because the consequences of failure are very low. So whether replication is worthwhile, how many replicas you ought to have, and how much you're willing to spend on it is all about how much cost and inconvenience failure would cause you.

All right. This paper, near the beginning, mentions that there are a couple of different approaches to replication — really it mentions two: one it calls state transfer, and the other it calls replicated state machine. Most of the schemes we're going to talk about in this class are replicated state machines, but I'll talk about both.

The idea behind state transfer is that if we have two replicas of a server, the way you cause them to stay in sync — that is, to be actual replicas, so that the backup has everything it needs to take over if the primary fails — is that the primary sends a copy of its entire state, for example the contents of its RAM, to the backup, and the backup just stores the latest state. It's all there, so if the primary fails, the backup can start executing with the last state it got. This is all about sending the state of the primary. If today's paper worked as a state transfer system — which it doesn't — then the state we'd be talking about would be the contents of the RAM, the memory, of the primary. Maybe every once in a while the primary would make a big copy of its memory and send it across the network to the backup. You can imagine that if you wanted to be efficient, maybe you would only send the parts of the memory that have changed since the last time you sent memory to the backup.

The replicated state machine approach observes that most services — most computer things we want to replicate — have some internal operation that's deterministic except when external input comes in. Ordinarily, if there are no external influences on a computer, it just executes one instruction after another, and what each instruction does is a deterministic function of what's in the memory and the registers of the computer.
It's only when external events intervene that something unexpected may happen — like a packet arriving at some random time, which causes the server to start doing something differently. So replicated state machine schemes don't send the state between the replicas; instead they just send those external events — from the primary to the backup, say — things like arriving input from the outside world that the backup needs to know about. The observation is that if you have two computers, and they start from the same state, and they see the same inputs in the same order and at the same times, then the two computers will continue to be replicas of each other and execute identically, as long as they both see the same inputs at the same time. So state transfer ships, probably, memory, and a replicated state machine ships, from primary to backup, just the operations from clients — the external inputs or external events.

The reason people tend to favor replicated state machines is that usually operations are smaller than the state. The state of a server — if it's a database server, the state might be the entire database, might be gigabytes — whereas the operations are just some client sending "please read or write key 27." Operations are usually small and the state is usually large, so a replicated state machine usually looks attractive. The slight downside is that these schemes tend to be quite a bit more complicated, and rely on more assumptions about how the computers operate, whereas state transfer is really heavy-handed — "I'm just going to send you my whole state" — and there's sort of nothing to worry about.

Any questions about these strategies? Okay, so the question is: suppose something went wrong with our scheme and the backup was not actually identical to the primary. Suppose we were running a GFS master, and the primary just handed out a lease for some chunk to chunk server 1, but because we've allowed the states of the primary and backup to drift out of sync, the backup did not issue a lease to anybody — it wasn't even aware anybody had asked for one. So now the primary thinks chunk server 1 has a lease for some chunk and the backup doesn't. The primary fails and the backup takes over. Now chunk server 1 thinks it has a lease for some chunk, but the current master doesn't think so, and is happy to hand out the lease to some other chunk server. Now we have two chunk servers that both think they hold the lease for the same chunk. That's just a close-to-home example, but really, I think you can construct almost any bad scenario by just imagining some service that computes the wrong answer because the states diverged.
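To make the contrast concrete, here's a minimal sketch of the two approaches for a trivial counter service — entirely my own illustration, not from the paper: state transfer ships the whole state, while a replicated state machine ships only the small, deterministic operations:

```go
package main

import "fmt"

// The service's state; imagine this being gigabytes in a real database.
type State struct{ Counter int }

// A deterministic operation: applying the same op to the same state
// always gives the same result.
type Op struct{ Delta int }

func (s *State) Apply(op Op) { s.Counter += op.Delta }

func main() {
	primary, backup := &State{}, &State{}

	// State transfer: periodically copy the entire state to the backup.
	primary.Apply(Op{Delta: 1})
	*backup = *primary // ships the whole state, however large it is

	// Replicated state machine: ship only the (usually small) operations;
	// the backup stays in sync by applying them in the same order.
	ops := []Op{{Delta: 2}, {Delta: 3}}
	for _, op := range ops {
		primary.Apply(op)
		backup.Apply(op) // in a real system this crosses the network
	}

	fmt.Println(primary.Counter, backup.Counter) // 6 6: still replicas
}
```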
A student asks about randomization. I'll talk about this a bit later on, but it's a good point: the replicated state machine scheme definitely makes the most sense when the instructions the primary and the backup are executing do the same thing, as long as there are no external events. And that's almost true. For an add instruction, if the registers and memory start out the same and they both execute the add, they get the same inputs and the same outputs. But there are some instructions, as you point out, that don't behave that way: maybe there's an instruction that gets the current time of day, and the two machines will probably execute it at slightly different times; or an instruction that gets the current processor's unique ID or serial number, which is going to yield different answers. The uniform answer to questions that sound like this is that the primary executes the instruction and sends the answer to the backup, and the backup does not execute that instruction; instead, at the point where it would execute it, it listens for the primary to tell it what the right answer was, and just fakes that answer to the software. I'll talk about how the VMware scheme does that.

Interestingly enough, though today's paper is all about a replicated state machine, you may have noticed that it only deals with uniprocessors, and it's not clear how it could be extended to a multi-core machine, where the interleavings of the instructions from the two cores are non-deterministic. On a multi-core machine we no longer have the situation where, if we just let the primary and backup execute, all else being equal, they're going to stay the same — because they won't, if they execute on multiple cores. VMware has since come out with a new, possibly completely different, replication system that does work on multi-core, and the new system appears to me to be using state transfer instead of a replicated state machine, because state transfer is more robust in the face of multi-core and parallelism: if you pause the machine and send the memory over, the memory image just is the state of the machine, and it doesn't matter that there was parallelism, whereas the replicated state machine scheme really has a problem with parallelism. On the other hand, I'm guessing that this new multi-core scheme is more expensive.

All right, so if we want to build a replicated state machine scheme, we have a number of questions to answer. We need to decide at what level we're going to replicate state — what do we mean by state? We have to worry about how closely synchronized the primary and backup have to be, because the primary is likely to execute a little bit ahead of the backup; after all, it's the primary that sees the inputs, so the backup almost necessarily must lag. That means there's an opportunity, if the primary fails, for the backup not to be fully caught up. Having the backup actually execute in lockstep with the primary is very expensive, because it requires a lot of chitchat, so a lot of what designers sweat about is how close the synchronization is.
If the primary fails — or actually if the backup fails too, but it's more exciting if the primary fails — there has to be some scheme for switching over, and the clients have to know: gosh, instead of talking to the old primary on server 1, I should now be talking to the backup on server 2. All the clients have to somehow figure this out. As for the switchover, it's almost certainly impossible — maybe truly impossible — to design a cutover system in which no anomalies are ever visible. In an ideal world, if the primary fails, we'd like nobody to ever notice — none of the clients to notice — and it turns out that's basically unattainable. So there are going to be anomalies during the cutover, and we've got to figure out a way to cope with them. And finally, if one of our two replicas fails, we really need to create a new replica: if we have two replicas and one fails, we're just living on borrowed time, because the second replica may fail at some point. So we absolutely need to get a new replica back online as fast as possible, and that can be very expensive. The state is big; the reason we liked the replicated state machine was that we thought state transfer would be expensive, but the two replicas in a replicated state machine still need to have the full state — we just had a cheap way of keeping them both in sync. If we need to create a new replica, we actually have no choice but state transfer: the new replica needs a complete copy of the state. So it's going to be expensive to create new replicas, and people spend a lot of time worrying about all these questions; we'll see them again as we look at other replicated state machine schemes.

On the topic of what state to replicate, today's paper has a very interesting answer: it replicates the full state of the machine — that is, all of memory and all the machine registers. It's a very, very detailed replication scheme: no difference, even at the lowest levels, between the primary and the backup. That's quite rare for replication schemes. Almost always you see something more like GFS. GFS absolutely did have replication, but it wasn't replicating every single bit of memory between the primaries and the backups; it was replicating a much more application-level table of chunks. It had this abstraction of chunks and chunk identifiers, and that's what it was replicating; it wasn't going to the expense of replicating every single other thing the machines were doing, as long as they had the same application-visible set of chunks. So most replication schemes out there go the GFS route. In fact, almost everything — pretty much everything except this paper and a handful of similar systems — uses application-level replication at some level, because it can be much more efficient.
For example, we don't have to go to the trouble of making sure that interrupts occur at exactly the same point in the execution of the primary and backup. GFS does not sweat that at all, but this paper has to, because it replicates at such a low level. So most people build efficient systems with application-specific replication. The consequence, though, is that the replication has to be built right into the application. If you're getting a feed of application-level operations, you really need the application to participate, because some generic replication layer like today's paper can't understand the semantics of what needs to be replicated. So most schemes are application-specific, like GFS and every other paper we're going to read on this topic. Today's paper is unique in that it replicates at the level of the machine, and therefore does not care what software you run on it. It replicates the low-level memory and machine registers; you can run any software you like on it, as long as it runs on the kind of microprocessor being represented. The software can be anything. The downside is that it's not necessarily that efficient. The upside is that you can take any existing piece of software — maybe you don't even have source code for it, or understand how it works — and, within some limits, you can just run it under this replication scheme, under VMware, and it'll just work. It's sort of a magic fault-tolerance wand for arbitrary software.

All right, now let me talk about how VMware FT works. First of all, VMware is a virtual machine company; a lot of their business is selling virtual machine technology. What virtual machines refer to is the idea that you buy a single computer and, instead of booting an operating system like Linux directly on the hardware, you boot what we'll call a virtual machine monitor, or hypervisor, on the hardware. The hypervisor's job is to simulate multiple virtual computers on this one piece of hardware. The virtual machine monitor may boot up one instance of Linux, maybe multiple instances of Linux, maybe a Windows machine — the virtual machine monitor on this one computer can run a bunch of different operating systems, each of them being itself some sort of operating system kernel plus applications. This is the technology they're starting with. The reason for this is that it turns out there are many, many reasons why it's very convenient to interpose this level of indirection between the hardware and the operating systems. It means we can buy one computer and run lots of different operating systems on it; if we run lots and lots of little services, instead of having to have lots and lots of computers, one per service, we can just buy one computer and run each service in the operating system that it needs, in its own virtual machine.
So this was their starting point: they already had this stuff, and a lot of sophisticated things built around it, at the start of designing VMware FT. That's just virtual machines. What the paper is doing requires two physical machines, because there's no point in running the primary and backup software in different virtual machines on the same physical machine — we're trying to guard against hardware failures. So you have two machines running their virtual machine monitors; the primary is going to run on one, and the backup on the other. On one of these machines we have a guest — it might be running a lot of virtual machines, but we only care about one of them — running some guest operating system and some sort of server application, maybe a database server, a MapReduce master, or something. I'll call that the primary. And there'll be a second machine that runs the same virtual machine monitor and an identical virtual machine holding the backup: the same operating system, exactly the same. The virtual machine monitors give these guest operating systems, the primary and the backup, each a range of memory, and the memory images will be identical — or the goal is to make them identical — on the primary and the backup. So we have two physical machines, each of them running a virtual machine guest with its own copy of the service we care about. We're assuming there's a network connecting these two machines, and in addition, on this local area network there's some set of clients. Really, they don't have to be clients; they're just other computers that our replicated service needs to talk with; some of them are clients sending requests. It turns out in this paper that the replicated service actually doesn't use a local disk; instead, it assumes there's some sort of disk server that it talks to, although that's a little bit hard to realize from the paper. The scheme doesn't really treat the disk server specially — it's just another external source of packets, and a place the replicated state machine may send packets to, not very much different from the clients.

Okay, so the basic scheme is this. We assume these two replicas, the two virtual machines, primary and backup, are exact replicas. Some client — a database client, whatever client our replicated server has — sends a request to the primary, and that really takes the form of a network packet. That packet generates an interrupt, and this interrupt actually goes to the virtual machine monitor, at least in the first instance. The virtual machine monitor sees: aha, here's an input for this replicated service. So the virtual machine monitor does two things. One is that it simulates a network-packet-arrival interrupt into the primary guest operating system, to deliver the packet to the primary copy of the application. The other is that the virtual machine monitor knows this is an input to a replicated virtual machine, so it sends a copy of that packet back out on the network to the backup's virtual machine monitor.
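Here's that input path as a sketch, with made-up types standing in for the VMM internals — the paper doesn't give an API; this just shows the deliver-locally-and-forward pattern:

```go
package main

import "fmt"

// Packet is a client request as seen by the primary's VMM.
type Packet struct{ Data string }

// deliverToGuest stands in for "deliver a simulated NIC interrupt into
// the guest OS".
func deliverToGuest(who string, p Packet) {
	fmt.Printf("%s guest: interrupt, packet %q\n", who, p.Data)
}

// primaryVMM handles an arriving client packet: deliver it to the local
// guest, and forward a copy over the logging channel to the backup.
func primaryVMM(p Packet, loggingChannel chan<- Packet) {
	deliverToGuest("primary", p)
	loggingChannel <- p // the backup needs to see the same input
}

// backupVMM consumes the logging channel and fakes the same packet
// arrival into its own guest.
func backupVMM(loggingChannel <-chan Packet, done chan<- bool) {
	for p := range loggingChannel {
		deliverToGuest("backup", p)
	}
	done <- true
}

func main() {
	log := make(chan Packet, 16)
	done := make(chan bool)
	go backupVMM(log, done)
	primaryVMM(Packet{Data: "increment"}, log)
	close(log)
	<-done
}
```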
The backup's virtual machine monitor knows: aha, this is a packet for this particular replicated state machine, and it also fakes a network-packet-arrival interrupt at the backup and delivers the packet. So now both the primary and the backup have a copy of the packet; they see the same input, and — with a lot of details to come — they're going to process it in the same way and stay synchronized. Of course, the service is probably going to reply to the client. On the primary, the service will generate a reply packet and send it on the NIC that the virtual machine monitor is emulating; the virtual machine monitor will see that output packet on the primary and will actually send the reply back out on the network to the client. Because the backup is running exactly the same sequence of instructions, it also generates a reply packet to the client and sends it on its emulated NIC. It's the virtual machine monitor that's emulating that network interface card, and it says: aha, I know this is the backup, and only the primary is allowed to generate output — and the backup's virtual machine monitor drops the reply packet. So both of them see inputs, and only the primary generates outputs.

As far as terminology goes, the paper calls this stream of input events — and the other events we'll talk about — the logging channel. It presumably all goes over the same network, but these events the primary sends the backup are called log events, on the logging channel.

Where the fault tolerance comes in is this: if the primary crashes, what the backup is going to see is that it stops getting log entries on the logging channel. It turns out the backup can expect many log entries per second, because one of the things that generates log entries is the periodic timer interrupts in the primary — every timer interrupt generates a log entry sent to the backup, and these timer interrupts happen something like 100 times a second. So the backup can certainly expect to see a lot of chitchat on the logging channel if the primary is up. If the primary crashes, the virtual machine monitor on the backup will say: gosh, I haven't received anything on the logging channel for a second, or however long — the primary must be dead, or something. In that case, when the backup stops seeing log entries from the primary, the way the paper phrases it is that the backup "goes live." What that means is that it stops waiting for input events on the logging channel from the primary, and instead the virtual machine monitor just lets the backup execute freely, without being driven by the input events from the primary.
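The detection here amounts to a timeout on the logging channel. A sketch — the one-second threshold is my own illustrative number:

```go
package main

import (
	"fmt"
	"time"
)

// LogEvent is one entry on the logging channel (input packet,
// interrupt, etc.).
type LogEvent struct{}

// backupMonitor waits for log events; if none arrive for too long, it
// declares the primary dead and goes live.
func backupMonitor(loggingChannel <-chan LogEvent) {
	const deadline = 1 * time.Second // illustrative threshold
	for {
		select {
		case ev := <-loggingChannel:
			_ = ev // replay the event into the backup guest
		case <-time.After(deadline):
			fmt.Println("no log entries from primary; going live")
			return // stop replaying; let the backup execute freely
		}
	}
}

func main() {
	log := make(chan LogEvent)
	go func() {
		// Normally timer interrupts keep this channel busy ~100x/second,
		// so silence is strong evidence the primary is dead.
		log <- LogEvent{}
	}()
	backupMonitor(log)
}
```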
The VMM also does something to the network to cause future client requests to go to the backup instead of the primary, and the backup's VMM stops discarding the output from its virtual machine — it's the primary now, not the backup. So now this virtual machine directly gets the inputs and is allowed to produce output, and our backup has taken over. Similarly — this is less interesting, but it has to work correctly — if the backup fails, the primary has to use a similar process to abandon the backup: stop sending it events, and act much more like a single non-replicated server. So either one of them can go live if the other one appears to be dead — that is, stops generating network traffic.

A student asks how clients end up talking to the backup — is it magic? It depends on what the networking technology is. I think, with this paper, one possibility is that this is sitting on Ethernet: every physical computer — really, every NIC — has a 48-bit unique ID. Now, I'm making this up, but it could be that, in fact, instead of each physical computer having a unique ID, each virtual machine does, and when the backup takes over, it essentially claims the primary's Ethernet ID as its own. It starts saying "I'm the owner of that ID," and then other machines on the Ethernet will start sending it the packets. That's my interpretation.

On the question of non-deterministic operations: the designers believed they had identified all such sources, and for each one of them, the primary does whatever it is — executes the random number generator instruction, or takes an interrupt at some time — and the backup does not. The backup's virtual machine monitor detects any such instruction, intercepts it, and doesn't execute it; instead, the backup waits for an event on the logging channel saying: at this instruction number, the random number (or whatever it was) was this value on the primary.

And yes, in answer to a question: the paper hints that they got Intel to add features to the microprocessor to support exactly this, but they don't say what it was.
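Here's that interception pattern as a sketch, with a hypothetical get-time-of-day standing in for whatever non-deterministic instruction the guest executes — the names and plumbing are mine, not the paper's:

```go
package main

import (
	"fmt"
	"time"
)

// InstrResult is the outcome of one non-deterministic instruction, as
// logged by the primary.
type InstrResult struct{ Value int64 }

// On the primary: actually execute the instruction, then send the
// result to the backup over the logging channel.
func primaryGetTimeOfDay(loggingChannel chan<- InstrResult) int64 {
	v := time.Now().UnixNano() // genuinely non-deterministic
	loggingChannel <- InstrResult{Value: v}
	return v
}

// On the backup: don't execute the instruction at all; block until the
// primary's result arrives, and fake that answer to the guest software.
func backupGetTimeOfDay(loggingChannel <-chan InstrResult) int64 {
	r := <-loggingChannel
	return r.Value
}

func main() {
	log := make(chan InstrResult, 1)
	p := primaryGetTimeOfDay(log)
	b := backupGetTimeOfDay(log)
	fmt.Println(p == b) // true: both guests observed the same "time"
}
```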
Okay, so on that topic: so far the story has sort of assumed that as long as the backup sees the packets from the clients, it'll execute identically to the primary, and that's actually glossing over some huge and important details. One problem, as a couple of people have mentioned, is that some things are non-deterministic. It's not the case that every single thing that happens in the computer is a deterministic function of the contents of the computer's memory. It is for straight-line code execution, often, but certainly not always. What we're worried about is things that may happen that are not a strict function of the current state — things that might be different, if we're not careful, on the primary and the backup. These are the non-deterministic events. The designers had to sit down and figure out what they all were, and here's the kind of thing they talk about.

One is inputs from external sources, like clients, which arrive just whenever they arrive. They're not predictable; there's no sense in which the time at which a client request arrives, or its content, is a deterministic function of the service's state, because it isn't. This system is really dedicated to a world in which services only talk over the network, so basically the only form of input or output supported by this system seems to be network packets coming and going. So "an input arrives" really means a packet arrives, and what a packet really consists of, for us, is the data in the packet plus the interrupt that signals that the packet has arrived. That's quite important. When a packet arrives, ordinarily the NIC DMAs the packet contents into memory and then raises an interrupt, which the operating system fields; the interrupt happens at some point in the instruction stream. Both of those have to look identical on the primary and the backup, or else their executions are going to diverge. So the real issue is when the interrupt occurs — exactly at which instruction the interrupt happens — and it had better be the same on the primary and the backup; otherwise their execution is different and their states are going to diverge. So we care about the content of the packet and the timing of the interrupt.

Then, as a couple of people have mentioned, there are a few instructions that behave differently on different computers, or differently depending on circumstances: maybe a random number generator instruction, get-time-of-day instructions that yield different answers if called at different times, and unique-ID instructions.

Another huge source of non-determinism, which the paper basically rules out, is multi-core parallelism. This is a uniprocessor-only system; there's no multi-core in this world. The reason is that if it allowed multi-core, the service would be running on multiple cores, and the instructions of the service on the different cores are interleaved in some way that's not predictable. If we run the same parallel code on the backup, the hardware will interleave the instructions on the two cores in different ways, and that can just cause different results. Suppose the code on the two cores both ask for a lock on some data: on the primary, core 1 may get the lock before core 2, while on the backup, just because of a tiny timing difference, core 2 may get the lock first — and the execution results are likely to be totally different if different threads get the lock. So multi-core is a grim source of non-determinism and is just totally outlawed in this paper's world; indeed, as far as I can tell, the techniques are not really applicable to it. The service can't use multi-core parallelism.
The hardware is almost certainly multi-core parallel, but that's the hardware sitting underneath the virtual machine monitor. The machine that the virtual machine monitor exposes to the guest operating systems running the primary and backup — that emulated virtual machine — is a uniprocessor machine in this paper, and I'm guessing there's not an easy way for them to adapt this design to multi-core virtual machines.

Okay, so it's these events that go over the logging channel. As for the format of a log entry, they don't quite say, but I'm guessing there are really three things in a log entry. First, there's the instruction number at which the event occurred, because if you're delivering an interrupt or an input or whatever, it had better be delivered at exactly the same place on the primary and the backup, so we need to know the instruction number. By instruction number I mean the number of instructions executed since the machine booted — not the instruction address, but something like "we're executing the four-billion-and-79th instruction since boot." So a log entry is going to have an instruction number: for an interrupt or an input, it's the instruction at which the interrupt was delivered on the primary; for a weird instruction like get-time-of-day, it's the instruction number of the get-time-of-day (or whatever) instruction that was executed on the primary, so the backup knows where to cause this event to occur. Second, there's going to be a type — network input, or a weird instruction, or whatever. And third, there's going to be data: for a packet arrival, it's the packet data; for one of these weird instructions, it's the result of the instruction when it was executed on the primary, so that the backup virtual machine can fake the instruction and supply that same result.
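They don't spell this out, so here's the guessed three-field log entry as a struct — my reconstruction; the field names are mine, and the numeric example echoes the four-billion-and-79th instruction above:

```go
package main

import "fmt"

type EventType int

const (
	NetworkInput     EventType = iota // an arriving packet
	WeirdInstruction                  // e.g. get-time-of-day, random number
)

// LogEntry is one event on the logging channel. This layout is a guess
// based on the lecture's description, not something the paper spells out.
type LogEntry struct {
	// How many instructions the guest has executed since boot, so the
	// backup can deliver the event at exactly the same point in its own
	// instruction stream.
	InstructionNumber uint64
	Type              EventType
	// Packet contents for an input, or the primary's result for a
	// non-deterministic instruction.
	Data []byte
}

func main() {
	e := LogEntry{
		InstructionNumber: 4000000079,
		Type:              NetworkInput,
		Data:              []byte("client request"),
	}
	fmt.Printf("%+v\n", e)
}
```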
As an example: both of these guest operating systems require that the hardware — in this case the emulated hardware, the virtual machine — has a timer that ticks, say, a hundred times a second and causes interrupts to the operating system. That's how the operating system keeps track of time: by counting these timer interrupts. Notice why they have to happen at exactly the same place on the primary and the backup: otherwise the two don't execute the same, and they diverge. What really happens is that there's a timer on the physical machine that's running the FT virtual machine monitor. The physical machine's timer ticks and delivers a timer interrupt to the virtual machine monitor on the primary. The virtual machine monitor, at the appropriate moment, stops the execution of the primary, writes down the instruction number it was at — instructions since boot — and then delivers a fake, simulated interrupt into the guest operating system on the primary at that instruction number, saying: the timer hardware you're emulating just ticked; here's the interrupt. Then the primary's virtual machine monitor sends the instruction number at which the interrupt happened to the backup. The backup's virtual machine monitor is, of course, also taking timer interrupts from its own physical timer, but it's not giving its real physical timer interrupts to the backup operating system; it just ignores them. When the log entry for the primary's timer interrupt arrives, the backup's virtual machine monitor arranges with the CPU — and this requires special CPU support — to cause the physical machine to interrupt at the same instruction number at which the timer interrupt happened on the primary. At that point the virtual machine monitor gets control back from the guest and fakes the timer interrupt into the backup operating system at exactly the same instruction number as it occurred on the primary.

So the observation is that this relies on the CPU having some special hardware in it, where the VMM can tell the CPU "please interrupt a thousand instructions from now," so that it will interrupt at the right instruction number — the same instruction as the primary did. The VMM just tells the CPU to resume executing the backup, and exactly a thousand instructions later the CPU will force an interrupt into the virtual machine monitor. That's special hardware, but it turns out it's on all Intel chips, so it's not that special anymore — fifteen years ago it was exotic, now it's totally normal. And there are a lot of other uses for it: if you want to do CPU time profiling, one way is to have the microprocessor interrupt every thousand instructions, and this is the same hardware that would cause the microprocessor to generate an interrupt every thousand instructions. So it's a very natural sort of gadget to want in your CPU.

A student asks: what if the backup gets ahead of the primary? So, suppose we, looking down from above, know that the primary is about to take an interrupt at the millionth instruction, but the backup has already executed the million-and-first instruction. If we let that happen, it's going to be too late to deliver the interrupt at the same point in the primary's instruction stream and the backup's instruction stream. So we cannot let the backup get ahead of the primary in execution. The way VMware FT does that is that the backup's virtual machine monitor keeps a buffer of waiting events that have arrived from the primary, and it will not let the backup execute unless there's at least one event in that buffer. If there's an event in the buffer, then it knows, from the instruction number, the place at which it's got to force the backup to stop executing. So the backup is always executing with the CPU being told exactly where the next stopping point — the next instruction number at which to stop — is, because the backup only executes if it has an event that tells it where to stop next. That means it starts up after the primary: the backup can't even start executing until the primary has generated the first event and that event has arrived at the backup. So the backup is basically always at least one event behind the primary, and if it's slower for some other reason — maybe there's other stuff running on that physical machine — then the backup might get multiple events behind the primary.
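Here's a sketch of that never-get-ahead rule. The runUntil function is a hypothetical stand-in for the special CPU support that traps after a given instruction count:

```go
package main

import "fmt"

// LogEntry is an event from the primary, tagged with the instruction
// number at which it must be delivered.
type LogEntry struct {
	InstructionNumber uint64
	Data              []byte
}

// runUntil stands in for the special CPU support: execute the backup
// guest, but force a trap back into the VMM at the given instruction
// number. Hypothetical, for illustration only.
func runUntil(instr uint64) {
	fmt.Printf("backup executed up to instruction %d\n", instr)
}

// backupExecute never lets the backup run ahead of the primary: it
// executes only when a buffered event says where the next stop is.
func backupExecute(events <-chan LogEntry) {
	for ev := range events { // block until at least one event is buffered
		runUntil(ev.InstructionNumber) // stop exactly where the primary saw it
		fmt.Printf("deliver event at %d: %q\n", ev.InstructionNumber, ev.Data)
	}
}

func main() {
	events := make(chan LogEntry, 16)
	events <- LogEntry{InstructionNumber: 1000, Data: []byte("timer tick")}
	events <- LogEntry{InstructionNumber: 2000, Data: []byte("packet")}
	close(events)
	backupExecute(events)
}
```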
All right, there's one little piece of mess about the specific case of arriving packets. Ordinarily, when a packet arrives from the network — if we weren't running a virtual machine — the network interface card would DMA the packet contents into the memory of the computer it's attached to, as the data arrives off the wire. And that means — you should never write software like this, but — it could be that the operating system running on the computer actually sees the data of a packet as it's DMA'd, copied from the network interface card into memory. We don't know what operating system will run here; the system is designed so that it can support any operating system, and maybe there's an operating system out there that watches arriving packets in memory as they're copied in. We can't let that happen, because if the primary happens to be playing that trick — if we allowed the network interface card to DMA incoming packets directly into the memory of the primary — we don't have any control over the exact timing of when the network interface card copies data into memory, and so we're not going to know at what times the primary did or didn't observe data from the arriving packet. So what that means is that, in fact, the NIC copies incoming packets into private memory of the virtual machine monitor, and then the network interface card interrupts the virtual machine monitor and says: a packet has arrived. At that point the virtual machine monitor suspends the primary, remembers what instruction number it suspended at, copies the entire packet into the primary's memory while the primary is suspended and not looking at the copy, and then emulates a network interface card interrupt into the primary. Then it sends the packet and the instruction number to the backup. The backup's virtual machine monitor will likewise suspend the backup at that instruction number and copy the entire packet — again, the backup is guaranteed not to be watching the data arrive — and then fake an interrupt at the same instruction number as on the primary. This is the bounce buffer mechanism explained in the paper.
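And a miniature of the bounce-buffer idea — all the types here are mine; the point is just that the copy into guest memory happens while the guest is suspended, so the guest can never observe a half-arrived packet:

```go
package main

import "fmt"

// Guest models a suspended/resumed virtual machine with its own memory.
type Guest struct {
	Memory    []byte
	suspended bool
}

func (g *Guest) Suspend() { g.suspended = true }
func (g *Guest) Resume()  { g.suspended = false }

// deliverPacket is the VMM's bounce-buffer path: the NIC has DMA'd the
// packet into the VMM's private buffer (not guest memory); now copy it
// into the guest atomically, while the guest cannot be watching.
func deliverPacket(g *Guest, bounceBuffer []byte) {
	g.Suspend() // guest is not executing, so it can't observe the copy
	copy(g.Memory, bounceBuffer)
	g.Resume()
	fmt.Println("fake NIC interrupt into guest") // at a recorded instruction number
}

func main() {
	g := &Guest{Memory: make([]byte, 64)}
	deliverPacket(g, []byte("client request"))
	fmt.Printf("guest memory: %q\n", g.Memory[:14])
}
```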
In answer to a question: the only instructions that result in logging channel traffic are the weird instructions, which are rare — instructions that might yield a different result if executed on the primary and the backup, like an instruction to get the current time of day, or the current processor number, or to ask how many instructions have been executed. Those turn out to be relatively rare. There's also an instruction on some machines to ask for a hardware-generated random number, for cryptography or something. But those are not everyday instructions; most instructions, like add instructions, are going to get the same result on primary and backup.

And yes — the way client packets get replicated to the backup is just by forwarding. That's exactly right: each network packet is packaged up and forwarded as it is, as a network packet, and is interpreted by the TCP/IP stack on both sides. So I'm expecting 99.99% of the logging channel traffic to be incoming packets, and only a tiny fraction to be results from special non-deterministic instructions. And so we can kind of guess what the traffic load is likely to be: for a server that serves clients, it's basically a copy of every client packet, and then we'll know how fast the logging channel has to be.

All right, so it's worth talking a little bit about how output works. In this system, output basically means only one thing: sending packets. Client requests come in as network packets, the response goes back out as network packets, and there's really no other form of output. As I mentioned, both primary and backup compute the output packet they want to send and ask their emulated NIC to send it; it's really sent on the primary, and the output packet is simply discarded on the backup.

Okay, but it turns out to be a little more complicated than that. Suppose what we're running is some sort of simple database server, and the client operation our database server supports is increment: the client sends an increment request, the database server increments the value and sends back the new value. Say everything's fine so far, and the primary and backup both have the value 10 in memory as the current value of the counter, and some client on the local area network sends an increment request to the primary. That packet is delivered to the primary, it's executed by the primary's server software, and the primary says: current value is 10, I'm going to change it to 11 — and sends a response packet back to the client saying 11. The same request, as I mentioned, is also sent to the backup and processed there: it changes its 10 to 11 as well, generates a reply, and throws the reply away. That's what's supposed to happen with the output.

However, you also need to ask yourself what happens if there's a failure at an awkward time. In this class, you should always ask yourself: what's the most awkward time to have a failure, and what would happen if a failure occurred then?
So suppose the primary does indeed generate the reply back to the client, but the primary crashes just after sending its reply. And furthermore — much worse — this is just a network, and it doesn't guarantee to deliver packets, so let's suppose the log entry on the logging channel got dropped when the primary died. So now the state of play is: the client received a reply saying 11, but the backup did not get the client request, so its state is still 10. Now the backup takes over, because it sees the primary is dead, and this client — or maybe some other client — sends an increment request to the new primary. The backup is now really processing these requests, so when it gets the next increment request, it's going to change its state to 11 and generate a second "11" response — maybe to the same client, maybe to a different client. If the clients compare notes, or if it's the same client, that's just obviously something that could not have happened with a single server. And because we have to support unmodified software that does not know there's any funny business of replication going on, we do not have the opportunity to change the client — you can imagine the client could be changed to realize something funny happened with the fault tolerance and do, I don't know what, but we don't have that option here, because this whole system only makes sense if we're running unmodified software. So this would be a disaster; we can't let this happen.

Does anybody remember from the paper how they prevent this from happening? Yeah — the output rule. The output rule is their solution to this problem, and the idea is that the primary is not allowed to generate any output — and what we're talking about now is this reply — until the backup acknowledges that it has received all log records up to this point.

So here's the real sequence at the primary. Let's un-crash the primary and go back to the beginning, with both replicas at 10. With the output rule, at the time the input arrives, the virtual machine monitor sends a copy of the input to the backup — so the time at which this log message with the input is sent is strictly before the primary generates the output, which is sort of obvious. After firing this log entry off across the network — it's now heading towards the backup, though it might be lost — the virtual machine monitor delivers the request to the primary server software, and it generates the output. So now the primary has actually changed its state to 11 and generated an output packet that says 11, but the virtual machine monitor says: wait a minute, we're not allowed to release that output until all previous log records have been acknowledged by the backup — and this log entry carrying the input is the most recent
previous log message. So the output is held by the virtual machine monitor until the log entry containing the input packet from the client is delivered to the backup's virtual machine monitor and buffered by it — not necessarily executed; it may just be waiting for the backup to get to that point in the instruction stream. Then the virtual machine monitor on the backup sends an acknowledgment packet back saying "yes, I did get that input," and only when the acknowledgment comes back will the virtual machine monitor on the primary release the packet out onto the network. The idea is that if the client could have seen the reply, then necessarily the backup must have seen the request, and at least buffered it — so we no longer get this weird situation in which a client can see a reply, but then there's a failure and a cutover, and the replica didn't know anything about the request behind that reply. There's also the situation where this log entry was lost and then the primary crashes: since the entry hadn't been delivered, the backup hadn't sent the acknowledgment, which means that if the primary crashed, it must have crashed before its virtual machine monitor released the output packet — and therefore the client couldn't have gotten the reply, so it's not in a position to spot any irregularities. Everybody happy with the output rule?
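Here's the output rule as a small sketch, assuming the simplest possible bookkeeping — sequence-numbered log entries and a cumulative acknowledgment; the names and structure are mine, not the paper's:

```go
package main

import "fmt"

// primaryVMM applies the output rule: an output packet generated by the
// guest may not be released until the backup has acknowledged every log
// entry sent so far.
type primaryVMM struct {
	lastSent  uint64   // highest log-entry sequence number sent to the backup
	lastAcked uint64   // highest sequence number the backup has acknowledged
	held      [][]byte // output packets waiting for the backup's ack
}

func (p *primaryVMM) SendLogEntry() { p.lastSent++ }

func (p *primaryVMM) GuestOutput(pkt []byte) {
	if p.lastAcked < p.lastSent {
		p.held = append(p.held, pkt) // hold: backup might not have the input yet
		return
	}
	fmt.Printf("release %q\n", pkt)
}

func (p *primaryVMM) AckArrived(seq uint64) {
	p.lastAcked = seq
	if p.lastAcked >= p.lastSent {
		for _, pkt := range p.held {
			fmt.Printf("release %q\n", pkt)
		}
		p.held = nil
	}
}

func main() {
	vmm := &primaryVMM{}
	vmm.SendLogEntry()                  // the input is forwarded to the backup...
	vmm.GuestOutput([]byte("reply 11")) // ...and the guest replies before the ack
	vmm.AckArrived(1)                   // only now does the reply go out
}
```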
okay, so the 61:59 primary has to delay at this 62:02 point, waiting for the backup to say that 62:07 it's up to date. this is a real 62:09 performance thorn in the side of just 62:12 about every replication scheme, this sort 62:15 of synchronous wait, where we can't 62:18 let the primary get too far ahead of the 62:19 backup, because if the primary failed 62:22 while it was ahead, that would leave the 62:24 backup lagging behind what clients saw, 62:27 right? so just about every replication 62:30 system has this problem, that at some 62:31 point the primary has to stall waiting 62:34 for the backup, and it's a real limit on 62:36 performance. even if the machines are 62:38 side by side in adjacent racks, 62:40 it's still, you know, we're talking about 62:41 half a millisecond or something to 62:44 send messages back and forth with the 62:45 primary stalled, and if we want to 62:49 withstand earthquakes or citywide power 62:51 failures, you know, the primary and the 62:53 backup have to be in different cities, 62:54 that's probably five milliseconds apart. 62:56 if we put 62:59 the two replicas in 63:01 different cities, every packet of output 63:03 the primary produces has to first wait 63:05 the five milliseconds or whatever to 63:08 have the last log entry get to the 63:09 backup and have the acknowledgment come 63:11 back, and then we can release the 63:12 packet. and, you know, for sort of low- 63:15 intensity services that's not a problem, 63:18 but if we're building a, you know, 63:19 database server that, 63:21 if it weren't for this, 63:22 could process millions of requests per 63:25 second, then 63:25 that's just unbelievably damaging for 63:28 performance, and this is a big reason why 63:31 people, if they 63:34 possibly can, use a replication scheme 63:38 that's operating at a higher level and 63:39 kind of understands the semantics of 63:41 operations, so it doesn't have to 63:42 stall on every packet. you know, it could 63:45 stall on every high-level operation, or 63:47 even notice that, well, read-only 63:49 operations don't have to stall at all, 63:51 it's only writes that have to stall, or 63:52 something, but there has to 63:54 be an application-level replication 63:55 scheme to realize that. you're 64:04 absolutely right, so the observation is 64:06 that you don't have to stall the 64:07 execution of the primary, you only have 64:08 to hold the output, and so maybe that's 64:11 not as bad as it could be, but 64:13 nevertheless it means that, you 64:16 know, in a service that could otherwise 64:17 have responded in a couple of 64:19 microseconds to the client, you know, if 64:22 we have to first update the replicas in 64:24 the next city, we turn a 64:27 ten-microsecond interaction into a ten- 64:29 millisecond interaction. possibly, if you 64:36 have vast numbers of clients submitting 64:39 concurrent requests, then you may be 64:41 able to maintain high throughput even 64:43 with high latency, but you have to be 64:46 lucky, or a very clever designer, to get 64:49 that. 65:01 that's a great idea, but if you log in 65:04 the memory of the primary, that log will 65:06 disappear when the primary crashes; 65:08 the usual semantics of a server 65:10 failing is that you lose everything 65:13 inside the box, like the contents of 65:16 memory. and even if you didn't, 65:19 if the failure is that somebody 65:21 unplugged the power cable accidentally 65:23 from the primary, even if the primary 65:25 has battery-backed RAM or I 65:27 don't know what, you can't get at it, 65:30 the backup can't get at it. so 65:32 in fact this system does log the output, 65:36 and the place it logs it is in the 65:37 memory of the backup, and in order to 65:39 reliably log it there, you have to 65:42 observe the output rule and wait for the 65:43 acknowledgment. so it's an entirely correct 65:46 idea, you just can't use the primary's memory 65:48 for it. yes? 65:58 say it again? that's a clever idea. 66:06 so the question is, maybe input 66:08 should go to the primary, but output 66:11 should come from the backup. 66:12 I completely haven't thought this 66:14 through, that might work, 66:17 I don't know, that's interesting. 66:29 yeah, maybe I will.
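[An aside putting rough numbers on that throughput point, using Little's law: throughput is roughly the number of concurrent in-flight requests divided by per-request latency. With a 10-millisecond output-rule delay, one client doing one request at a time gets only about 100 requests per second; but if the service can keep, say, 1,000 independent client requests in flight at once, it can still sustain roughly 1000 / 0.010 = 100,000 requests per second despite the latency. These figures are invented for illustration, not taken from the paper.]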
okay, one possibility this does expose, 66:42 though, is the situation where 66:56 the primary crashes after its 66:58 output is released, so the client does 67:00 receive the reply, and then the primary 67:02 crashes. the backup's input is still in 67:07 this event buffer in the virtual machine 67:09 monitor of the backup; it hasn't been 67:13 delivered to the actual replicated 67:15 service. when the backup goes live after 67:18 the crash of the primary, the backup 67:22 first has to consume all of the 67:25 log records that are lying around that 67:27 it hasn't consumed yet; it has to catch up 67:29 to the primary, otherwise it won't take 67:30 over with the same state. so before the 67:33 backup can go live, it actually has to 67:34 consume all these entries, and the last entry 67:37 presumably is the request from the 67:41 client, so the backup will be live 67:45 after the interrupt that 67:49 delivers the request from the client, and 67:51 that means that the backup will, you know, 67:54 increment its counter to eleven and then 67:56 generate an output packet, and since it's 67:58 live at this point, it will send the 68:01 output packet, and the client will get 68:04 two 11 replies, which, if 68:10 that really happened, would be anomalous, 68:15 possibly not something that could 68:18 happen if there was only one server. the 68:22 good news is that almost certainly 68:25 the client is 68:27 talking to this service using TCP, 68:29 and the request and the 68:30 response go back and forth on a TCP 68:32 channel. when the backup takes over, 68:35 since its state is identical 68:37 to the primary's, it knows all about that 68:39 TCP connection and what all the 68:40 sequence numbers are and whatnot, and 68:43 when it generates this packet, it will 68:46 generate it with the same TCP sequence 68:49 number as the original packet, and the TCP 68:52 stack on the client will say, oh wait a 68:53 minute, that's a duplicate packet, 68:55 we'll discard the duplicate packet at 68:57 the TCP level, and the user-level 68:59 software will just never see this 69:00 duplicate. and so, you 69:04 know, you can view this as a kind of 69:09 accidental or clever trick, but the fact 69:11 is, for any replication system where 69:14 cutover can happen, which is to say 69:16 pretty much any replication system, it's 69:20 essentially impossible to design them in 69:22 a way that they are guaranteed not to 69:24 generate duplicate output. basically, you 69:29 can err on either side: you can 69:31 either not generate the 69:33 output at all, which would 69:36 be terrible, or you can generate 69:37 the output twice on a cutover. there's 69:41 basically no way to 69:42 guarantee it's generated only once, so everybody 69:44 errs on the side of possibly 69:46 generating duplicate output, and that 69:49 means that, at some level, you know, the 69:51 client side of all replication schemes 69:53 needs some sort of duplicate detection 69:55 scheme. here we get to use TCP's; if we 69:57 didn't have TCP, it would have to be 69:59 something else, maybe application-level 70:01 sequence numbers, or I don't know what. 70:03 and you'll see 70:06 versions of essentially 70:09 everything I've talked about, like the 70:10 output rule, for example, in labs 2 & 3; 70:14 you'll design your own replicated state 70:17 machine.
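[As a concrete illustration of that application-level fallback: here is a minimal sketch in Go of client-side duplicate suppression using per-request sequence numbers. This is not from the paper or the labs; every name is invented, and a real scheme would also need to garbage-collect old entries.]

```go
package dedup

// Reply is what the service sends back; it echoes the sequence number
// of the request it answers. Invented for this sketch.
type Reply struct {
	Seq   int
	Value int
}

// Client tracks which replies it has already seen.
type Client struct {
	nextSeq  int
	received map[int]bool // request seqs we've already seen replies for
}

func NewClient() *Client {
	return &Client{received: make(map[int]bool)}
}

// NextRequestSeq tags each outgoing request with a fresh number, so a
// replica that took over can be caught re-sending an old reply.
func (c *Client) NextRequestSeq() int {
	c.nextSeq++
	return c.nextSeq
}

// Deliver returns true the first time a reply for a given request
// arrives, and false for duplicates (e.g. the backup re-sending a reply
// the primary already sent before it crashed) -- the same suppression
// TCP's sequence numbers give us for free.
func (c *Client) Deliver(r Reply) bool {
	if c.received[r.Seq] {
		return false // duplicate: discard, just like TCP would
	}
	c.received[r.Seq] = true
	return true
}
```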
yes? 70:45 yes to the first part. so the scenario is: 70:48 the primary sends the reply, and then 70:51 either the primary sends the close 70:53 packet, or the client closes the 70:55 TCP connection after it receives the 70:57 primary's reply. so now there's no 70:58 connection on the client side, but there 71:00 is a connection on the backup side, and 71:02 so now the backup consumes 71:06 the very last log entry, the one that has the 71:07 input, and is now live. we're not 71:10 responsible for replicating anything at 71:12 this point, right, because the backup is now 71:14 live; there's no other replica, since the 71:16 primary died. so if the backup fails to execute in 71:23 lockstep with the primary, that's fine, 71:24 actually, because the primary is dead, 71:26 and we do not want to execute in 71:28 lockstep with it. okay, so the backup is 71:30 now live; it generates an output 71:33 on this TCP connection that isn't closed 71:37 yet from the backup's point of view. this 71:39 packet arrives at the client on a TCP 71:41 connection that doesn't exist anymore. 71:43 from the client's point of view, 71:45 no big whoop: the client is 71:46 just going to throw away the packet as 71:48 if nothing happened; the application 71:50 won't know. the client may send a reset, 71:52 some kind of TCP error packet or whatever, 71:54 back to the backup, and the backup 71:57 does something or other with it, but it 71:58 doesn't matter, because we're not 72:00 diverging from anything; there's 72:02 no primary to diverge from. it can just 72:04 handle a stray reset however it likes, 72:08 and what it'll in fact do is basically 72:10 ignore it. now that the backup has 72:14 gone live, we just don't owe 72:17 anybody anything as far as replication goes. 72:19 yeah? 72:36 well, you can bet, since the backup's 72:39 memory image is identical to the 72:40 primary's image, that they're sending 72:42 packets with the very same source, TCP 72:45 sequence numbers, and very same everything; 72:48 they're sending bit-for-bit identical 72:51 packets. you know, at this level the 73:00 servers don't have IP addresses; for 73:03 our purposes, the virtual machines, you 73:06 know, the primary and the backup virtual 73:08 machines, have IP addresses, but the 73:12 physical computer and the VMM are 73:15 transparent to the network. that's not 73:17 entirely true, but it's basically the 73:19 case that the virtual machine monitor and 73:21 the physical machine don't really have 73:23 an identity of their own on the network, 73:26 because you can configure them that 73:29 way. instead, the 73:31 virtual machine, with its own 73:33 operating system and its own TCP stack, 73:35 has an IP address, an Ethernet 73:36 address, and all this other stuff, which 73:37 is identical between the primary and the 73:39 backup, and when it sends a packet, it 73:41 sends it with the virtual machine's IP 73:42 address and Ethernet address, and those 73:44 bits, at least in my mental model, are 73:49 simply passed through onto the local 73:51 area network. it's exactly what we want, 73:54 and so I think it generates 73:55 exactly the same packets that the 73:57 primary would have generated. there's 73:59 maybe a little bit of trickery: 74:00 if these 74:03 are actually plugged into an Ethernet 74:04 switch, the two physical machines maybe 74:06 in two different ports of an 74:08 Ethernet switch, we'd like the 74:09 Ethernet switch to change its mind about 74:12 which of these two machines it 74:14 delivers packets to for the replicated 74:18 service's Ethernet address, and so there's 74:20 a little bit of funny business there, but for 74:23 the most part they're just generating 74:24 identical packets, so they can just send 74:26 them out. 74:29 okay, so another little detail I've been 74:33 glossing over is that I've been assuming 74:36 that the primary just fails or the 74:38 backup just fails, that is, fail-stop, 74:41 right? but that's not the only option. 74:43 another very common situation that has 74:46 to be dealt with is if the two machines 74:49 are still
up and running and executing, 74:51 but something funny happens on 74:53 the network that causes them not to be 74:56 able to talk to each other, but to still 74:58 be able to talk to some clients. if 75:01 that happened, if the primary and backup 75:03 couldn't talk to each other but they 75:05 could still talk to the clients, they 75:06 would both think, oh, the other replica's 75:07 dead, I better take over and go live, and 75:10 so now we have two machines going live 75:12 with this service, and now, you know, 75:14 they're no longer sending each other log 75:16 entries or anything, they're just 75:17 diverging: maybe they're accepting 75:19 different client inputs and changing their 75:21 state in different ways. so now we have 75:22 a split-brain disaster if we let the 75:24 primary and the backup go live when it 75:28 was the network that had some kind of 75:30 failure instead of these machines. 75:34 and the way that this paper solves it 75:36 is by appealing to an outside authority 75:41 to make the decision about which of the 75:44 primary and the backup is allowed to be 75:46 live. and so, it turns 75:53 out that their storage is actually not 75:54 on local disk, this almost doesn't matter, 75:56 but their storage is on some external 75:58 disk server, and, as well as being a disk 76:01 server, as a totally separate 76:03 service that has nothing to do with disks, 76:05 this server happens to export this 76:07 test-and-set service over 76:15 the network, where you can send a 76:17 test-and-set request to it, and there's 76:19 some flag it's keeping in memory, and 76:21 it'll set the flag and return what the 76:23 old value was. so both primary and backup 76:25 have to sort of acquire this test-and-set 76:28 flag, it's a little bit like a lock: 76:30 in order to go live, they both maybe 76:32 send test-and-set requests at the same 76:34 time to this test-and-set server, the 76:37 first one gets back a reply that says, oh, 76:39 the flag used to be zero, now it's one; for the 76:42 second request to arrive, the response 76:44 from the test-and-set server is, oh, 76:46 actually the flag was already one when 76:47 your request arrived, so basically 76:50 you're not allowed to be primary. and so 76:52 this test-and-set server, and we can 76:55 think of it as a single machine, is the 76:58 arbitrator that decides which of the two 77:00 should go live if they both think the 77:02 other one's dead due to a network 77:04 partition. any questions about this 77:08 mechanism? you're busted, yeah, if the test- 77:17 and-set server should be dead at the 77:19 critical moment. and so, actually, 77:22 even if there's not a network partition, 77:24 under all circumstances in which one or 77:27 the other of these wants to go live 77:28 because it thinks the other's dead, even 77:30 when the other one really is dead, the 77:32 one that wants to go live still has to 77:33 acquire the test-and-set lock, because 77:35 one of the deep rules of the 6.824 77:39 game is that you cannot tell whether 77:43 another computer is dead or not; all you 77:45 know is that you stopped receiving 77:47 packets from it, and you don't know 77:49 whether it's because the other computer 77:50 is dead, or because something has gone 77:53 wrong with the network between you and 77:55 the other computer. so all the backup 77:57 sees is, well, I've stopped seeing packets; maybe 77:59 the primary is dead, maybe it's alive; the 78:00 primary probably sees the same thing. so 78:03 if there's a network partition, they 78:04
certainly have to ask the test-and-set 78:05 server, but since they don't know if it's 78:07 a network partition, they have to ask the 78:08 test-and-set server regardless of whether 78:11 it's a partition or not. so anytime 78:13 either wants to go live, the test-and-set 78:15 server also has to be alive, because they 78:17 always have to acquire this test-and-set 78:19 lock. so the test-and-set server 78:22 sounds like a single point of failure: 78:24 we were trying to build a replicated, 78:26 fault-tolerant whatever thing, but in the 78:29 end, you know, we can't fail over 78:30 unless this is alive. so that's a bit of 78:35 a bummer. 78:36 I'm guessing, though, I'm making a strong 78:39 guess, that the test-and-set server is 78:41 actually itself a replicated service and 78:44 is fault tolerant. it's almost 78:46 certainly so; I mean, these people are 78:49 VMware, they're happy to sell you a 78:50 million-dollar highly available storage 78:53 system that 78:54 uses enormous amounts of replication 78:56 internally, and since the test-and-set 78:59 thing is on their disk server, I'm 79:00 guessing it's replicated too. and the 79:03 stuff you'll be doing in lab 2 and lab 3 79:05 is more than powerful enough for you to 79:07 build your own fault-tolerant test-and- 79:11 set server, so this problem can easily be 79:13 eliminated.
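[For reference, the core of such an arbiter is tiny. Here is a minimal single-machine sketch in Go of the test-and-set service described above. The paper doesn't show its implementation, so the names here are invented, and, as just discussed, a production version would itself be replicated, e.g. with the Raft-style machinery from the labs.]

```go
package arbiter

import "sync"

// TestAndSetServer is an invented illustration of the arbiter that the
// paper's shared disk server exports; it is not VMware's code. Note it
// is a single point of failure unless replicated.
type TestAndSetServer struct {
	mu   sync.Mutex
	flag bool
}

// TestAndSet atomically sets the flag and returns its previous value.
func (s *TestAndSetServer) TestAndSet() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.flag
	s.flag = true
	return old
}

// tryGoLive is what a replica calls when it suspects its peer is dead.
// Exactly one caller sees a previous value of false and wins the right
// to go live; the loser must not serve clients, which prevents split brain.
func tryGoLive(s *TestAndSetServer) bool {
	return !s.TestAndSet() // false previous value => we acquired the lock
}
```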