Transcript

All right, today's topic is distributed transactions. These come in really two implementation pieces, and that's how I'll cover them. The first big piece is concurrency control; the second is atomic commit.

The reason distributed transactions come up is that it's very frequent for people with large amounts of data to end up splitting, or sharding, the data over many different servers. So maybe if you're running a bank, for example, the bank balances for half your customers are on one server and the bank balances for the other half are on a different server. That splits the load, both the processing load and the space requirements. This comes up for other things too. Maybe you're recording vote counts on articles at a website; maybe there are so many millions of articles that half the vote counts are on one server and half the vote counts are on another.

But some operations require touching, modifying, or reading data on multiple different servers. If we're doing a bank transfer from one customer to another, their balances may be on different servers, and therefore in order to do the transfer we have to read and write data on two different servers.

One way of building these systems (and we'll see others later on in the course) is to try to hide the complexity of splitting this data across multiple servers, to hide it from the application programmer. This has traditionally been a database concern for many decades, so a lot of today's material originated with databases, but the ideas have been used much more widely in distributed systems which you wouldn't necessarily call a traditional
database.

The way people usually package up concurrency control plus atomic commit is in an abstraction called a transaction, which we've seen before. The idea is that the programmer has a bunch of different operations, maybe on different records in the database, and they'd like all those operations to be a single unit, not split by failures or by observation from other activities. The transaction processing system will require the programmer to mark the beginning and the end of that sequence of reading, writing, and updating operations, in order to mark the beginning and end of the transaction, and the transaction processing system will provide certain guarantees about what happens between the beginning and the end.

So for example, suppose we're running our bank and we want to do a transfer from the account of user Y to the account of user X. The balances for both of them start out as 10, so initially x equals 10 and y equals 10, and x and y are meant to be records in a database. We'll actually imagine that there are two transactions that might be running at the same time: one to transfer a dollar from account Y to account X, and the other to do an audit of all the accounts at the bank, to make sure that the total amount of money in the bank never changes. After all, if you do transfers, the total shouldn't change even if you move money between accounts.

In order to express this with transactions, we might have two transactions. The first transaction, call it T1, is the transfer. The programmer is expected to mark the beginning of it with a begin-transaction, which I'll write at the beginning, and then the
operations on the two balances, on the two records in the database. So we might add 1 to the balance x and add -1 to y, and then we need to mark the end of the transaction.

Concurrently, we might have a transaction that's going to do an audit of all the balances: find the sum, or look at all the balances, and make sure they add up to a number that doesn't change despite transfers. So for the second transaction, the audit transaction, we also need to mark the beginning and end. This time we're just reading; it's a read-only transaction. We need to get the current balances of all the accounts; let's say there are just these two accounts for now. So we have two temporary variables. We read the first one, which is going to be the value of balance x (I'll just write get to mean we're reading that record), we also read y, and we print them both. And that's the end of the transaction.

The question is: what are legal results from these two transactions? That's the first thing we want to establish: given the starting state, namely the two balances of ten dollars, what could be the final results after you've run both these transactions, maybe at the same time? So we need a notion of what would be correct, and once we know that, we need to be able to build machinery that will actually execute these transactions and get only those correct answers despite concurrency and failures.

So first, what's correctness? Databases usually have a notion of correctness called ACID. The A stands for atomic, and this means that, for a transaction that has multiple steps, maybe writing multiple different records, despite
failures, either all of the writes should be done or none of them. It shouldn't be the case that a failure at an awkward time in the middle of a transaction leaves half the updates completed and visible and half the updates never done. It's all or nothing despite failures.

The C stands for consistent. We're actually not going to worry about that; it usually refers to the fact that the database will enforce certain invariants declared by the application. It's not really our concern today.

The I, though, is quite important. It usually stands for isolated, and this is really a property of whether or not two transactions that run at the same time can see each other's changes before the transactions have finished, whether or not they can see intermediate updates from the middle of another transaction. The goal is no. The technically specific thing that most people mean by isolation is that the transaction execution is serializable. I'll explain what that means in a bit, but it boils down to this: transactions can't see each other's changes, can't see intermediate states, but only complete transaction results.

The final D stands for durable. This means that after a transaction commits, after the client or whatever program submitted the transaction gets a reply back from the database saying yes, we've executed your transaction, the transaction's modifications to the database will be durable: they'll still be there, and they won't be erased by some sort of failure. In practice that means stuff has to be written into some non-volatile storage, persistent storage like a disk. So today, and in fact
for this whole course, our concerns are going to revolve around good behavior with respect to failures, good behavior with respect to multiple parallel activities, and making sure that the data is still there afterwards even if something crashes.

The most interesting part of this for us is the specific definition of isolated, or serializable, so I'm going to lay that out before talking about how it actually applies to these transactions. The I in ACID, isolated, usually means serializable, and the definition is this. If a set of transactions executes concurrently, more or less at the same time, they yield a set of results. Here the results refer to both the new database records created by any modifications the transactions might do and, in addition, any output that the transactions produce. So for our transactions, these two adds change records, so the changed records are part of the results, and the output of this print statement is part of the results.

The definition says the results are serializable if there exists some serial order of execution of the transactions that yields the same results as the actual execution. So we're going to say a specific concurrent execution of transactions is serializable if there exists some serial order (really emphasizing serial here) of execution of those same transactions that yields the same result as the actual execution. The difference is that the actual execution may have had a lot of parallelism in it, but it's required to produce the same result as some one-at-a-time execution of the same transactions. So the way you check whether some concurrent execution is
serializable is you look at the results and see if you can find some one-at-a-time execution of the same transactions that produces the same results.

For our transactions up here, there are only two one-at-a-time serial orders available: transaction 1 then transaction 2, or transaction 2 then transaction 1. So we can just look at the results they would produce if executed one at a time in each of these orders. If we execute T1 and then T2, then we get x equals 11 and y equals 9, and since T1 executed first, this print statement sees these two updated values, so it will print the string 11, 9. The other possible order is that T2 ran first and then T1. In that case T2 will see the two records before they were modified, but the modifications will still take place since T1 runs later. So the final results will again be x equals 11, y equals 9, but this time T2 saw the before values, so it prints 10, 10.

So these are the two legal results for serializability, and if we ever see anything else from running these two transactions at the same time, we'll know that the database we're running against does not provide serializable execution; it's doing something else. So while we're thinking through what would happen if this or that, we'll always check against these: these are the only two legal results, and we'd better be doing something that produces one or the other.

It's interesting to note that there's more than one possible result depending on the actual order. If you submit these two transactions at the same time, you don't know whether it's going to be T1, T2 or T2, T1, so you have to be willing to expect more than one possible legal result. And as you have
more and more transactions running concurrently, in more complicated situations, there may be many, many possible different correct results that are all serializable, because there are many orders that could be used to fulfill this requirement.

Okay, so now that we have a definition of correctness, and we even know what all the possible results are, we can ask a few what-if questions about how these could execute. For example, suppose that the way the system actually executed this was that it started transaction 2 and got as far as just after reading x, and then transaction 1 ran at this point, and then after transaction 1 finished, transaction 2 continued executing. It turns out that with other transactions than these, that might actually be legal, but here we want to know if it's legal. So we're wondering: if we actually executed that way, what results would we get, and are they the same as either of these two? Well, if we execute transaction 1 here, then the audit's temporary variable t1 is going to hold the value 10, and its temporary t2 is going to hold the value of y after the decrement. So the temporaries will be 10 and 9, and this print will print 10, 9. That's neither of these two outputs, so executing in the way I just drew is not serializable; it would not be legal.

Another interesting question is: what if we started executing transaction 1 and got as far as just after the first add, and at that point all of transaction 2 executed, right here? That would mean that at this point x has the value 11, so transaction 2 would read 11 and 10 and print 11, 10. And 11, 10 is not one of these two legal results, so this execution is also not legal for these two transactions.

So the reason why serializability is a popular and useful
definition of what it means for execution of transactions to be correct is that it's a very easy model for programmers. You can write complicated transactions without having to worry about what else may be running in the system. There may be lots of other transactions; they may be using the same data as you; they may be trying to read and write it at the same time; there might be failures; who knows. But the guarantee here is that it's safe to write your transactions as if nothing else was happening, because the final results have to be as if your transaction was executed by itself in this one-at-a-time order, which is a very simple, very nice programming model.

It's also nice that this definition allows truly parallel execution of transactions as long as they don't use the same data. We run into trouble here because these two transactions both use x and y, but if they were using completely disjoint database records, it turns out this definition allows you to build a database system that executes transactions that use disjoint data completely in parallel. And if you have a sharded system, which is what we're working up to today, where different data is on different machines, you can get true parallel speed-up, because maybe one transaction executes in the first shard on the first machine and the other executes in parallel on the second machine. So there are opportunities here for good performance.

Before I dig into how to implement serializable transactions, there's one more small point I want to bring up. It turns out that one of the things we need to be able to cope with is that transactions may, for one reason or another, basically fail or decide to
fail in the middle of the transaction, and this is usually called an abort. For many transaction systems we need to be prepared to handle what should happen if a transaction tries to access a record that doesn't exist, or divides by zero, or maybe, since some transaction implementation schemes use locking, a transaction causes a deadlock and the only way to break the deadlock is to kill one or more of the transactions participating in the deadlock. So one of the things that's going to be hanging in the background, and will come up, is the necessity of coping with transactions that all of a sudden, in the middle, decide they just cannot proceed, maybe really in the middle, after they've done some work and started modifying things. We need to be able to back out of these transactions and undo any modifications they've made.

All right. The implementation strategy for these ACID transactions I'm going to split into two big pieces, and talk about both of them; they're the main topics in the lecture. The first big implementation topic is concurrency control. This is the main tool we use to provide serializability, or isolation. So concurrency control provides isolation from other concurrent transactions that might be trying to use the same data. The other big piece, as I mentioned, is atomic commit, and this is what's going to help us deal with the possibility that this transaction is executing along, and maybe it has modified x, and then all of a sudden there's a failure in one of the servers involved, while other servers that were handling other parts of the transaction are still running. That is, if x and y are on
different machines, we need to be able to recover even if there's a partial failure of only some of the machines the transaction is running on. The big tool people use for that is this atomic commit, which we'll talk about.

All right, so first, concurrency control. There are really two classes, two major approaches to concurrency control, and I'll talk about both during the course; they're the two main strategies.

The first strategy is pessimistic concurrency control, and this is usually locking. We've all done locking in the labs in the context of Go programs, and it turns out databases and transaction processing systems also use locking. The idea here is the same as what you're already quite familiar with: before a transaction uses any data, it needs to acquire a lock on that data, and if some other transaction is already using the data, the lock will be held and we'll have to wait for the other transaction to finish before we can acquire the lock. In pessimistic systems, if there are locking conflicts, if somebody else has the lock, that causes delays. So you're trading performance for correctness.

The other main approach is optimistic. The basic idea here is that you don't worry about whether some other transaction might be reading or writing the data at the same time as you. You just go ahead and do whatever reads and writes you're going to do, although typically into some sort of temporary area, and then only at the end you go and check whether some other transaction might have been interfering. If there's no other transaction, you're done, and you never had to go through any of the overhead or waiting of taking out locks. Locks are reasonably
expensive to manipulate. But if somebody else was modifying the data in a conflicting way at the same time you were, then you have to abort the transaction and retry. The usual name for this is optimistic concurrency control.

It turns out that under different circumstances one of these two strategies can be faster than the other. If conflicts are very frequent, you probably want to use pessimistic concurrency control, because if conflicts are frequent you're going to get a lot of aborts due to conflicts in optimistic schemes. If conflicts are rare, then optimistic concurrency control can be faster because it completely avoids locking overhead. Today will be all about pessimistic concurrency control, and then a later paper, in particular FaRM in a couple of weeks, will deal with an optimistic scheme.

Okay, so today, talking about pessimistic schemes refers basically to locking, and in particular the reading for today was about two-phase locking, which is the most common type of locking. The idea in two-phase locking is that a transaction is going to use a bunch of records, like x and y in our example. The first rule is that you acquire a lock before using any piece of data, before reading or writing any record. The second rule is that a transaction must hold any locks it acquires until after it commits or aborts. You're not allowed to give up locks in the middle of the transaction; you can only accumulate them until after you're done. So this is two-phase locking: the phases are the phase in which we acquire locks, and then the phase in which we just hold onto them until we're done.

So for two-phase
locking, to see why locking works: there's a lot of variation, but typical locking systems associate a separate lock with each record in the database, with each row in each table for example, although they can be more coarse-grained. Transactions start out holding no locks. Let's say transaction 1 starts out holding no locks. When it first uses x, before it's allowed to use it, it has to acquire the lock on x, and it may have to wait. When it first uses y, it acquires another lock, the lock on y. When it finishes, after it's done, it releases both.

If we ran both these transactions at the same time, they're going to race to get the lock on x, and whichever of them manages to get the lock on x first will proceed, finish, and commit. Meanwhile the other transaction, the one that didn't manage to get the lock on x first, is going to sit there waiting before it does anything with x, until it can acquire the lock. So if transaction 2 actually got the lock first, it would get the value of x, get the value of y (since transaction 1 at this point hasn't locked y yet), print, and then finish and release its locks. Only then will transaction 1 be able to acquire the lock on x. As you can see, that basically forces a serial order: in this case it forces the order T2, and then, only when T2 finishes, T1. So it's explicitly forcing an order, which causes the execution to follow the definition of serializability: it really is executing T2 to completion and only then T1. So we do get correct execution.

All right, so one question is why you need to hold the locks until the transaction has
completely finished. You might think that you could just hold a lock while you are actually using the data, and that would be more efficient, and indeed it would. That is, maybe only hold the lock for the period of time in which T2 is actually looking at record x, or maybe only hold the lock on x here for the duration of the add operation and then immediately release it. If transaction 1 immediately released its lock on x (thereby disobeying this rule, of course), then transaction 2 might be able to start a little bit earlier; we'd get more concurrency and higher performance. So this rule is definitely bad for performance, and we want to make pretty sure that it's required for correctness.

So what would happen if transactions did actually release locks as early as possible? Suppose T2 here reads x and then immediately releases its lock on x. At this point in the execution T2 doesn't hold any locks, because it has just (illegally) released the lock on x. Since it holds no locks, T1 could completely execute right here, and we already know from before that this interleaving is not correct, as it doesn't produce either of these two outputs. Similarly, if T1 released its lock on x after it finished adding 1 to x, that would allow all of T2 to slip in right here, and we also know from before that that produces illegal results.

There's an additional kind of problem that can come up with releasing locks after modifying data. If T1 were to release the lock on x, it might allow T2 to see the modified version of x here, to see the x after adding 1 to
it, and to print that output, and for T2 to complete after printing the incremented value of x. Suppose transaction 1 were to abort after that point, maybe because bank balance y doesn't exist, or maybe bank balance y exists but its balance is zero, and we're not allowed to decrement a zero bank balance because that's an overdraft. So T1 might modify x and then abort, and part of the abort has to be undoing its update to x in order to maintain atomicity. What that would mean, if it had released the lock, is that transaction 2 would have seen this phantom value of 11 that went away because T1 aborted. It would have seen a value that, according to the rules, never existed. Because if transaction 1 aborts, then it's as if it never existed, and that means the results from T2 had better be as if T2 ran by itself without T1 at all. But if it sees the increment, then it's going to print 11 for x (11, 10, actually), which just doesn't correspond to any state of the database, given that T1 didn't really complete.

Okay, so those are two dangers, two violations of serializability, that are averted because transactions hold their locks until they're done.

A further thing to note about these rules is that it's very easy for them to produce deadlock. For example, suppose we have two transactions: one of them reads record x and then reads record y, and the other transaction reads y and then x. That's just a deadlock. If they run at the same time, each of them gets the lock on the record it read first; they don't release until the transactions finish; so they both sit there waiting for the lock that's held by the other transaction. And unless the database does something clever,
which it will, they'll deadlock forever. In fact, transaction systems have various strategies, including tracking cycles or timeouts, in order to detect that they've gotten into this situation. The database will abort one of these two transactions, undo all its changes, and act as if that transaction never occurred.

Okay, so that's concurrency control with two-phase locking, and this is just completely standard database behavior so far. It's the same in single-machine databases as it will be in the distributed databases that are a little more of interest to us. But our next topic is actually specific to building databases, or storage systems in general, that support transactions in a distributed setting, that is, splitting the data over multiple machines.

So now the topic is how to build distributed transactions, and in particular how to cope with failures, and more specifically the kind of partial failures, of just one of many servers, that you often see in distributed systems. So we'll be building distributed transactions, and we're worried about how they behave: we want to make sure they're serializable and also have all-or-nothing atomicity even in the face of failures.

The way this looks is that we may have two servers. We've got server 1, and maybe it stores record x in our bank, and we have server 2, and maybe it stores record y. They both start out with value 10, and we need to run these two transactions. Transaction 1, of course, modifies both x and y, so now we need to send messages to the servers saying, oh, please increment x, please decrement y. But it would be easy, if we weren't careful, to get into a situation where we had told server 1 to increase
the balance for x, but then something failed: maybe the client sending the requests, or maybe server 2, which is holding y, fails or something, and we never managed to do the second update. So that's one problem: a failure somewhere may cut the transaction in half and, if we're not careful, cause only half of the transaction to actually take effect.

This can happen even without crashes. If server 1 does its part in the transaction, it could be that over on server 2, server 2 actually gets the request to decrement bank account y, but maybe server 2 discovers this bank account doesn't exist, or maybe it does exist and its balance is already zero and can't be decreased, and so it can't do its part of the transaction. But server 1 has already done its part of the transaction. So that's a problem that needs to be dealt with.

The property we want, as I mentioned before, is that either all the pieces of the system should do their part of the transaction, or none. The thing we violated here is atomicity against crashes or failures, where atomicity means all parts of the transaction that we're trying to execute happen, or none of them.

The kind of solution we're going to be looking at is atomic commit protocols. The general flavor of atomic commit protocols is that you have a bunch of computers, they're all doing different parts of some larger task, and the atomic commit protocol is going to help the computers decide that either they're all capable of doing their part and they're actually going to do it, or something has gone wrong and they're all going to
agree that actually none of them are going to do their part of the overall task. The big challenges are of course how to cope with various failures (machine failures, loss of messages), and it'll turn out that performance is also a little bit difficult to do a good job with.

The specific protocol we're going to look at, the protocol explained in the reading for today, is two-phase commit. This is an atomic commit protocol, and it's used both by distributed databases and by all kinds of other distributed systems that might not at first look like traditional databases. The general setting is that we assume that, in one way or another, the task we need to perform is split up over multiple servers, each of which needs to do some part, a different part for each one of them. For example, in the setup I showed here, it's really the data that's split up, and so the tasks being split up are incrementing x and decrementing y.

We're going to assume that there's one computer that's driving the transaction, called the transaction coordinator. There are lots of ways of arranging how the transaction coordinator steps in, but we'll just imagine it as a computer that is actually running the transaction. So there's one computer, the transaction coordinator, that's executing the code for the transaction, like the puts and the gets and the adds, and it sends messages to the computers that hold the different pieces of data, which need to actually execute the different parts. So for our setup, we're going to have one computer that's the transaction coordinator, and it's going to be server 1 and server 2 that hold x and y. The transaction coordinator will send a message to
server 1 saying oh please increment X 38:36 send a message to server 2 saying oh 38:38 please decrement Y and then there'll be 38:40 more messages in order to make sure that 38:42 either they both do it or neither of them 38:44 does it and that's where two-phase commit 38:46 steps in something to keep in the back of 38:50 your mind is that in the full system 38:52 there may be many different transactions 38:53 running concurrently and many 38:55 transaction coordinators 38:57 sort of executing their own transactions 39:00 and so the various parties here need to 39:03 keep track of oh you know this is a 39:04 message for such-and-such a transaction 39:06 and where they keep state like it 39:09 turns out these servers are going to 39:10 maintain tables of locks for example 39:12 when they keep state like that they need 39:14 to keep track of oh this is a lock 39:15 that's being held for transaction 17 so 39:18 there's a notion of transaction IDs and 39:28 I'm just gonna assume although 39:31 I'm not actually going to show it that every 39:33 message in the system is tagged with the 39:35 unique transaction 39:37 ID of the transaction it applies to and 39:39 these IDs are chosen by the transaction 39:41 coordinator when the transaction starts 39:43 the transaction coordinator will send 39:44 out oh this is a message for transaction 39:47 95 and all the state it keeps here 39:51 about the transaction will be tagged 39:52 with 95 and the various tables in the 39:57 different participants in the 39:59 transaction will be tagged with the 40:01 transaction IDs and so that's another 40:04 piece of terminology we've got the 40:05 transaction coordinator and then the 40:07 other servers that are doing parts of 40:11 the transaction are called participants 40:20 all right 40:21 so let me draw out the two-phase commit 40:24 protocol example execution I'll 40:28 abbreviate this to 2PC for two-phase 40:32 commit the parties involved are the 40:37
transaction coordinator and we'll just 40:40 say there's two participants that is you 40:42 know maybe we're executing the 40:43 transaction I've shown and X and Y 40:44 are on different servers maybe we've got 40:48 participant A and participant B these 40:53 are two different servers holding data 40:57 so the transaction coordinator 40:59 is running the whole transaction it's 41:01 gonna send puts and gets to A and B to 41:03 tell them to you know read the value of 41:06 X or Y or add one to X so we're going to 41:09 see at the beginning of the transaction 41:11 that the transaction coordinator 41:12 is sending for example maybe a get 41:15 request to participant A and it 41:19 gets a reply and then maybe it sends 41:21 a put or whatever we might see a long 41:27 sequence of these if there's a 41:29 complicated transaction then when the 41:33 transaction coordinator gets to the end 41:35 of the transaction and wants to commit 41:38 it and be able to you know release all 41:40 those locks and make the transaction's 41:42 results visible to the outside world and 41:44 maybe reply to a client or a human user 41:47 so we're assuming there's a sort of 41:49 external client or human that said oh 41:52 please run this transaction and it's 41:54 waiting for a response before we can do 41:56 any of that the transaction 41:59 coordinator has to make sure that all 42:02 the different participants can actually 42:04 do their part of the transaction and in 42:07 particular if there were any puts in the 42:08 transaction we need to make sure that 42:11 the participants who are doing those 42:14 puts are actually still capable of 42:16 doing the puts so in order to find that 42:19 out the transaction coordinator sends 42:22 prepare messages to all of the 42:32 participants so we're going to send prepare 42:35 messages to both A and B 42:41 and when A or B receives a prepare 42:44 message you know they know the 42:45 transaction is nearing
completion but 42:47 not over yet 42:49 they look at their state and decide 42:51 whether they are actually able to 42:52 complete the transaction you know maybe 42:54 they needed to abort it to break a deadlock 42:56 or maybe they've crashed and restarted 42:58 between you know when they did the 43:02 last operation and now and they've 43:04 completely forgotten about the 43:05 transaction and can't complete it so A 43:07 and B you know look at their state and 43:08 say oh I'm going to be able to or I'm 43:10 not gonna be able to do this transaction 43:11 and they respond with either yes or no 43:24 so the transaction coordinator is 43:28 waiting for these yes or no votes from 43:31 each of the participants if they all say 43:35 yes then the transaction can commit 43:42 nothing has gone wrong the transaction can 43:45 commit and the transaction coordinator 43:47 sends out a commit message to each of 43:57 the participants and then the 44:02 participants usually reply with an 44:05 acknowledgement saying yes we now know 44:07 the outcome this is called the acknowledge 44:10 all right so the transaction 44:14 coordinator sends out prepares if all the 44:17 participants say yes it can commit if 44:19 any of them even a single one 44:21 says no actually I cannot complete this 44:24 transaction because I had a failure or 44:27 there was an inconsistency like a 44:29 missing record and I have to abort if even 44:32 a single participant says no at this 44:34 point then the transaction coordinator 44:36 won't commit it'll send out a round of 44:38 abort messages saying oops please 44:41 retract this transaction either way 44:47 after the commit sort of two things 44:51 happen of interest to us 44:52 one is the transaction coordinator will 44:54 send whatever the transaction's output is 44:57 to the client or human that requested it 44:59 and say look oh yes the transaction's 45:00 finished and so now if it didn't abort it's 45:03 committed it's durable 45:04
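
As an aside to the transcript, here is a minimal Python sketch of the failure-free exchange just described: prepare, yes or no votes, then commit or abort, with an acknowledgement from each participant. It is an illustration only and not code from the lecture. The names Participant, run_two_phase_commit, transfer, can_commit and pending are hypothetical, everything runs in one process with no network, and locking, logging to disk and crash recovery are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """One server holding part of the data (hypothetical in-memory model)."""
    name: str
    data: dict
    can_commit: bool = True  # set to False to simulate a "no" vote
    pending: dict = field(default_factory=dict)  # txid -> tentative writes

    def put(self, txid, key, value):
        # Tentative write: buffered per transaction ID, not yet visible.
        self.pending.setdefault(txid, {})[key] = value

    def prepare(self, txid):
        # Vote yes only if this participant still knows the transaction
        # and is able to finish its part.
        return self.can_commit and txid in self.pending

    def commit(self, txid):
        # Make the buffered writes visible, then acknowledge.
        self.data.update(self.pending.pop(txid, {}))
        return "ack"

    def abort(self, txid):
        # Discard the buffered writes, then acknowledge.
        self.pending.pop(txid, None)
        return "ack"


def run_two_phase_commit(txid, participants):
    """Coordinator side: collect every vote, then announce one decision."""
    votes = [p.prepare(txid) for p in participants]   # phase 1: prepare
    decision = "commit" if all(votes) else "abort"    # a single "no" aborts
    for p in participants:                            # phase 2: outcome
        if decision == "commit":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision


def transfer(txid, a, b, fail_b=False):
    """Add one to x on participant a and subtract one from y on participant b."""
    a.put(txid, "x", a.data["x"] + 1)
    b.put(txid, "y", b.data["y"] - 1)
    b.can_commit = not fail_b
    return run_two_phase_commit(txid, [a, b])
```

With x and y both starting at 10, calling transfer(95, a, b) commits and leaves x at 11 and y at 9, while a later call with fail_b=True aborts and leaves both values untouched, which is the all-or-nothing outcome the protocol is after.
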
the other interesting thing is that in 45:07 order to obey these locking rules the 45:12 participants unlock when they see either 45:15 a commit or an abort and indeed in order 45:23 to obey the two-phase locking rule each 45:27 participant locked any data that it read 45:32 as part of doing its part of the 45:33 transaction so we're imagining that in 45:35 each participant there's a table of the 45:37 locks associated with the data stored at 45:40 that participant and the participants 45:43 sort of lock things in those tables and 45:44 remember oh this is you know this piece 45:47 of data this record is locked for 45:49 transaction 29 and when finally 45:51 the commit or abort comes back for 45:52 transaction 29 the participant 45:55 unlocks that data and then other 45:57 transactions can use it so we may have to 45:59 wait here and this unlock may unblock 46:02 other transactions that's really part of 46:06 the serializability machinery so you 46:13 know so far the reason why this is 46:14 correct basically is that if 46:19 everybody's following this protocol and 46:20 there's no failures then the two 46:23 participants only commit if both of them 46:25 commit and if either of them can't commit if 46:29 either of them has to abort then they both 46:30 abort so we get the either they all do 46:33 it or none of them do it result that we 46:36 wanted the atomicity result with this 46:40 protocol so far without thinking 46:43 about failures and so now our job is to 46:47 think through in our heads all 46:49 the different kinds of failures that 46:50 might occur and figure out whether the 46:53 protocol still provides atomicity either 46:56 both do it or neither does it in the face 46:59 of these failures and how we have to 47:01 adjust or extend the protocol in order 47:05 to cause it to do the right thing so the 47:07 first thing I want to 47:08 consider is what if B crashes and 47:11 restarts 47:11 I mean a power failure or something B 47:15 just suddenly
stops executing and 47:17 then power is restored and it's brought 47:20 back to life and runs maybe some 47:22 sort of recovery software as part of the 47:26 transaction processing system well 47:28 there's really two scenarios we have to 47:32 worry about one is B might have crashed 47:35 before sending its yes message back 47:41 so if B crashed before sending its yes 47:44 message back then it never said yes so 47:48 the transaction coordinator couldn't 47:50 possibly have committed or be about to 47:53 commit because it has to wait for a yes 47:55 from all participants so if B can 47:57 convince itself that it could not 47:58 possibly have sent a yes back that is it 48:02 crashed before sending the yes then B is 48:04 entitled to unilaterally abort the 48:06 transaction itself and forget about it 48:09 because it knows the transaction 48:11 coordinator can't possibly commit it so 48:18 there's you know a number of ways of 48:19 implementing this one possibility is 48:21 that all of the information about 48:23 transactions that haven't reached this 48:25 point is in memory and it's simply lost if 48:27 B crashes and reboots so B just won't 48:30 know anything about transactions that 48:31 haven't sent yes back yet and 48:35 then if the transaction coordinator 48:37 sends a prepare message to a participant 48:39 that doesn't know anything about the 48:41 transaction because it crashed before 48:42 sending yes the participant will say no 48:45 no I cannot possibly agree to that you 48:47 know please abort 48:51 okay but of course maybe B crashed after 48:55 sending a yes back so that's a little 49:00 more tricky so 49:02 suppose B gets a prepare it's 49:05 happy it says yes I'm going to commit 49:07 and then it crashes before it gets the 49:09 commit message from the transaction 49:12 coordinator well now 49:14 we're in a totally different situation B 49:16 has promised to commit if told to do so 49:19
because it sent a yes back and for all it 49:21 knows and indeed the most likely thing 49:23 that's happening is the transaction 49:24 coordinator got yeses from A and B and 49:26 sent a commit message to A so that A 49:28 actually will do its part of the 49:31 transaction and make it permanent and 49:32 release locks and in that case in order 49:35 to honor all or nothing we're absolutely 49:37 required if B should crash at this point 49:39 that on recovery B be still 49:42 prepared to complete its part of the 49:44 transaction 49:45 it doesn't actually know at that point 49:46 because it hasn't 49:48 received the commit yet whether 49:50 it should commit or not but it must 49:51 still be prepared to commit and what 49:53 that means the fact that we can't lose 49:57 the state for a transaction across 49:59 crashes and reboots 50:01 is that before B replies to a prepare it 50:07 must make the transaction state this 50:13 sort of intermediate transaction state 50:14 the memory of all of the changes it's 50:16 made which may have to be undone if 50:17 there's an abort plus the record of all 50:20 the locks the transaction holds 50:22 it must make that durable on disk 50:26 it's almost always in a log on 50:28 disk so before B replies yes before B 50:33 sends the yes in reply to a prepare 50:35 message it first must write to disk in 50:39 its log all the information required to 50:42 commit that transaction that is all the 50:44 new values produced by puts plus a full 50:48 list of locks on the disk or some other 50:51 persistent memory before replying with 50:53 yes and then if 50:55 B should crash after sending the yes then as 50:58 part of recovery when it restarts it'll 50:59 look at its log and say oh gosh I 51:01 was in the middle of a transaction I had 51:03 replied yes for transaction 92 51:06 you know here's all the modifications it 51:07 should make if committed and all the 51:09
locks it held 51:10 I better restore that state and then 51:13 when it finally gets a commit or an 51:15 abort it'll know from having read its 51:17 log how to actually finish its part of 51:20 the transaction so this is an 51:23 important thing I left out of the 51:24 original laying out of this protocol 51:29 that B must write to its disk at this 51:32 point 51:34 and this is part of what makes two-phase 51:36 commit a little bit slow is that there's 51:39 this necessary persisting of 51:41 information here okay so we also have to 51:47 worry about okay and you know the final 51:50 place I guess where you might crash is 51:51 B might crash after 51:54 receiving the commit or 51:58 it might crash after actually 52:00 processing the commit but in that 52:02 case it's made the modifications that the 52:06 transaction means to make permanent in 52:08 its database presumably also on disk 52:12 after it received the commit 52:15 message and in that case there's maybe 52:16 not anything to do if it restarts 52:18 because the transaction is finished so 52:20 when B receives the commit message it 52:23 probably copies the 52:28 modifications from its log on to its 52:29 permanent storage releases its locks 52:32 erases the information about the 52:34 transaction from its log and then 52:36 replies and of course we have to worry 52:38 about you know what if it receives a 52:40 commit message twice one 52:43 thing to do is for B to remember 52:45 about the transaction but that takes memory 52:48 so it turns out that if B simply forgets 52:51 about committed transactions that it's 52:53 made durable on disk it can reply to a 52:56 repeated commit message if it doesn't 52:58 know anything about that transaction by 53:00 simply acknowledging it again and 53:03 that'll be important a little bit 53:04 later on ok so that's the story if one 53:08 of the participants crashes at various 53:10 awkward
points what about the 53:12 transaction coordinator it's also just a 53:14 single computer so you know if it 53:16 fails that might be a problem okay so again 53:26 where things start getting 53:29 critical is if any party might have 53:32 committed then we cannot forget about 53:36 that if either of these participants 53:39 might have committed or if the 53:41 transaction coordinator might have 53:43 replied to the client then we cannot 53:47 have that transaction go away right if A 53:50 has committed but maybe the transaction 53:52 coordinator sent out a commit 53:54 message to A but hadn't gotten around to 53:56 sending a commit to B and crashes 53:58 at that point the transaction 53:59 coordinator must be prepared on restart 54:02 to resend the commit messages to make 54:05 sure that both parties know that the 54:08 transaction is committed so okay so you 54:14 know whether that matters depends on 54:16 where the transaction coordinator 54:17 crashes if it crashes before sending 54:20 commit messages it doesn't really matter 54:21 you know since the 54:24 transaction coordinator didn't send 54:26 commit messages before crashing it can 54:29 just abort the transaction and if either 54:33 participant asks about that transaction 54:35 because they you know see it's in their 54:36 log but they never got a commit message 54:38 the transaction coordinator can say I 54:40 don't know anything about that 54:41 transaction it must have been aborted 54:43 possibly due to a crash so that's what 54:46 happens if the transaction coordinator 54:47 crashes before the commit but if it 54:50 crashes after sending one or more 54:52 commit messages then the 54:59 transaction coordinator can't be allowed to 55:02 forget about the transaction and what 55:05 that means is that at this point 55:08 after the transaction coordinator 55:09 has made its commit versus abort 55:11 decision on the basis of these yes/no 55:13 votes
before sending out any commit 55:16 messages it must first write information 55:20 about the transaction to its log in 55:22 persistent storage like a disk that will 55:26 still be there if it crashes and 55:27 restarts so the transaction coordinator 55:30 after it receives a full set of yeses or 55:32 noes writes the outcome and the 55:35 transaction ID to its log on disk and 55:38 only then starts to send out commit 55:40 messages and that way if it crashes at 55:41 any point maybe before it's sent the first 55:45 commit message or after it's sent one or 55:47 maybe even after it's sent all of them if it 55:49 crashes at that point its recovery software 55:51 will see in the log aha I was in the 55:53 middle of a transaction the transaction 55:55 was either known to have been committed 55:57 or aborted 55:59 and as part of recovery it will resend 56:01 commit messages to all the participants 56:04 or abort messages whatever the decision 56:06 was in case it hadn't sent them before 56:10 it crashed and that's one reason why the 56:12 participants have to be prepared to 56:14 receive duplicated commit messages okay 56:27 so those are the 56:31 main crash stories we also have to worry 56:34 about what happens if messages are lost 56:35 in the network you might send a message 56:37 and maybe the message never got there you 56:39 might send a message and be waiting for 56:40 a reply maybe the reply was sent but the 56:44 reply was dropped so any one of these 56:45 messages may be dropped and we need to 56:47 think through what to actually do in 56:52 each of these cases so for example 56:56 supposing the transaction coordinator 56:57 sent out prepare messages but hasn't 57:00 gotten some of the yes or no replies 57:02 from participants what are the 57:04 transaction coordinator's options at that 57:06 point well one thing it could do is send 57:08 out a new set of prepare messages saying 57:11 you know I didn't get your answer please 57:13 tell me your answer yes
or no and you 57:15 know it could keep on doing that for a 57:17 while but if one of the participants is 57:20 down for a long time we don't want to 57:21 sit there waiting with locks held right 57:24 because you know supposing A is 57:27 unresponsive but B is up because 57:30 we haven't committed or aborted B 57:31 is still holding locks and that may 57:33 cause other transactions to be waiting 57:35 so we don't want to wait forever if we 57:37 can possibly avoid it so if the 57:39 transaction coordinator hasn't gotten 57:41 yes or no responses after some amount of 57:43 time from the participants then it can 57:47 simply unilaterally decide we're gonna 57:49 abort this transaction because it knows 57:52 since it didn't get a full set of yes or 57:54 no messages that of course it can't 57:55 possibly have sent a commit yet so no 57:57 participant could have committed so it's 58:00 always valid to abort if the transaction 58:03 coordinator hasn't yet committed so if the 58:05 transaction coordinator times out 58:07 waiting for yeses or nos because messages 58:09 were lost or somebody crashed or 58:11 something 58:12 it can just decide alright we're 58:13 aborting this transaction and send out 58:15 a round of abort messages and if some 58:17 participant comes back to life and says 58:19 oh you know I didn't hear back from you 58:21 about transaction 95 the transaction 58:25 coordinator will say oh well I don't 58:26 know anything about transaction 95 58:28 because it aborted it and erased its 58:30 state for that transaction and it will 58:32 tell the participant you know you should 58:35 abort this transaction too similarly if 58:42 one of the participants times out 58:44 waiting for the prepare here then you 58:47 know if a participant hasn't received a 58:49 prepare that means it hasn't sent a yes 58:51 message back and that means the 58:53 coordinator can't possibly have sent any 58:54 commit messages 58:55 so if a participant times out here 58:58 waiting
for the prepare it's also 58:59 always allowed to just bail out and 59:03 decide to abort the transaction and if 59:05 at some future time the transaction 59:07 coordinator comes back to life and sends 59:09 out prepare messages then B will say no 59:11 I don't know anything about that 59:12 transaction so I'm voting no and that's 59:15 okay because it can't possibly have 59:16 started to commit anywhere so 59:19 again if something goes wrong with the 59:21 network or the transaction coordinator 59:22 is down for a while 59:24 and the participants are still waiting 59:26 for prepares it's always valid for 59:29 participants to abort and thereby 59:31 release the locks that other 59:32 transactions may be waiting for and that 59:34 can be very important in a busy system 59:39 so that's the good news about what happens if 59:44 the participants or the transaction 59:45 coordinator time out waiting for 59:47 messages from the other parties however 59:52 suppose participant B has received a 59:56 prepare and sent its yes and so is 60:00 somewhere around here but it hasn't 60:01 received a commit and it's waiting and 60:03 waiting and it hasn't gotten the commit 60:05 back maybe something's wrong with the 60:06 network maybe the transaction 60:08 coordinator's network connection 60:10 has fallen out or its power has failed or 60:13 something but for whatever reason B has 60:14 waited a long time and it still hasn't 60:15 heard a commit now it's sitting 60:18 there still holding on 60:19 to those locks for all the records that 60:21 were used in its part of the 60:22 transaction and that means other 60:24 transactions may also be 60:25 blocked waiting for those locks to be 60:27 released so we're like pretty eager to 60:30 abort if we possibly can and release the 60:32 locks and so the question is if B has 60:35 received a prepare and replied with yes 60:37 is it entitled to unilaterally abort 60:40 after it's waited say you know 10 60:42
seconds or 10 minutes or something to 60:45 get the commit message and the answer to 60:48 that unfortunately is no in this region 60:54 after receiving the prepare or 60:56 really after sending the yes and before 60:58 getting the commit if you time out 61:01 waiting for the commit you're not 61:06 allowed to abort you must keep waiting 61:08 this is usually called blocking so in this 61:12 region of the protocol if you don't 61:14 receive the commit you have to wait 61:15 indefinitely and the reason is that 61:17 since B sent back a yes that means the 61:21 transaction coordinator may have 61:22 received the yes it may have received 61:24 yes from all of the participants and it 61:26 may have started sending out commit 61:28 messages to some of the participants and 61:30 that means that A may have actually seen 61:32 the commit message and committed and 61:34 made its changes permanent and unlocked 61:35 and shown the changes to other 61:37 transactions and since that could be the 61:39 case for all B knows in this region of 61:42 the protocol B cannot unilaterally 61:44 decide to abort if it times out it must 61:47 wait indefinitely to hear from the 61:49 transaction coordinator as long as it 61:51 takes some human may have to come and 61:54 repair the transaction coordinator and 61:56 finally get it started again and have it 61:57 read its log and see oh yes I 62:00 committed that transaction and finally 62:02 send long delayed commit messages 62:13 similarly on a timeout if you 62:23 can't unilaterally abort it turns out 62:25 you can't unilaterally commit either 62:27 because for all B knows A might have 62:29 voted no but B just hasn't got the 62:31 abort message yet so in 62:33 this region you can neither abort nor 62:35 commit 62:36 on a timeout and so 62:44 this blocking behavior is sort of a 62:47 critical property of two-phase commit 62:51 and it's not a happy property 62:53 it means if things
go wrong you can 62:56 easily be in the situation where you 62:58 have to wait for a long time with locks 62:59 held and holding up other transactions 63:01 and so among other things people try 63:05 really hard to make this part of 63:08 two-phase commit as fast as humanly 63:10 possible so that the window of time in 63:13 which a failure might cause you to block 63:17 with locks held for a long time is as 63:20 small as possible so they try to make 63:22 this part of the protocol very 63:23 lightweight or even have variants of the 63:26 protocol that for certain special cases 63:27 may not have to wait at all okay so 63:33 that's the basic protocol one thing to 63:37 notice about this that is a fundamental 63:41 part of why we're able to 63:44 actually build a protocol that allows A 63:46 and B to sort of both you know 63:49 commit or both abort one 63:53 reason for that is that really the 63:54 decision is made by a single entity it's 63:56 made by the transaction coordinator 63:58 alone A and B you 64:01 know except that they can vote no neither A 64:05 nor B is deciding whether to commit or 64:09 not and they certainly are not engaged 64:11 in a conversation with each other to try 64:13 to reach agreement about what is the 64:15 other thinking are they thinking commit 64:17 maybe I'll commit too instead we have 64:19 this quite sort of fundamentally 64:22 simple protocol in which only the 64:25 transaction coordinator makes the 64:27 decision a single entity and it just 64:29 tells the other parties here's my decision 64:31 please go do it the penalty for that for 64:38 having the transaction coordinator 64:39 the single entity make the final 64:42 decision again is the fact that you have 64:45 to block there's some points in which 64:46 you have to block waiting for the 64:47 transaction coordinator to 64:49 tell you what the decision 64:50 was one further question is that we know 64:58
the transaction coordinator must 64:59 remember information about transactions 65:02 in its log in case it crashes and so 65:05 one question is when the transaction 65:06 coordinator can forget about information 65:10 in its log about transactions and the 65:11 answer to that is that if it manages to 65:14 get a full set of acknowledgments from 65:16 the participants then it knows that all 65:18 the participants know that that 65:19 transaction committed or aborted that 65:22 all the participants 65:24 know the fate of that transaction and 65:25 have done their part in it and will 65:27 never need to know that information 65:29 right since they both acknowledged it so 65:31 when the transaction coordinator gets 65:33 acknowledgements it can erase all 65:35 information all memory of the transaction 65:39 similarly participants once they've 65:42 received a commit or abort message and 65:44 done their part of the transaction and 65:46 made their updates permanent and 65:48 released their locks at that point the 65:50 participants also can completely forget 65:53 about that transaction after they send 65:57 their acknowledgment back to the 65:59 transaction coordinator now of course 66:01 the transaction coordinator may not get 66:03 their acknowledgement and 66:05 may therefore decide to resend the 66:07 commit message on the theory that maybe 66:09 it was lost and in that case a 66:11 participant if it receives a commit 66:13 message for a transaction which it knows 66:14 nothing about because it's forgotten 66:16 about it then the participant can just 66:21 send another acknowledgement back 66:22 because it knows that if it gets a commit 66:25 message for an unknown transaction it 66:27 must be because it had forgotten about 66:28 it because it already knew whether it 66:30 committed or aborted okay so that's 66:37 two-phase commit for atomic commitment 66:41 for a little perspective two-phase 66:44 commit is used in a lot of sharded 66:47
databases that have split up their data 66:50 among multiple servers and it's used 66:54 specifically in databases or storage 66:58 systems that need to support 67:00 transactions in which 67:03 multiple 67:03 records may be read or written there's a 67:06 lot of more specialized storage 67:09 systems that don't allow you to have 67:12 transactions on multiple records and for 67:15 them you don't need this 67:17 kind of you don't need two-phase commit 67:18 if the storage system doesn't allow 67:22 multi-record transactions but if you 67:24 have multi-record transactions and you 67:26 shard the data across multiple servers 67:28 then you need to support 67:30 two-phase 67:31 commit if you want to get ACID 67:34 transactions 67:36 however two-phase commit has an evil 67:39 reputation one reason is it's slow due 67:43 to multiple rounds of messages there's a 67:45 lot of chitchat here in order to get a 67:48 transaction that involves multiple 67:50 participants to finish there's in 67:53 addition a lot of disk writes both A and 67:55 B have to not just write data to their 67:58 disk between the prepare and the sending 68:01 of the yes they have to wait for that 68:02 disk write to finish so certainly if 68:04 you're using a mechanical drive that 68:06 takes 10 milliseconds to append to the 68:09 log that puts a real serious limit on 68:11 how fast participants can process 68:14 transactions you know 10 milliseconds a 68:16 pop means you know without some cleverness 68:19 you're limited to 100 transactions per 68:21 second which is pretty slow and in 68:23 addition the transaction coordinator 68:25 also has a point in which 68:28 after it receives the last yes it must first 68:30 write to its log make sure the data is 68:33 safe on disk and only then is it 68:35 allowed to send out commit messages and 68:38 that's another 10 milliseconds and both 68:41 of these are 10 millisecond
periods in 68:43 which locks are held in the participants 68:45 and other transactions are slowed up and 68:47 I keep mentioning that but it's very 68:48 important because in a busy transaction 68:51 processing system there's lots and lots 68:53 of transactions and many of them may be 68:55 waiting for the same data and we'd 68:57 really prefer not to hold locks over 69:01 long periods of time in which there's 69:02 lots of messages going back and forth 69:04 and we have to wait for long disk writes 69:06 but two-phase commit forces us to do 69:09 those waits 69:13 and a further problem with it is that if 69:16 anything goes wrong messages are lost 69:18 something crashes then 69:21 if you're a little bit unlucky the 69:23 participants have to wait for long times 69:25 with locks held 69:26 so therefore two-phase commit you really 69:30 only see it within relatively small 69:32 domains within a single machine room 69:34 within a single organization you don't 69:36 see it for example used to do transfers 69:39 between different banks 69:42 you might possibly see it within a bank 69:44 if it's sharded its database but you 69:47 would never see two-phase commit run 69:48 between distinct organizations that were 69:52 maybe physically separate because of 69:53 this blocking business you don't want to 69:56 put the fate of you know your database 69:58 and whether it's operational in the 70:00 hands of some other organization where 70:02 if they crash at the wrong time 70:04 your database is forced to hold 70:07 locks for a long time and because it's 70:12 so slow also a lot of 70:15 research has gone into either making it 70:19 fast or relaxing the rules in various 70:21 ways to allow it to be faster or 70:24 specializing two-phase commit for very 70:27 specific situations in which you know 70:31 you can shave a message or a write to the 70:33 disk or something off it because you 70:34 know you're only supporting a certain
70:36 limited kind of transaction so 70:39 we'll see a fair amount of this in the 70:40 rest of the course one question that 70:45 comes up a lot this exchange here where 70:51 you have a leader essentially and it 70:53 sends these messages to the followers 70:56 and you know 71:00 the leader can only proceed if it 71:02 receives you know acknowledgments 71:04 replies from enough of the followers 71:07 this looks a lot like Raft this 71:11 construction looks a lot like Raft 71:13 however the properties of the protocol 71:17 and what you get out of it turn out to 71:18 be quite different from what we get out 71:20 of Raft they solve very different 71:24 problems 71:25 so the way to think about it is that you 71:28 use Raft to get high availability by 71:31 replicating data on multiple 71:34 participants on multiple peers that is 71:37 the point of Raft is to be able to 71:39 operate even though some of the servers 71:42 involved have crashed or are not 71:44 reachable and 71:47 Raft can do this because all the servers 71:49 are doing the same thing 71:51 so we don't need all of 71:53 them to participate we only need a 71:55 majority in two-phase commit however the 72:00 participants are not at all doing the 72:02 same thing the participants are each 72:04 doing a different part of the 72:05 transaction you know A may be 72:07 incrementing record X and B may be 72:10 decrementing record Y so in two-phase 72:13 commit all the participants 72:17 have to do their part in order 72:20 for the transaction to finish you really 72:22 need to wait for every single one of the 72:24 participants to do their thing so okay 72:31 so you know Raft is replicating and 72:34 doesn't need everybody to do their thing 72:35 in two-phase commit 72:37 everybody's doing something different 72:39 that has to get done two-phase commit 72:42 does not help at all with availability 72:44 you know
Raft is all about availability: 72:46 you can go on even if some of the 72:48 participants are not responding. 72:50 Two-phase commit is actually not at all 72:54 highly available. 72:56 If anything goes wrong, we risk 72:58 having to wait until that's repaired. If 73:00 the transaction coordinator crashes at 73:02 the wrong time, we simply have to wait 73:03 for it to come up, read its log, and send 73:05 out the commit messages. If one 73:08 of these participants crashes 73:11 at the wrong time, if we're 73:12 lucky we simply have to abort; if we're 73:15 not lucky we have to keep asking, "did you finish 73:16 that? did you finish that?" So two-phase 73:19 commit is not at all about high 73:21 availability. In fact it's 73:23 quite low availability as such things go: 73:25 any crash can hold up the whole system. 73:28 And of course Raft doesn't ensure that 73:33 all the participants do whatever the 73:36 operation is; it only requires a majority. 73:38 There may be a 73:39 minority that didn't do the 73:40 operation at all. The fact 73:42 that in Raft all the participants do the 73:44 same thing, so we don't have to wait for all 73:46 of them, is why Raft gets high 73:47 availability. So these are quite 73:51 different protocols. It is, however, 73:55 possible to usefully combine them. 73:58 Two-phase commit is really 74:01 vulnerable to failures: it's correct with 74:04 failures, but it's not available with 74:06 them. So the question is, could you 74:08 build some sort of combined system that 74:12 has the high availability of Raft 74:14 replication but has two-phase commit's 74:19 ability to cause various different 74:21 parties each to do their part of the 74:23 transaction? The construction you 74:25 want is to use Raft or Paxos or 74:27 some other protocol like that to 74:31 individually replicate each of the 74:33 different parties. So then, for 74:37 this setup,
we would have three 74:39 different clusters. The transaction 74:41 coordinator would actually be a replicated 74:43 service with three servers, 74:50 and we'd run Raft on these three 74:53 servers. One would be elected as leader, 74:54 they'd have replicated state, they'd have 74:56 a log that helped them replicate, and we'd 74:58 only have to wait for a majority: 75:00 we'd only have to have a 75:02 majority of these to be up in order for 75:04 the transaction coordinator to do its 75:06 work. And of course they would all 75:08 execute through the 75:10 various stages of the transaction and 75:12 the two-phase commit protocol, 75:16 basically by appending relevant records 75:19 to their logs. Then each of the 75:21 participants would also be a 75:25 Raft-replicated cluster. 75:40 They would 75:43 exchange messages back and forth: 75:46 we'd send a commit message from the 75:49 replicated transaction coordinator 75:51 service to the replicated A server and 75:53 the replicated B server. 75:58 This is admittedly somewhat 75:59 elaborate, but it does show you that you 76:01 can combine these ideas to get 76:03 high availability, because 76:05 any one of these servers can crash and 76:07 the remaining two keep operating, 76:09 plus we get this atomic commitment, where 76:12 A and B are doing completely different 76:14 parts of the same transaction and we can 76:17 use two-phase commit to have the 76:19 transaction coordinator ensure that 76:21 either both commit the whole 76:22 thing or they both abort their parts of 76:25 the transaction. You'll actually build 76:30 something very much like this as part of 76:33 Lab 4, in which you will build a 76:35 sharded database where each shard is 76:37 replicated in this form, and there's 76:40 basically a configuration manager which 76:42 will allow essentially
transactional 76:45 shifting of chunks of shards of data 76:48 from one Raft cluster to another, under 76:52 the control of something that looks a 76:55 lot like a transaction coordinator. So 77:00 Lab 4 is like this, and in addition, in a 77:05 little bit we'll be reading a paper 77:07 called Spanner, which describes a 77:08 real-life database used by Google that 77:11 also uses this construction in 77:14 order to do transactional writes to a 77:16 database. All right, thank you.
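[Editor's sketch] The combined construction described above, in which the transaction coordinator and each participant are separate Raft-replicated clusters, can be summarized as an availability check. This is a rough model under stated assumptions, not the Lab 4 or Spanner design: the `ReplicaGroup` class, its fields, and `can_run_two_phase_commit` are hypothetical names introduced only for illustration, and three replicas per group is taken from the lecture's example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaGroup:
    """One party (coordinator, A, or B) replicated with a Raft-like protocol."""
    name: str
    replicas_up: int
    replicas_total: int = 3  # the lecture's example uses three servers per cluster

    def available(self) -> bool:
        # A replicated group keeps working while a majority of its servers are up.
        return self.replicas_up > self.replicas_total // 2


def can_run_two_phase_commit(coordinator: ReplicaGroup,
                             participants: List[ReplicaGroup]) -> bool:
    # Two-phase commit still needs every party, but each party is now a group
    # that survives the loss of a minority of its own servers.
    return coordinator.available() and all(p.available() for p in participants)


tc = ReplicaGroup("TC", replicas_up=3)
a = ReplicaGroup("A", replicas_up=2)   # one server of A has crashed
b = ReplicaGroup("B", replicas_up=2)   # one server of B has crashed
print(can_run_two_phase_commit(tc, [a, b]))
```

In this model a single crash inside any cluster no longer blocks the transaction, whereas losing a majority of any one cluster still does.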