Transcript

Maybe we should get started. It's been a long time since we've all been in the same place; I hope everybody's doing well. Today I'd like to talk about Spanner. The reason to talk about this paper is that it's a rare example of a system that provides distributed transactions over data that's widely separated, that is, data that might be scattered over different data centers all over the Internet. That's almost never done in production systems. Of course it's extremely desirable to have transactions, since programmers really like them, and it's also extremely desirable to have data spread all over the network, both for fault tolerance and to ensure that there's a copy of the data near everybody who wants to use it.

On the way to achieving this, Spanner uses at least two neat ideas. One is that they run two-phase commit, but they run it over Paxos-replicated participants, in order to avoid the problem in two-phase commit that a crashed coordinator can block everyone. The other interesting idea is that they use synchronized time in order to have very efficient read-only transactions. The system has been very successful: it's used by many different services inside Google, it's been turned by Google into a product for their cloud customers, and it has inspired a bunch of other research and other systems, both by showing that wide-area transactions are possible and, more specifically, because there's at least one open-source system, CockroachDB, that explicitly uses a lot of the design.

The motivating use case, the reason the paper says they first started designing Spanner, was that they already had many big database systems at Google, but in their advertising system in particular the data was sharded over many distinct MySQL and BigTable databases, and maintaining that sharding was an awkward, manual, and time-consuming process. In addition, their previous advertising database didn't allow transactions that spanned more than a single server. They really wanted to be able to spread their data out more widely for better performance, and to have transactions over the multiple shards of the data.

For their advertising database, the workload was apparently dominated by read-only transactions. You can see this in Table 6, where there are billions of read-only transactions and only millions of read-write transactions, so they were very interested in the performance of transactions that only do reads. Apparently they also required strong consistency for transactions: they wanted serializable transactions, and they also wanted external consistency, which means that if one transaction commits, and then after it finishes committing another transaction starts, the second transaction needs to see any modifications done by the first. This external consistency turns out to be interesting with replicated data.
All right, let me describe the basic physical arrangement of servers that Spanner uses. Its servers are spread over data centers, presumably all over the world, certainly all over the United States, and each piece of data is replicated at multiple data centers. So the picture has multiple data centers; let's say there are three, though really there would be many more.

Within each data center the data is sharded: you can think of it as being broken up by key and split over many servers. Maybe there's one server that serves keys starting with A in this data center, another serving keys starting with B, and so forth; lots of sharding over lots of servers. Every shard is replicated at more than one data center, so there's going to be another replica of the A keys and the B keys and so on in the second data center, and yet another, hopefully identical, copy of all this data at the third data center. In addition, each data center has multiple clients of Spanner, and what these clients really are is web servers: if an ordinary human being sitting in front of a web browser connects to some Google service that uses Spanner, they'll connect to some web server in one of the data centers, and that web server acts as one of these Spanner clients.

The replication is managed by Paxos, in fact a variant of Paxos that has leaders and is really very much like the Raft we're all familiar with. Each Paxos instance manages all the replicas of a given shard of the data: all the copies of one shard form one Paxos group, all the replicas of another shard form another Paxos group, and each of these Paxos instances is independent, has its own leader, and runs its own instance of the Paxos protocol. The reason for the sharding, and for the independent Paxos instance per shard, is parallel speed-up and a lot of parallel throughput: there's a vast number of clients working on behalf of web browsers, so there's typically a huge number of concurrent requests, and it pays immensely to split them up over multiple shards and multiple Paxos groups running in parallel.

Each of these Paxos groups has a leader, a lot like Raft. Maybe the leader for one shard is a replica in data center 1, and the leader for another shard is a replica in data center 2, and so forth. That means that if a client needs to do a write, it has to send that write to the leader of the Paxos group for the shard whose data it needs to write. Just as with Raft, what these Paxos instances are really doing is replicating a log: the leader replicates a log of operations, which for this data is going to be reads and writes, to all the followers, and the followers execute that log, all in the same order.
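To make that layout a little more concrete, here is a minimal sketch in Go of how a client-side library might route a key to the Paxos group responsible for its shard. All the names here (PaxosGroup, route, the data center addresses) are hypothetical, invented for illustration; the lecture and paper don't describe a client API at this level.

```go
package main

import "fmt"

// PaxosGroup stands in for one shard's replica set: a few servers in
// different data centers, one of which is the current leader.
type PaxosGroup struct {
	Shard    string   // e.g. "keys starting with A"
	Replicas []string // one replica address per data center
	Leader   string   // writes must be sent to this replica
}

// route maps a key to its shard's Paxos group by first byte, a
// stand-in for whatever placement scheme the real system uses.
func route(key string, groups map[byte]PaxosGroup) PaxosGroup {
	return groups[key[0]]
}

func main() {
	groups := map[byte]PaxosGroup{
		'a': {Shard: "a*", Replicas: []string{"dc1:a", "dc2:a", "dc3:a"}, Leader: "dc2:a"},
		'b': {Shard: "b*", Replicas: []string{"dc1:b", "dc2:b", "dc3:b"}, Leader: "dc1:b"},
	}
	g := route("apple", groups)
	// A write to "apple" has to go to g.Leader; a read-only
	// transaction may be able to use whichever replica is local.
	fmt.Println(g.Leader)
}
```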
So, the reasons for this setup. The sharding, as I mentioned, is for throughput. The multiple copies in different data centers are there for two reasons. One is that you want copies in different data centers in case one data center fails: maybe power fails to the entire city the data center is in, or there's an earthquake or a fire, and you'd like other copies at other data centers that are probably not going to fail at the same time. There's a price to pay for that, because now the Paxos protocol may have to talk over long distances to followers in different data centers. The other reason to have data in multiple data centers is that it may allow you to have copies of the data near all the different clients that use it. If a piece of data may be read in both California and New York, it's nice to have one copy in California and one in New York so that reads can be very fast, and indeed a lot of the focus of the paper is making reads from the nearest replica both fast and correct. Finally, another interesting interaction between Paxos and multiple data centers is that Paxos, like Raft, only requires a majority in order to replicate a log entry and proceed, so if one data center is slow or distant or flaky, the Paxos system can keep chugging along and accepting new requests.

With this arrangement there are a couple of big challenges the paper has to bite off. One is that they really want to do reads from local data centers, but because Paxos only requires each log entry to be replicated on a majority, a minority of the replicas may be lagging and may not have seen the latest data committed by Paxos. That means that if we allow clients to read from local replicas for speed, they may be reading out-of-date data if their replica happens to be in the minority that didn't see the latest updates. Since they require external consistency, so that every read sees the most up-to-date data, they have to have some way of dealing with the possibility that local replicas may be lagging. The other issue is that a transaction may involve multiple shards and therefore multiple Paxos groups: a single transaction may read or write multiple records that are stored in multiple shards, so we need distributed transactions.

I'm going to explain how the transactions work; that's going to be the focus of the lecture. Spanner actually implements read-write transactions quite differently from read-only transactions, so let me start with read-write transactions, which have a much more conventional design.
First, read-write transactions. Let me remind you what a transaction looks like; let's choose a simple one that mimics a bank transfer. On one of those client machines, a client of Spanner, you'd run some transaction code. The code says: begin a transaction; then read and write some records, say increment a bank balance stored in database record x and decrement y's bank balance; then end the transaction, and the client hopes the database will go off and commit it.

I want to trace through all the steps that have to happen for Spanner to execute this read-write transaction. First of all, there's a client in one of the data centers that's driving the transaction. Let's imagine that x and y are on different shards, since that's the interesting case, and that each of the two shards is replicated in three different data centers. So at each data center there's a server holding a replica of the shard with x's balance and a server holding a replica of the shard with y's.

Spanner uses two-phase commit and two-phase locking almost exactly as described in last week's reading from the 6.033 textbook. The huge difference is that instead of the participants and the transaction coordinator being individual computers, the participants and the coordinator are Paxos-replicated groups of servers, for increased fault tolerance. The three replicas of the shard that stores x are really a Paxos group, and the same goes for the three replicas storing y, and for each group one of the three servers is the leader. Let's say the server in data center 2 is the Paxos leader for x's shard, and the server in data center 1 is the Paxos leader for y's shard.

The first thing that happens is that the client picks a unique transaction ID, which is carried on all of these messages so that the system knows which operations belong to which transaction. Despite the way the code looks, where it reads and writes x and then reads and writes y, the transaction code has to be organized so that it does all its reads first and then, at the very end, does all the writes essentially as part of the commit. To do its reads the client has to acquire locks: just as in last week's 6.033 reading, every time you read or write a data item, the server responsible for it has to associate a lock with that item, and the read locks in Spanner are maintained only at the Paxos leader.
So when the client transaction wants to read x, it sends a read-x request to the leader of x's shard, and that leader returns the current value of x and sets a lock on x. Of course, if the lock is already set, the leader won't respond to the client until whatever transaction currently has the data locked releases the lock by committing; then the leader sends the value of x back to the client. When the client needs to read y it gets lucky this time: assuming the client is in data center 1, y's leader is in the local data center, so this read is going to be a lot faster. That read sets the lock on y at the Paxos leader and returns y's value.

Now the client has done all its reads; it does its internal computations and figures out the writes it wants to do, the values it wants to write to x and y. It sends out the updated values for the records it wants to write all at once, toward the end of the transaction. The first thing it does is choose one of the Paxos groups to act as the transaction coordinator; it chooses this in advance and sends out the identity of the chosen coordinator group. Let's assume it chooses y's Paxos group; I've drawn a double box to say that this server is not only the leader of its Paxos group but also acting as the transaction coordinator for this transaction.

Then the client sends out the updated values it wants written. It sends a write-x request to x's leader with the new value and the identity of the transaction coordinator. When the Paxos leader for each written value receives the write request, it sends a prepare message to its followers and gets that record into its Paxos log. By logging the prepare it is promising to be able to carry out this transaction, for example promising that it hasn't crashed and lost its locks. When it gets responses from a majority of its followers, that Paxos leader sends a yes vote to the transaction coordinator, saying: yes, I promise to be able to carry out my part of the transaction. Meanwhile the client also sends the value to be written to y to y's Paxos leader, and that server, acting as Paxos leader, sends prepare messages to its followers, logs the prepare through Paxos, waits for acknowledgments from a majority, and then, you can think of it as the Paxos leader sending the transaction coordinator, which is on the same machine and maybe in the same program, a yes vote saying: yes, I can commit.

When the transaction coordinator gets responses from the leaders of all the shards whose data is involved in this transaction, and they all said yes, the transaction coordinator can commit; otherwise it can't. Let's assume it decides to commit.
At that point the transaction coordinator sends out to its own Paxos followers a commit record, saying: please remember permanently in the log that we're committing this transaction. It also tells the leaders of the other Paxos groups involved in the transaction that they can commit as well, and those leaders then send commit messages to their followers. The transaction coordinator probably doesn't send the commit message to the other shards until the commit is safe in its own log, so that the coordinator is guaranteed not to forget its decision. Once these commit messages are committed into the Paxos logs of the different shards, each shard can actually execute the writes, that is, install the written data and release the locks on the data items so that other transactions can use them, and then the transaction is over. Please feel free to ask questions by raising your hand.

There are some points to observe about the design so far, which only covers the read-write side of transactions. One is that it's the locking that ensures serializability: if two transactions conflict because they use the same data, one has to completely wait for the other to release its locks before it can proceed. So Spanner is using completely standard two-phase locking to get serializability, and completely standard two-phase commit to get distributed transactions. Two-phase commit is widely hated, because if the transaction coordinator fails or becomes unreachable, any transactions it was managing block indefinitely, with locks held, until the coordinator comes back up; people have generally been very reluctant to use two-phase commit in the real world because it's blocking. Spanner solves this by replicating the transaction coordinator: the coordinator itself is a Paxos-replicated state machine, so everything it does, for example recording whether it has committed, is replicated into the Paxos log. If the leader that was managing the transaction fails, either of the other two replicas can spring to life, take over leadership, and also take over being the transaction coordinator; if the coordinator decided to commit, the new leader will see the commit record in its log and can immediately tell the other two-phase-commit participants that the transaction committed. This effectively eliminates the problem of two-phase commit blocking with locks held after a failure, which is a really big deal, because that problem otherwise makes two-phase commit basically unacceptable for any large-scale system with a lot of parts that might fail.
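To recap the whole commit path in one place, here is a minimal sketch in Go of two-phase commit where both the coordinator and the participants are Paxos groups rather than single machines. The type and function names (PaxosGroup, Participant, Commit) are all hypothetical, and the real protocol carries timestamps, a lock table, and error handling that are omitted here; this is only meant to show where the prepare and commit records get logged.

```go
package twopc

// Write is one pending update held under a lock by a participant.
type Write struct {
	Key, Value string
}

// PaxosGroup hides the replication: Append only returns once the record
// is committed on a majority, so it survives a leader failure.
type PaxosGroup interface {
	Append(record string) // replicate a log record via Paxos
	Apply(w []Write)      // install values, release this txn's locks
}

// Participant is the leader of one shard's Paxos group.
type Participant struct {
	Group  PaxosGroup
	Writes []Write
}

// Prepare logs a prepare record: a durable promise that this group
// holds the locks and can carry out its part of the transaction.
func (p *Participant) Prepare(txnID string) bool {
	p.Group.Append("PREPARE " + txnID)
	return true // vote yes; a real participant could vote no, e.g. after a crash
}

// Commit is run by the coordinator group's leader.
func Commit(txnID string, coord PaxosGroup, parts []*Participant) bool {
	// Phase 1: every participant must durably promise to commit.
	for _, p := range parts {
		if !p.Prepare(txnID) {
			coord.Append("ABORT " + txnID)
			return false
		}
	}
	// The decision itself is Paxos-replicated, so a coordinator leader
	// crash cannot leave the outcome unknown; this is what removes the
	// classic two-phase-commit blocking problem.
	coord.Append("COMMIT " + txnID)
	// Phase 2: each participant logs the outcome, applies its writes,
	// and releases the locks.
	for _, p := range parts {
		p.Group.Append("COMMIT " + txnID)
		p.Group.Apply(p.Writes)
	}
	return true
}
```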
Another thing to note is that there's a huge number of messages in this diagram, and many of them cross data centers. Some of the messages that go between the shards, or between a client and a shard whose leader is in another data center, may take many milliseconds, and in a world in which computations take nanoseconds this is a pretty grim expense. Indeed you can see that in Table 6, which describes the performance of a Spanner deployment where the different replicas are on different sides of the United States, east coast and west coast: it takes about a hundred milliseconds to complete a transaction whose replicas are on different coasts. That's a huge amount of time, a tenth of a second. It's maybe not quite as bad as it seems, because the system is sharded and can run a lot of non-conflicting transactions in parallel, so the throughput may be very high, but the delay for individual transactions is very significant. A hundred milliseconds is maybe somewhat less than a human will notice, but if you have to do a couple of them just to generate a web page or carry out a human instruction, it starts to be a noticeable, bothersome amount of time. On the other hand, I suspect that for many uses of Spanner all the replicas are in the same city, sort of across town from each other, and then the much faster times in Table 3 are relevant: Table 3 shows that transactions complete in about 14 milliseconds instead of 100 when the data centers are nearby.

Nevertheless, these read-write transactions are slow enough that we'd like to avoid the expense if we possibly can, and that takes us to read-only transactions. It turns out that if you know in advance that all of the operations in a transaction are guaranteed to be reads, then Spanner has a much faster, much more streamlined, much less message-intensive scheme for executing them.

So, read-only transactions; this starts a new topic. The way read-only transactions work, although they rely on some information from read-write transactions, is a design quite different from the read-write transactions. Spanner's read-only transaction design eliminates two big costs that were present in read-write transactions. First of all, as I mentioned, it reads from local replicas: as long as there's a replica of the data the transaction needs in the local data center, it can read from that local replica, which may take a small fraction of a millisecond to talk to instead of maybe dozens of milliseconds to go cross-country. But again, a danger here is that any given replica may not be up to date, so there has to be a story for that.
The other big savings in the read-only design is that it doesn't use locks or two-phase commit, so it doesn't need a transaction coordinator, and that avoids things like inter-data-center messages to Paxos leaders. Because no locks are taken out, not only are the read-only transactions themselves faster, but they also avoid slowing down read-write transactions, which don't have to wait for locks held by read-only transactions. Just to preview why this is important: Tables 3 and 6 show about a ten-times latency improvement for read-only transactions compared to read-write transactions. So the read-only design is a factor-of-ten boost in latency, with much less complexity, and almost certainly far more throughput as well. The big challenge is going to be that read-only transactions don't do a lot of the things that were required for read-write transactions to get serializability, so we need to find a way to square this increased efficiency with correctness.

There are really two main correctness constraints they wanted read-only transactions to obey. The first is that, like all transactions, they still need to be serializable. Just to review, that means that even though the system may execute transactions concurrently, in parallel, the results that a bunch of concurrent transactions yield, both the values they return to clients and their modifications to the database, must be the same as some one-at-a-time, serial execution of those transactions. For a read-only transaction, what that essentially means is that all of its reads must effectively fit neatly between the writes of the transactions we view as coming before it, and must not see any of the writes of the transactions we view as coming after it. So we need a way to fit all the reads of a read-only transaction neatly between read-write transactions.

The other big constraint the paper talks about is external consistency, which is actually equivalent to the linearizability we've seen before. What it really means is that if one transaction finishes committing, and another transaction starts after the first transaction completed in real time, then the second transaction is required to see the writes done by the first. Another way of putting that is that transactions, even read-only transactions, should not see stale data: if there's a committed write from a transaction that completed before the start of the read-only transaction, the read-only transaction is required to see that write. Neither of these is particularly surprising; standard databases like MySQL, for example, can be configured to provide this kind of consistency. In a way, if you didn't know better, this is exactly the consistency you would expect of a straightforward system.
It also makes programmers' lives much easier: it makes it much easier to produce correct answers, because if you don't have this kind of consistency, the programmers are responsible for programming around whatever anomalies the database may produce. So this is sort of the gold standard of correctness.

Okay, so I want to talk about how read-only transactions work. It's a bit of a complex story, so I think what I'd like to do first is consider what would happen if we did the absolutely simplest thing and had read-only transactions do nothing special to achieve consistency, but just read the very latest copy of the data. Every time a read-only transaction does a read, we could just have it look at the local replicas and find the current, most up-to-date copy of the data. That would be very straightforward and very low overhead, so we need to understand why it doesn't work. So: why not just read the latest value? Let's imagine the read-only transaction simply reads x and y and prints them.

I want to show you an example of a situation in which having this transaction simply read the latest values yields incorrect, non-serializable results. Suppose we have three transactions running: T1, T2, T3. T3 is going to be our read-only transaction; T1 and T2 are read-write transactions. Let's say that T1 writes x, writes y, and then commits; maybe it's a bank transfer operation, transferring money from x to y, and T3 prints x and y because we're doing an audit of the bank to make sure it hasn't lost money. Transaction T2 also does another transfer between balances x and y and then commits. Now our transaction T3 needs to read x and y. The way I'm drawing these diagrams, real time, wall-clock time, the time you'd see on your watch, moves to the right. Let's say T3's read of x happens after T1 completes and before T2 starts, and that T3 is running on a slow computer, so it only manages to issue its read of y much later, after T2 has committed. The way this is going to play out is that T3 will see the x value that T1 wrote, but the y value that T2 wrote, assuming it uses this dubious procedure of simply reading the latest value in the database.

This is not serializable. We know that any equivalent serial order must have T1 followed by T2, so there are only two places T3 could go. T3 can't fit between T1 and T2, because if T3 were second in the equivalent serial order, it shouldn't see writes by T2, which comes after it: it should see the value of y produced by T1, but it doesn't, it sees the value produced by T2.
The only other place available to T3 is after T2. That serial order would produce the same value for y that T3 actually saw, but if that were the order, T3 should also have seen the value of x written by T2, and it actually saw the value written by T1. So this execution is not equivalent to any one-at-a-time serial order: there's something broken about simply reading the latest value, and we know that approach doesn't work. What we're really looking for, of course, is that our read-only transaction either reads both values as of the earlier point in time, or reads both values as of the later point in time.

The approach Spanner takes is somewhat complex. The first big idea is an existing one: it's called snapshot isolation. The way I'm going to describe it, let's imagine that all the computers involved have synchronized clocks, that is, they all have a clock that yields wall-clock time, like "it's 1:43 in the afternoon on April 7th, 2020". Assume, even though it isn't true, that all the computers involved have synchronized clocks. Furthermore, let's imagine that every transaction is assigned a particular timestamp, and the timestamps are wall-clock times taken from these synchronized clocks. For a read-write transaction, its timestamp is, in this simplified design, the real time at which the transaction coordinator starts the commit; for a read-only transaction, the timestamp is equal to its start time. Our snapshot isolation system is designed to execute so as to get the same results as if all the transactions had executed one at a time in timestamp order: we assign each transaction a timestamp, and then we arrange the execution so that the transactions get the results they would have gotten had they executed in that order. Given the timestamps, we need an implementation that easily honors them and basically shows each transaction the data as it existed at its timestamp.

The way this works for read-only transactions is that each replica, when it stores data, actually keeps multiple versions: we have a multi-version database. If a record has been written a couple of times, there's a separate copy of that record for each time it was written, each tagged with the timestamp of the transaction that wrote it. The basic strategy is that a read-only transaction allocates itself a timestamp when it starts, so it accompanies each read request with that timestamp, and whichever server stores the replica of the data the transaction needs looks into its multi-version database and finds the version of the requested record with the highest timestamp that is still no greater than the timestamp specified by the read-only transaction. That means the read-only transaction sees the data as of its chosen timestamp.
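Here is a minimal sketch, in Go, of that multi-version storage idea: each key maps to a list of timestamped versions, and a read at timestamp T returns the version with the largest timestamp that is at or below T. The names (Store, Version, Put, ReadAt) are invented for illustration; a real implementation would also garbage-collect old versions and apply the "safe time" check that comes up later in the lecture.

```go
package mvcc

// Version is one timestamped copy of a record.
type Version struct {
	Timestamp int64 // commit timestamp of the writing transaction
	Value     string
}

// Store keeps every version of every key rather than overwriting.
type Store struct {
	versions map[string][]Version // per key, roughly in write order
}

func NewStore() *Store {
	return &Store{versions: make(map[string][]Version)}
}

// Put appends a new version instead of replacing the old one.
func (s *Store) Put(key string, ts int64, value string) {
	s.versions[key] = append(s.versions[key], Version{Timestamp: ts, Value: value})
}

// ReadAt returns the value with the highest timestamp that is no
// greater than the reading transaction's timestamp ts.
func (s *Store) ReadAt(key string, ts int64) (string, bool) {
	var best *Version
	for i := range s.versions[key] {
		v := &s.versions[key][i]
		if v.Timestamp <= ts && (best == nil || v.Timestamp > best.Timestamp) {
			best = v
		}
	}
	if best == nil {
		return "", false
	}
	return best.Value, true
}
```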
This snapshot isolation idea is what Spanner uses for read-only transactions. Read-write transactions still use two-phase locking and two-phase commit; they allocate themselves a timestamp at commit time, but other than that they work in the usual way with locks and two-phase commit. Read-only transactions, by contrast, access the multiple versions in the database and get the version written by the transaction with the highest timestamp that's still below their own. Where this gets us is that read-only transactions will see all the writes of read-write transactions with lower timestamps, and none of the writes of read-write transactions with higher timestamps.

So how does snapshot isolation work out for the example we had before, where serializability failed because the reading transaction read values that did not fit between any two read-write transactions? Here's the same example, but with snapshot isolation; I'm showing you this to demonstrate that the snapshot isolation technique solves our problem and makes the read-only transaction serializable. Again we have the two read-write transactions T1 and T2, and our read-only transaction T3. T1 and T2 write and commit as before, but now they allocate themselves timestamps as of their commit times: in addition to using two-phase commit and two-phase locking, each read-write transaction picks a timestamp. Let's imagine that at the time of its commit, T1 looked at the clock and saw that the time was 10; I'm going to use times like 10 and 20, but you should imagine them as real times, like four o'clock in the morning on a given day. So say T1 sees the time as 10 when it commits, and T2 sees the time as 20 when it commits; I'll write each transaction's chosen timestamp after an @ sign. When transaction T1 does its writes, the Spanner storage system doesn't overwrite the current values; it just adds a new copy of each record tagged with the timestamp. So the database stores a new record saying the value of x at time 10 is whatever it happens to be, let's say 9, and the value of record y at time 10 is, say, 11; maybe we're doing a transfer from x to y. Similarly, T2 chose a timestamp of 20, because that was the real time when it committed.
The database is going to remember a new set of records in addition to the old ones: x at time 20, where maybe we did another transfer from x to y, and y at time 20 equals 12. So now we have two copies of each record, at different times. Now transaction T3 comes along: again it starts at about the same point as before and does its read of x, and again it's slow, so it doesn't get around to reading y until much later in real time. However, when T3 started, it chose a timestamp by looking at the current time, and since we know that in real time T3 started after T1 and before T2, it must have chosen a timestamp somewhere between 10 and 20; let's suppose it started at time 15 and chose timestamp 15 for itself.

That means that when it does the read of x, it sends a request to the local replica that holds x and accompanies it with its timestamp of 15: please give me the latest data as of time 15. Of course transaction T2 hasn't executed yet, but in any case the highest-timestamped copy of x below 15 is the one from time 10 written by T1, so this read returns 9. Time passes, transaction T2 commits, and now T3 does its second read, again accompanying the read request with its own timestamp of 15. The server now has two records for y, but because it gets T3's timestamp of 15, it looks at its records and says: aha, 15 sits between these two, so I'm going to return the record for y with the highest timestamp that's still below the requested timestamp, and that's still the version of y from time 10. So the read of y returns 11. That is, the read of y physically happens late, but because we remembered a timestamp, and the database keeps data as of the different times it was written, it's as if both reads happened at time 15, instead of one at time 15 and one later. And now you can see that this essentially emulates a serial, one-at-a-time execution in which the order is timestamp order: transaction T1, then transaction T3, then transaction T2. The serial order whose results were actually produced is the timestamp order 10, 15, 20.

All right, so that's a simplified version of what Spanner does for read-only transactions; there's more complexity, which I'll get to in a minute.
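If you plug the numbers from this example into the multi-version store sketched earlier (again, Store, Put, and ReadAt are invented names, and the x-at-time-20 value of 8 is made up since the lecture doesn't give one), you get the same answers the walkthrough describes:

```go
package mvcc

import "fmt"

// demo replays the timestamps from the example above: T1 commits at 10,
// T2 at 20, and the read-only T3 reads at timestamp 15, so both of T3's
// reads return T1's versions and neither sees T2's.
func demo() {
	s := NewStore()
	s.Put("x", 10, "9")  // written by T1 @ 10
	s.Put("y", 10, "11") // written by T1 @ 10
	s.Put("x", 20, "8")  // written by T2 @ 20 (value made up for the sketch)
	s.Put("y", 20, "12") // written by T2 @ 20

	x, _ := s.ReadAt("x", 15) // T3 @ 15 -> "9"
	y, _ := s.ReadAt("y", 15) // T3 @ 15 -> "11"
	fmt.Println(x, y)
}
```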
One question you might have is why it was okay for transaction T3 to read an old value of y. It issued its read of y at a point in time when the freshest data for y was the value 12, but the value it actually got was intentionally a stale value, not the freshest one: the value 11 from a while ago. Why is it okay not to use the freshest version of the data? The technical justification is that transaction T2 and transaction T3 are concurrent, that is, they overlap in time. The rules for linearizability and external consistency say that if two transactions are concurrent, then the serial order the database is allowed to use can put the two transactions in either order, and here Spanner has chosen to put T3 before T2 in the serial order.

Robert, we have a student question: does external consistency with timestamps always imply strong consistency? Yes, I think so. What people usually mean by strong consistency is linearizability, and I believe the definitions of linearizability and external consistency are the same, so I would say yes. Another question: how does this not absolutely blow up storage? That is a great question, and the answer is that it definitely costs storage. The storage system now has to keep multiple copies of records that have been recently modified multiple times, and that's definitely an expense, both in space on disk and in memory, plus an added layer of bookkeeping: lookups now have to consider timestamps as well as keys. The storage expense, I think, is not as great as it could be, because the system discards old records. The paper doesn't say what the policy is, but presumably it must be discarding old versions. Certainly, if the only reason for the multiple versions is to implement snapshot isolation for these kinds of transactions, then you don't need to remember values too far in the past: you only need to keep values back to the earliest time at which a transaction that's still running now could have started. If your transactions always finish, or are forced to finish by being killed, within, say, one minute, then you only have to remember the last minute of versions. In fact, the paper implies they keep data farther back than that, because they intentionally support snapshot reads, which allow seeing data from a while ago, from yesterday or something; but they don't say what the garbage collection policy for old values is, so I don't know how expensive it is for them.

Okay, so again, the justification for why this is legal: the only rule that external consistency imposes is that if one transaction has completed, then a transaction that starts after it must see its writes. T1 completed, let's say, at a certain time, and T3 started just after it, so external consistency demands that T3 see T1's writes. But since T2 definitely didn't finish before T3 started, we have no obligation under external consistency for T3 to see T2's writes, and indeed in this example it does not, so it's actually legal.

Another problem that comes up is that transaction T3 needs to read data as of a particular timestamp.
The reason reading at a timestamp is desirable is that it allows us to read from a local replica in the same data center, but maybe that local replica is in the minority of Paxos followers that didn't see the latest log records from the leader. Maybe our local replica has never even seen these writes to x and y at all; it's still back at a version from time five or six or seven. If we don't do something clever, then when we ask it for the highest version below timestamp 15, we may get some much older version that's not actually the one produced by transaction T1, which we're required to see.

The way Spanner deals with this is with a notion of safe time. Each replica is receiving log records from its Paxos leader, and the paper arranges things so that the leader sends out log records in strictly increasing timestamp order. So a replica can look at the very last log record it has received from its leader to know how up to date it is. If I ask a replica for a value as of timestamp 15, but the replica has only received log entries from its Paxos leader up through timestamp 13, the replica is going to make us wait: it won't answer until it has received a log record with timestamp at least 15 from the leader. This ensures that replicas don't answer a request for a given timestamp until they're guaranteed to know everything from the leader up through that timestamp. So this may delay reads.
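Here is a minimal sketch, in Go, of the safe-time check a replica might make before serving a snapshot read. The names (Replica, Advance, WaitSafe) are invented, and a real server would block on a condition variable rather than polling; this only shows the rule that a read at timestamp ts must wait until the replica has seen the leader's log up through ts.

```go
package safetime

import (
	"sync"
	"time"
)

// Replica tracks the timestamp of the latest log record received from
// its Paxos leader. Because the leader sends records in increasing
// timestamp order, everything at or below lastApplied is known here.
type Replica struct {
	mu          sync.Mutex
	lastApplied int64
}

// Advance is called as log records arrive from the leader.
func (r *Replica) Advance(ts int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ts > r.lastApplied {
		r.lastApplied = ts
	}
}

// WaitSafe delays a snapshot read at timestamp ts until the replica has
// seen log records up through ts, so it cannot return stale data.
func (r *Replica) WaitSafe(ts int64) {
	for {
		r.mu.Lock()
		ok := r.lastApplied >= ts
		r.mu.Unlock()
		if ok {
			return
		}
		time.Sleep(time.Millisecond) // polling keeps the sketch short
	}
}
```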
So, the next question. I've been assuming in this discussion that the clocks on all the different servers are perfectly synchronized, so that everybody's clock says, say, 10:01 and 30 seconds at exactly the same moment. It turns out that you can't synchronize clocks that precisely; it's basically impossible to get perfectly synchronized clocks, and the reasons are reasonably fundamental. So the topic is time synchronization, which is about making sure different clocks read the same real-time value. The fundamental problem is that time is defined as, basically, whatever a collection of highly accurate, expensive clocks in a set of government laboratories says. We can't read those clocks directly; what happens is that the government laboratories broadcast the time in various ways, the broadcasts take time, and so it's some possibly unknown time later that we hear these announcements of what the time is, and we may all hear them at different moments due to varying delays.

But first I want to consider what the impact on snapshot isolation is if the clocks are not synchronized, which they won't be. There's actually no problem at all for Spanner's read-write transactions, because the read-write transactions use locks and two-phase commit; they're not relying on snapshot isolation, so they don't care: read-write transactions will still be serialized by the two-phase locking mechanism. So we're only interested in what happens to a read-only transaction. Suppose a read-only transaction chooses a timestamp that is too large, that is, in the future: it's now 12:01 p.m. and it chooses a timestamp of, say, 1:00 p.m. That's actually not that bad. What it means is that when it sends a read request to some replica, the replica will say: wait a minute, your timestamp is far greater than the last log entry I've seen from my Paxos leader, so I'm going to make you wait until the log entries from the Paxos leader catch up to the time you've requested, and only then respond. So this is correct, but slow; the reader will be forced to wait. That's not the worst thing in the world. But what happens if a read-only transaction's timestamp is too small? This would correspond to its clock either being set wrong, so that it reads a time in the past, or having been set correctly originally but ticking too slowly. This obviously causes a correctness problem: it's a violation of external consistency, because you'll hand the multi-version database a timestamp that's far in the past, say an hour ago, and the database will give you the value associated with that old timestamp, which may ignore more recent writes. So assigning a transaction a timestamp that's too small will cause it to miss recent committed writes, and that's a violation of external consistency. So we actually have a problem here: the assumption that the clocks were synchronized is a very serious assumption, and the fact that you cannot count on it means that unless we do something, the system is going to be incorrect.

All right, so can we synchronize clocks perfectly? That would be the ideal thing, and if not, why not? As I mentioned, time comes from what is essentially the median of a collection of clocks in government labs. The way we hear about the time is that it's broadcast by various protocols, sometimes radio protocols. Basically, what GPS is doing for Spanner is acting as a radio broadcast system that broadcasts the current time from a government lab, through the GPS satellites, to GPS receivers sitting in Google's machine rooms. There are a number of other radio protocols, like WWVB, an older radio protocol for broadcasting the current time, and there are newer protocols like NTP, which operates over the Internet and is also in the business of broadcasting time. So the system diagram is that there are some government labs, and the government labs with their accurate clocks define a universal notion of time called UTC. So we have UTC coming from some clocks in some labs.
Then we have some radio or Internet broadcast; in the case of Spanner we can think of the government labs as broadcasting to the GPS satellites, and the satellites in turn broadcast to the millions of GPS receivers that are out there. You can buy a GPS receiver for a couple hundred bucks that will decode the timestamps in the GPS signals and keep you up to date with exactly what the time is, corrected for the propagation delay between the government labs and the GPS satellites, and also corrected for the delay between the GPS satellites and your current position. Then, in each data center, there's a GPS receiver connected up to what the paper calls a time master, which is some server; there will be more than one of these per data center in case one fails. And then there are the hundreds of servers in the data center running Spanner, either as servers or as clients; each of them periodically sends a request saying "what time is it?" to one, or usually more than one, of the local time masters, and the time master replies with: oh, I think the current time as received from GPS is such-and-such.

Built into this, unfortunately, is a certain amount of uncertainty, and there are a few primary sources of it. There's fundamental uncertainty in that we don't know exactly how far we are from the GPS satellites: radio signals take some amount of time to travel, so even though the GPS satellite knows exactly what time it is, those signals take some time to reach our GPS receiver, and we're not sure exactly how long. That means that when we get a radio message from a GPS satellite saying "exactly 12 o'clock", the uncertainty in the propagation delay means we're not really sure whether it's 12 o'clock, or a little before, or a little after. In addition, every time the time is communicated there's added uncertainty you have to account for. The biggest source is that when a server sends a request to the time master, it only gets a response after a while. If the response says "it's exactly 12 o'clock", but, say, a second passed between when the server sent the request and when it got the response, then all the server knows, even if the master had the correct time, is that the time is within a second of 12 o'clock: maybe the request was instant and the reply was delayed by a second, or maybe the request was delayed by a second and the reply was instant. So all you really know is that it's somewhere between 12:00:00 and 12:00:01. So there's always this uncertainty, and we really can't ignore it.
The uncertainties we're talking about here are milliseconds, and we're going to find out that the uncertainty in the time goes directly into how long these safe waits have to be, and how long some other pauses have to be, the commit wait, as we'll see. So uncertainty at the level of milliseconds is a serious problem. The other big source of uncertainty is that each of these servers only requests the current time from a time master every once in a while, say every minute, and in between, each server runs its own local clock that keeps time starting from the last answer from the master. Those local clocks are actually pretty bad and can drift by milliseconds between times that the server talks to the master, so the system has to add the unknown but estimated drift of the local clock to the uncertainty of the time.

In order to capture this uncertainty and account for it, Spanner uses the TrueTime scheme, in which, when you ask what time it is, what you actually get back is one of these TTinterval things, which is a pair of an earliest time and a latest time. Earliest is the earliest the true time could possibly be, and latest is the latest it could possibly be. So when the application makes the library call that asks for the time, it gets back this pair, and all it knows is that the current time is somewhere between earliest and latest. Earliest might be 12 o'clock and latest might be 12:00:01; the only guarantee is that the correct time isn't less than earliest and isn't greater than latest, and we don't know where in between it is.
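Here is a minimal sketch, in Go, of what the TrueTime interface looks like from a caller's point of view. The TTInterval pair mirrors what the lecture describes; the way Now() computes its bounds, from the uncertainty at the last time-master sync plus an assumed local drift rate, is my own guess at the flavor of the calculation, not Spanner's actual formula.

```go
package truetime

import "time"

// TTInterval is what a TrueTime query returns: the true wall-clock time
// is guaranteed to lie somewhere between Earliest and Latest.
type TTInterval struct {
	Earliest time.Time
	Latest   time.Time
}

// Clock is a guess at the state a server might keep between syncs with
// a time master; the fields and the drift model are illustrative only.
type Clock struct {
	lastSync       time.Time     // when we last heard from a time master
	syncUncert     time.Duration // uncertainty at that moment (e.g. half the request RTT)
	driftPerSecond time.Duration // assumed worst-case drift of the local clock
}

// Now returns an interval whose width grows the longer it has been since
// the last time-master sync, to cover possible local clock drift.
func (c *Clock) Now() TTInterval {
	local := time.Now()
	elapsed := local.Sub(c.lastSync).Seconds()
	eps := c.syncUncert + time.Duration(float64(c.driftPerSecond)*elapsed)
	return TTInterval{Earliest: local.Add(-eps), Latest: local.Add(eps)}
}
```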
67:16 Okay, so the start rule says this is how Spanner chooses timestamps. The commit wait rule, which applies only to read/write transactions, says that when a transaction coordinator has collected the votes, sees that it's able to commit, and chooses a timestamp, then after it chooses that timestamp it's required to delay, to wait a certain amount of time, before it's allowed to actually commit, write the values, and release the locks. A read/write transaction has to delay until the timestamp it chose, when it was starting to commit, is less than the current time's earliest value.

68:13 So what's going on here is that the coordinator sits in a loop calling TT.now(), and it stays in that loop until the timestamp it chose at the beginning of the commit process is less than the earliest half of the current time. What this guarantees is that, since the earliest possible correct time is now greater than the transaction's timestamp, by the time this loop finishes, by the time the commit wait is finished, the transaction's timestamp is absolutely guaranteed to be in the past.
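The commit wait itself is just a loop around the TrueTime call. Here is a minimal sketch continuing the hypothetical code above (same package); commitWait is my name for it, and a real system would block more cleverly than polling.

```go
// commitWait blocks until ts is guaranteed to be in the past, that is,
// until even the earliest possible true time is later than ts. Only after
// this returns may a read/write transaction apply its writes and release
// its locks.
func commitWait(ts time.Time) {
	for !TTNow().Earliest.After(ts) {
		time.Sleep(time.Millisecond) // poll again; a real system would wait more cleverly
	}
	// ts is now definitely in the past: safe to commit and release locks.
}
```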
68:49 Okay, so how does the system actually make use of these two rules in order to enforce external consistency for read-only transactions? I want to cook up a somewhat simplified scenario to illustrate this. I'm going to imagine that the writing transactions only do one write each, just to reduce the complexity. Let's say there are two read/write transactions, T0 and T1, which both write x, and a T2 which is going to read x. T2 is going to use snapshot isolation on timestamps, and we want to make sure it sees the latest written value.

69:48 So we're going to imagine that T0 writes 1 to x and then commits, and then T1 also writes x, writing the value 2. We need to distinguish between prepare and commit: it's really at prepare time that a transaction chooses its timestamp, so there's a point at which it chooses a timestamp, and it commits some time later. And we're assuming that T2 starts after T1 finishes, so it's going to read x afterwards, and we want to make sure it sees 2.

70:34 All right, so let's suppose that T0 chooses a timestamp of 1, commits, and writes the database. Let's say T1 then starts, and at the time it chooses its timestamp it doesn't get a single number from the TrueTime system; it really gets a range, an earliest and a latest value. Let's say at the time it chooses its timestamp, the earliest value it gets is 1 and the latest is 10. The start rule says it must choose 10, the latest value, as its timestamp, so T1 is going to commit with timestamp 10.

71:24 Now it can't commit yet, because the commit wait rule says it has to wait until its timestamp is guaranteed to be in the past. So transaction T1 is going to sit there and keep asking what time it is, until it gets an interval back that doesn't include time 10. At some point it's going to ask what time it is and get back an interval whose earliest value is 11 and whose latest is, let's say, 20, and now it can say: aha, my timestamp is guaranteed to be in the past, and I can commit. So T1 actually sits there and waits for a while, this is its commit wait period, before it commits.

72:07 Okay, now after T1 commits, transaction T2 comes along and wants to read x. It's going to choose a timestamp too. We're assuming that it starts after T1 finishes, because that's the interesting scenario for external consistency. So when it asks for the time, it asks at a time after time 11, and it's going to get back an interval that includes time 11. Let's suppose it gets back an interval that goes from time 10, the earliest, to time 12, the latest. And of course the latest, 12, must be at least time 11, because transaction T2 started after transaction T1 finished; that means 11 must be less than the latest value. Transaction T2 is going to choose this latest half as its timestamp, so it actually chooses timestamp 12.

73:09 In this example, when it does its read, it's going to ask the storage system to read as of timestamp 12. Since transaction T1 wrote with timestamp 10, that means that, assuming the safe time machinery works, we're actually going to read the correct value.

73:33 So this happened to work out, but indeed it's guaranteed to work out as long as transaction T2 starts after transaction T1 commits, and the reason is that commit wait causes transaction T1 not to finish committing until its timestamp is guaranteed to be in the past. Transaction T1 chooses a timestamp and is guaranteed to commit after that timestamp. Transaction T2 starts after that commit. We don't know anything about what its earliest value will be, but its latest value is guaranteed to be after the current time, and we know that the current time is after the commit time of T1. Therefore T2's latest value, the timestamp it chooses, is guaranteed to be after when T1 committed, and therefore after the timestamp that T1 used. So if transaction T2 starts after T1 finishes, transaction T2 is guaranteed to get a higher timestamp, and the snapshot isolation machinery, the multiple versions, will cause its read to see the writes from all lower-timestamped transactions. That means T2 is going to see T1's write, and that is basically how Spanner enforces external consistency for its transactions.

75:04 Any questions about this machinery? All right.
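To make the multi-version side of this concrete, here is a toy Go sketch of reading at a timestamp, replaying the scenario above. This is my own illustration, not Spanner's code: each key keeps timestamped versions, and a read at timestamp t returns the newest version whose timestamp is at most t, so T2's read at 12 sees T1's write at 10.

```go
package main

import "fmt"

// version is one timestamped value of a key in a toy multi-version store.
type version struct {
	ts    int // commit timestamp of the writing transaction
	value int
}

// store maps each key to its versions, kept in increasing timestamp order.
type store map[string][]version

// write records a new version of key committed at timestamp ts.
func (s store) write(key string, ts, value int) {
	s[key] = append(s[key], version{ts, value})
}

// readAt returns the newest version of key with timestamp <= ts,
// which is what a snapshot read at timestamp ts should see.
func (s store) readAt(key string, ts int) (int, bool) {
	val, found := 0, false
	for _, v := range s[key] {
		if v.ts <= ts {
			val, found = v.value, true
		}
	}
	return val, found
}

func main() {
	db := store{}
	db.write("x", 1, 1)  // T0 commits x=1 at timestamp 1
	db.write("x", 10, 2) // T1 commits x=2 at timestamp 10, after its commit wait

	// T2 starts after T1 finishes, so its chosen timestamp (12 here)
	// must exceed T1's; reading at 12 therefore sees T1's write.
	if v, ok := db.readAt("x", 12); ok {
		fmt.Println("T2 reads x =", v) // prints: T2 reads x = 2
	}
}
```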
75:18 I'm going to step back a little bit. There are really, from my point of view, two big things going on here. One is snapshot isolation by itself: keeping the multiple versions and giving every transaction a timestamp. Snapshot isolation is enough to give you serializable read-only transactions, because basically what snapshot isolation means is that we use these timestamps as the equivalent serial order, and things like the safe time ensure that read-only transactions really do read as of their timestamps: they see every read/write transaction before that timestamp and none after it.

76:04 So there are really two pieces. Snapshot isolation by itself, which is actually often used, not just by Spanner, doesn't by itself guarantee external consistency, because in a distributed system it's different computers choosing the timestamps, so we can't be sure those timestamps will obey external consistency even if they deliver serializability. So in addition to snapshot isolation, Spanner also has synchronized timestamps, and it's the synchronized timestamps plus the commit wait rule that allow Spanner to guarantee external consistency as well as serializability.

76:43 And again, the reason all this is interesting is that programmers really like transactions, and they really like external consistency, because those make applications much easier to write, but they traditionally have not been provided in distributed settings because they're too slow. So the fact that Spanner manages to make read-only transactions very fast is extremely attractive: no locking, no two-phase commit, and not even any distant reads for read-only transactions; they operate very efficiently from the local replicas. That is what's good for basically a factor-of-ten latency improvement, as measured in Tables 3 and 6.

77:29 But just to remind you, it's not all fabulous. All this wonderful machinery really only applies to read-only transactions. Read/write transactions still use two-phase commit and locks, and there are a number of cases in which even Spanner will have to block, for example due to the safe time and the commit wait. But as long as the clocks are accurate enough, these commit waits are likely to be relatively small.

77:59 Okay, just to summarize: Spanner at the time was kind of a breakthrough, because it was very rare to see deployed systems that run distributed transactions over data spread across geographically distant data centers. People were surprised that somebody had a database that actually did a good job of this and that the performance was tolerable. The snapshot isolation and the TrueTime timestamps are probably the most interesting aspects of the paper. And that is all I have to say for today. Any last questions? Okay.

78:49 I think on Thursday we're going to see FaRM, which is a very different slice through the desire to provide very high performance transactions. So I'll see you on Thursday.