I'd like to talk today about FaRM and optimistic concurrency control, which is the main interesting technique it uses. The reason we're talking about FaRM is that it's the last paper in our series about transactions, replication, and sharding, and this is still an open research area: people are not at all satisfied with the performance, or with the performance-versus-consistency trade-offs that are available, and they're still trying to do better. This particular paper is motivated by the huge performance potential of new RDMA NICs.

You may be wondering, since we just read about Spanner, how FaRM differs from Spanner. Both of them replicate, and both use two-phase commit for transactions, so at that level they seem pretty similar. Spanner is a deployed system that's been used a lot for a long time. Its main focus is geographic replication, that is, being able to have copies on, say, the east and west coasts in different data centers, and to run reasonably efficient transactions that involve pieces of data in lots of different places. Its most innovative feature, intended to reduce the cost of two-phase commit over long distances, is a special optimized path for read-only transactions using synchronized time. The performance you get out of Spanner, if you remember, is that a read/write transaction takes 10 to 100 milliseconds, depending on how far apart the data centers are.

FaRM makes a very different set of design decisions and targets a different kind of workload. First of all, it's a research prototype, not by any means a finished product, and its goal is to explore the potential of this new high-speed RDMA networking hardware, so it's really still an exploratory system. It assumes that all the replicas are in the same data center; the design wouldn't make sense if the replicas were in different data centers, let alone on the East Coast versus the West Coast. So it's not trying to solve the problem Spanner is about, namely: if an entire data center goes down, can I still get at my data? The extent of its fault tolerance is individual crashes, plus the ability to recover after a whole data center loses power and is then restored. It uses this RDMA technique, which I'll talk about, and RDMA turns out to seriously restrict the design options; because of it, FaRM is forced to use optimistic concurrency control. On the other hand, the performance they get is far, far higher than Spanner's: FaRM can do a simple transaction in 58 microseconds (this is from Figure 7 and Section 6.3). That 58 microseconds, versus the roughly 10 milliseconds that Spanner takes, is about a hundred times faster. So that's maybe the main difference: FaRM gets much higher performance, but it's not aimed at geographic replication. FaRM's performance is extremely impressive, much faster than just about anything else.
Another way to look at it is that Spanner and FaRM target different bottlenecks. In Spanner, the main bottleneck the designers worried about is speed-of-light and network delays between data centers, whereas in FaRM the main bottleneck the design worries about is CPU time on the servers, because they've essentially wished away the speed-of-light and network delays by putting all the replicas in the same data center.

All right, so as background for how this fits into the 6.824 sequence: the setup in FaRM is that everything runs in one data center. There's a configuration manager, which we've seen before, and it's in charge of deciding which servers should be the primary and the backup for each shard of data. If you read carefully you'll see that they use ZooKeeper to help implement this configuration manager, but that's not the focus of the paper at all. Instead, the interesting thing is that the data is sharded, split up by key across a bunch of primary/backup pairs: one shard goes on primary 1 and backup 1, another shard on primary 2 and backup 2, and so forth. That means any time you update data, you need to update it both on the primary and on the backup. These replicas are not maintained by Paxos or anything like it; instead, all the replicas of the data are updated whenever there's a change, and reads always go to the primary. The reason for this replication, of course, is fault tolerance, and the kind of fault tolerance they get is that as long as one replica of a given shard is available, that shard will be available. They only require one living replica, not a majority, and the system as a whole, if there's, say, a data-center-wide power failure, can recover as long as at least one replica of every shard survives. Another way of putting that is that with F+1 replicas they can tolerate up to F failures for that shard.

In addition to the primary/backup copies of each shard of data, there's transaction code that runs. It's maybe most convenient to think of the transaction code as running in separate clients. In fact, in their experiments they run the transaction code on the same machines as the actual FaRM storage servers, but I'll mostly think of it as a separate set of clients. The clients run transactions, and the transactions need to read and write data objects stored in the sharded servers. Each client not only runs transactions but also acts as the transaction coordinator for two-phase commit.
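To make the sharding picture concrete, here is a minimal sketch of the kind of mapping the configuration manager hands out: each region (shard) number maps to one primary and its backups, reads go to the primary, and writes must update every replica. The Go types and server names here are hypothetical, not FaRM's actual data structures.

```go
package main

import "fmt"

// ReplicaSet records which servers hold one region of data: one
// primary plus its backups. With f+1 replicas, the region stays
// available as long as at most f of them have failed.
// (Hypothetical sketch; not FaRM's actual configuration-manager state.)
type ReplicaSet struct {
	Primary string
	Backups []string
}

// config is the kind of table the configuration manager (backed by
// ZooKeeper) hands out: region number -> replica set.
var config = map[uint32]ReplicaSet{
	1: {Primary: "server1", Backups: []string{"server2"}},
	2: {Primary: "server3", Backups: []string{"server4"}},
}

func main() {
	rs := config[1]
	fmt.Println("reads of region 1 go to the primary:", rs.Primary)
	fmt.Println("writes to region 1 must update:", rs.Primary, rs.Backups)
}
```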
OK, so that's the basic setup. Now, how do they get performance? Because this really is a paper about how you can get high performance and still have transactions, and they get it from a handful of ingredients. The main one is sharding: in their experiments they shard their data 90 ways across 90 servers (or maybe it's 45 ways). As long as operations on different shards are more or less independent of each other, that alone gets you roughly a 90x speedup, because you can run whatever you're running in parallel on 90 servers. So that's a huge win from sharding.

Another trick they play to get good performance is that all the data has to fit in the RAM of the servers; they don't really store the data on disk. It all has to fit in RAM, which of course means you can get at it quickly. At the same time they need to tolerate power failures, which means they can't just use plain RAM, because they need to recover the data after a power failure and RAM loses its contents when the power goes out. So they have a clever non-volatile RAM scheme for making the contents of RAM survive power failures; this is in contrast to storing the data persistently on disk, and it's much faster than disk. Another trick they play is the RDMA technique itself, which relies on clever network interface cards that accept packets instructing the NIC to directly read and write the memory of the server without interrupting the server. A related trick is what's often called kernel bypass, which means that application-level code can directly access the network interface card without getting the kernel involved. So those are the clever tricks they use to get high performance; we've already talked about sharding a lot, and I'll talk about the rest in this lecture.

OK, so first I'll talk about non-volatile RAM. This is really a topic that doesn't affect the rest of the design directly. As I said, all the data in FaRM is stored in RAM. When a client transaction updates a piece of data, what that really means is that it reaches out to the relevant servers that store the data and causes those servers to modify the object in question right in RAM, and that's as far as the writes get; they don't go to disk. This is in contrast to your Raft implementations, for example, which spend a lot of time persisting data to disk; there's no such persisting in FaRM. This is a big win: a write to RAM takes about 200 nanoseconds, whereas a write even to a solid-state drive, which is pretty fast, takes about 100 microseconds, and a write to a hard drive takes about 10 milliseconds. So being able to write to RAM is worth many orders of magnitude in speed for transactions that modify things. But of course RAM loses its contents on a power failure, so it's not persistent by itself.

As an aside, you might think that writing modifications to the RAM of multiple servers, that is, having replica servers and updating all the replicas, might be persistent enough.
After all, if you have F+1 replicas, you can tolerate up to F failures. The reason simply writing to RAM on multiple servers is not good enough is that a site-wide power failure takes down all of your servers at once, violating the assumption that failures of different servers are independent. So we need a scheme that works even if power fails to the entire data center.

What FaRM does is put a big battery in every rack and run the power supply through the batteries, so the batteries automatically take over if there's a power failure and keep all the machines running, at least until the batteries run out. Of course the battery isn't very big; it may only be able to run the machines for, say, ten minutes, so the battery by itself is not enough to let the system withstand a lengthy power outage. Instead, when the battery system sees that main power has failed, it keeps the servers running but also alerts all of the servers, with some kind of interrupt or message, telling them: the power has just failed, and you only have about ten minutes before the batteries give out too. At that point the software on FaRM's servers stops all FaRM processing, and then each server copies all of its RAM to a solid-state drive attached to that server, which might take a couple of minutes. Once all the RAM has been copied to the SSD, the machine shuts itself down. So if all goes well, after a site-wide power failure all the machines have saved their RAM to disk, and when power comes back to the data center, each machine reads the memory image that was saved on its disk as it reboots and restores it into RAM. There's some recovery that has to go on, but basically they won't have lost any of their persistent state due to the power failure. What that really means is that FaRM uses conventional RAM but has essentially made it non-volatile, able to survive power failures, with this trick of a battery, the battery alerting the server, and the server saving its RAM contents to a solid-state drive. Any questions about the NVRAM scheme?

This is a useful trick, but it's worth keeping in mind that it only helps with power failures. The whole sequence of events is only set in motion when the battery notices that main power has failed. If something else causes a server to fail, like a hardware fault or a bug in the software that causes a crash, the non-volatile RAM scheme does nothing for those crashes: the machine will reboot and lose the contents of its RAM, and it won't be able to recover them. So the NVRAM scheme is good for power failures but not for other crashes, and that's why, in addition to NVRAM, FaRM also keeps multiple replicas of each shard.
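Here is a rough sketch of the power-failure path just described, assuming a hypothetical alert from the battery system and a local file standing in for the SSD image; FaRM's real code isn't published in this form.

```go
package main

import (
	"fmt"
	"os"
)

// onPowerFailAlert is the hypothetical handler run when the rack
// battery reports that main power has failed: stop FaRM processing,
// dump the in-RAM regions to the local SSD, then power off.
func onPowerFailAlert(regions []byte, imagePath string) {
	stopFarmProcessing() // stop serving transactions and log entries
	if err := os.WriteFile(imagePath, regions, 0600); err != nil {
		fmt.Println("save failed:", err) // a real system would retry or raise an alarm
		return
	}
	shutDown()
}

// onReboot reloads the saved image, if any, when power returns.
func onReboot(imagePath string) []byte {
	img, err := os.ReadFile(imagePath)
	if err != nil {
		return nil // no saved image: this was a crash, not a power failure
	}
	return img
}

func stopFarmProcessing() { fmt.Println("stopping FaRM processing") }
func shutDown()           { fmt.Println("powering off") }

func main() {
	onPowerFailAlert([]byte("in-memory regions ..."), "farm-image.bin")
	fmt.Printf("restored %d bytes after reboot\n", len(onReboot("farm-image.bin")))
}
```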
All right. So this NVRAM scheme essentially eliminates persistence writes as a bottleneck in the performance of the system, leaving the network and the CPU as the remaining bottlenecks, which is what we'll talk about next.

There's a question: if the data center's power fails and FaRM saves everything to the solid-state drives, would it be possible to carry all the drives to a different data center and continue operation there? In principle, absolutely. In practice, I think it would almost certainly be easier to restore power to the data center than to move the drives; the problem is there's no power in the old data center, so you'd have to physically move the drives, and maybe the computers, to the new data center. So if you wanted to do this it might be possible, but it's certainly not what the FaRM designers had in mind; they assumed power would be restored.

OK, so that's NVRAM, and at this point we can mostly ignore it for the rest of the design. It doesn't really interact with the rest of the design, except that we know we don't have to worry about writing data to disk.

All right. As I mentioned, once you eliminate writing data to disk for persistence, the remaining bottlenecks have to do with the CPU and the network. In fact, in FaRM, and indeed in a lot of the systems I've been involved with, a huge bottleneck has been the CPU time required to deal with network interactions, so the network and the CPU are kind of joint bottlenecks here. FaRM doesn't have any speed-of-light network problems; instead it spends a lot of effort eliminating bottlenecks in getting network data into and out of the computers.

First, as background, I want to lay out the conventional architecture for getting things like remote procedure call packets between applications on different computers, just so we have an idea of why the approach FaRM takes is more efficient. Typically, on a computer that wants to send, say, an RPC message, you have an application running in user space, with a user/kernel boundary below it. The application makes system calls into the kernel, which are not particularly cheap, in order to send data, and then there's a whole stack of software inside the kernel involved in sending data over the network. There's usually what's called a socket layer, which does buffering, and that involves copying the data, which takes time. There's typically a complex TCP protocol stack that knows all about things like retransmission, sequence numbers, checksums, and flow control; there's quite a bit of processing there. At the bottom there's a piece of hardware called the network interface card, which has a bunch of registers the kernel can talk to in order to configure it, plus the hardware required to send bits out over the cable onto the network.
There's a network interface card driver in the kernel, and any self-respecting network interface card uses direct memory access (DMA) to move packets into and out of host memory, so there are queues of incoming packets that the NIC has DMA'd into memory waiting for the kernel to read, and outgoing queues of packets that the kernel would like the NIC to send as soon as convenient. So to send a message like an RPC request, it goes down from the application through this stack, the network interface card sends the bits out on a cable, and then there's the reverse stack on the other side: the network interface hardware interrupts the kernel, the kernel runs driver code, which hands packets to the TCP protocol, which writes them into buffers, and at some point the application gets around to reading them, making system calls into the kernel that copy the data out of those buffers into user space. This is a lot of software, a lot of processing, and a lot of fairly expensive CPU operations like system calls, interrupts, and copying data. As a result, classical network communication is relatively slow: it's quite hard to build an RPC system with this traditional architecture that can deliver more than, say, a few hundred thousand RPC messages per second. That might seem like a lot, but it's orders of magnitude too little for the kind of performance FaRM is targeting, and in general a couple hundred thousand RPCs per second is far, far less than what the actual network hardware, the wire and the network interface card, is capable of. These cables typically run at something like 10 gigabits per second, and it's very, very hard to write RPC software in this style that can generate or absorb anything like 10 gigabits per second of the small messages databases often need, which would be millions, maybe tens of millions, of messages per second.

OK, so that's the plan FaRM doesn't use, and in a sense FaRM's design is a reaction to it. Instead, FaRM uses two ideas to reduce the cost of pushing packets around. The first one I'll call kernel bypass. The idea here is that, instead of the application sending all its data down through a complex stack of kernel code, the kernel configures the protection machinery in the computer to give the application direct access to the network interface card. The application can actually reach out and touch the NIC's registers and tell it what to do. In addition, with this kernel-bypass scheme, the network interface card DMAs directly into application memory, where the application can see the bytes arriving without kernel intervention, and when the application needs to send data, it can create queues that the network interface card directly reads with DMA and sends out over the wire.
So now we've completely eliminated all the kernel code involved in networking. The kernel just isn't involved: there are no system calls and no interrupts. The application directly reads and writes the memory that the network interface card sees, and of course the same thing happens on the other side. This is an idea that was not possible years ago, but most modern serious network interface cards can be set up to do this. It does, however, require the application to take over all the things TCP was doing for it, like checksums and retransmission; the application is now in charge of those. You can actually do kernel bypass yourself, using a toolkit you can find on the web called DPDK; it's relatively easy to use and lets people write extremely high-performance networking applications. So FaRM does use this: its applications talk directly to the NIC, and the NIC DMAs things right into application memory.

We have a student question: does this mean that FaRM machines run a modified operating system? I don't know the actual answer; I believe FaRM runs on some form of Windows, and whether or not they had to modify Windows I don't know. In the Linux world there's already full support for this. It does require kernel cooperation, because ordinarily application code cannot touch devices directly, so Linux had to be modified to allow the kernel to delegate hardware access to applications. Those modifications are already in Linux, and maybe already in Windows too. In addition, this relies on fairly intelligent NICs, because of course multiple applications will want to play this game with the network interface card, and modern NICs actually know about multiple distinct queues, so each application can have its own set of queues that the NIC knows about. So it has required modification of a lot of things.

OK, so step one is this kernel bypass idea. Step two is an even cleverer kind of NIC, and now we're getting into hardware that is not in wide use at the moment; you can buy it commercially, but it's not the default. This is the RDMA scheme, remote direct memory access. These are special network interface cards that support RDMA, and both sides have to have them. I'm drawing them as connected by a cable, but in fact there's always a switch that has connections to many different servers and allows any server to talk to any other server. So we have these RDMA NICs, and again we have the applications and their memory.
Now, though, an application on the source host can send a special message through the RDMA system that tells the destination host's network interface card to directly read or write some bytes of memory, probably a cache line, in the target application's address space. Hardware and firmware on the network interface controller perform the read or write of the target application's memory directly, and then send the result back to an incoming queue on the source application. The cool thing about this is that the destination computer's CPU, and the application on it, knew nothing about the read or write; it's executed entirely in firmware on the network interface card. There are no interrupts; the application didn't have to think about the request or about replying. The NIC just reads or writes the memory and sends the result back to the source. If all you need to do is read or write data in the RAM of the target application, this is a much, much lower-overhead way of getting at it than sending an RPC, even with fancy kernel-bypass networking.

There's a question: does RDMA always require kernel bypass to work at all? I don't know the answer. I've only ever heard it used in conjunction with kernel bypass, because the people interested in any of this are interested in it only for tremendous performance, and I'm guessing you'd throw away a lot of the performance win if you had to send the requests through the kernel.

Another question notes that TCP supports in-order delivery, duplicate detection, and a lot of other excellent properties that you actually need, so it would be extremely awkward if this setup sacrificed reliable or in-order delivery. The answer is that these RDMA NICs run their own reliable, sequenced protocol, something like TCP although it isn't TCP, between the NICs. When you ask your RDMA NIC to do a read or write, it will keep retransmitting if the request is lost until it gets a response, and it actually tells the originating software whether the request succeeded or not, so you do eventually get an acknowledgment back. So you don't in fact have to sacrifice most of TCP's good properties. Now, this stuff only works over a local network; I don't believe RDMA would be satisfactory between distant data centers. It's all tuned for very low speed-of-light delays.

OK, a particular piece of jargon the paper uses is "one-sided RDMA", and that's basically what I've just described: when an application uses RDMA to read or write the memory of another machine, that's one-sided RDMA. In fact, FaRM also uses RDMA to send messages in an RPC-like protocol.
So sometimes FaRM directly reads with one-sided RDMA, but sometimes what FaRM is using RDMA for is to append a message to an incoming message queue inside the target. In fact, for writes, what FaRM always does is use an RDMA write to append a new message to an incoming queue, or log, in the target, which the target polls. Since there are no interrupts here, the way the destination of such a message knows it has arrived is that it periodically checks these queues in memory to see whether it has received a recent message from anyone. So one-sided RDMA proper is just reads and writes, but FaRM also uses RDMA writes to send messages, appending either to a message queue or to a log on another server. And the memory being written into is all non-volatile, so all of it, including the message queues, gets written to disk if there's a power failure.

As for performance, Figure 2 shows that you can get 10 million small RDMA reads and writes per second, which is fantastic, far, far faster than you can send messages such as RPCs using TCP, and the latency of a simple RDMA read or write is about 5 microseconds. So again, this is very, very fast: 5 microseconds is slower than accessing your own local memory, but it's faster than pretty much anything else people do over networks.
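Since there are no interrupts, the receiving server just polls its per-sender queues. Here is a minimal sketch of that idea, modeling each incoming log as a slice that the sender's RDMA writes have already appended to; the real FaRM log layout and polling code are of course more involved.

```go
package main

import "fmt"

// incomingLog is one per-sender queue in this server's memory. In FaRM
// the sender appends entries with one-sided RDMA writes; here that is
// modeled as a slice plus an index of the next unprocessed entry.
type incomingLog struct {
	entries []string // entries the remote side has written
	next    int      // next entry this server has not yet processed
}

// pollLogs is the server's polling loop: no interrupts, it simply scans
// every per-sender log for unprocessed entries and handles them.
func pollLogs(logs []*incomingLog, handle func(sender int, msg string)) {
	for sender, l := range logs {
		for l.next < len(l.entries) {
			handle(sender, l.entries[l.next])
			l.next++
		}
	}
}

func main() {
	logs := []*incomingLog{{}, {}}
	// Pretend sender 1's RDMA write appended a LOCK record.
	logs[1].entries = append(logs[1].entries, "LOCK oid=42 version=7 value=...")
	pollLogs(logs, func(sender int, msg string) {
		fmt.Printf("from sender %d: %s\n", sender, msg)
	})
}
```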
So this is the promise: there's this fabulous RDMA technology, and the FaRM people wanted to exploit it. The coolest possible thing you could imagine doing with it is to use one-sided RDMA reads and writes to directly perform all the reads and writes of the records stored in the database servers' memory. Wouldn't it be fantastic if we never had to talk to the database server's CPU or its software at all, and could just get at the data we need in five microseconds a pop with direct one-sided RDMA reads and writes? In a sense, this paper is about what you have to do, starting from that idea, to actually build something useful. An interesting question, by the way, is whether you could in fact implement transactions using only one-sided RDMA: that is, read and write the data in the servers using nothing but RDMA, and never send messages that have to be interpreted by the server software. It's worth thinking about. In a sense FaRM answers that question with a no, because that's not really how FaRM works, but it is absolutely worth thinking about why pure one-sided RDMA couldn't be made to work.

All right, so the challenge is combining RDMA with transactions, sharding, and replication, because you need all of those to have a seriously useful database system. It turns out that all the protocols we've seen so far for transactions and replication require active participation by the server software: the server is actively involved in helping the clients read or write the data. For example, in the two-phase commit schemes we've seen, the server has to do things like decide whether a record is locked and, if it's not locked, set the lock on it; it's not clear how you could do that with one-sided RDMA. In Spanner, with all those versions, it was the server that figured out how to find the right version. Similarly, with transactions and two-phase commit, the data on the server isn't just data: there's committed data, and there's data that has been written but hasn't committed yet, and traditionally it's the server that sorts out whether recently updated data has committed, in order to prevent clients from seeing data that's locked or not yet known to be committed. What that means is that, without some clever thought, pure one-sided RDMA doesn't seem to be immediately compatible with transactions and replication. And indeed, while FaRM does use one-sided reads to get directly at data in the database, it is not able to use one-sided writes to modify the data.

OK, so this leads us to optimistic concurrency control. It turns out that the main trick, in a sense, that FaRM uses to allow it both to use RDMA and to get transactions is optimistic concurrency control. If you remember, I mentioned earlier that concurrency control schemes are divided into two broad categories, pessimistic and optimistic. Pessimistic schemes use locks: if a transaction is going to read or write some data, then before it can touch the data at all it must acquire a lock, and it must wait for the lock. You read about two-phase locking, for example, in the 6.033 reading: before you use data you have to lock it, you hold the lock for the entire duration of the transaction, and only when the transaction commits or aborts do you release the lock. If there are conflicts, because two transactions want to write the same data at the same time, or one wants to read and another wants to write, they can't proceed at the same time: all but one of the transactions that want to use the data must block and wait for the lock to be released. The fact that the data has to be locked, and that somebody has to keep track of who owns the lock and when it's released and so on, is what makes it unclear how you could do writes, or even reads, with one-sided RDMA in a locking scheme, because somebody has to enforce the locks. I'm being a little tentative about this because I suspect that with cleverer RDMA NICs that supported a wider range of operations, like atomic test-and-set, you might someday be able to do a locking scheme with pure one-sided RDMA, but FaRM doesn't do it.
OK, so what FaRM actually uses is an optimistic scheme. In an optimistic scheme you can read without locking: you just read the data. You don't know yet whether you were allowed to read it, or whether somebody else is in the middle of modifying it; you just read it, and the transaction uses whatever it happens to get. You also don't directly write the data in optimistic schemes. Instead, you buffer the writes locally, in the client, until the transaction finishes. Then, when the transaction finishes and you want to try to commit it, there's what's called a validation stage, in which the transaction-processing system tries to figure out whether the reads and writes you actually did were consistent with serializability; that is, it tries to figure out whether somebody was writing the data while you were reading it, because if they were, you can't commit this transaction: it computed with garbage instead of consistent read values. If the validation succeeds, you commit; if it doesn't, if you detect that somebody else was messing with the data while you were using it, you abort. So when there are conflicts, when you're reading or writing data that some other transaction is modifying at the same time, optimistic schemes abort, because the computation is already incorrect by the commit point: you already read data you weren't supposed to read. There's no way to, say, block until things are okay; the transaction is already poisoned and just has to abort, and possibly be retried.

So FaRM uses optimistic concurrency control because it wants to be able to use one-sided RDMA to just read whatever's there, very quickly. This design was really forced by the use of RDMA. It's often abbreviated OCC, for optimistic concurrency control. The interesting part of any OCC protocol is how validation works: how do you actually detect that somebody else was writing the data while you were trying to use it? That's mainly what I'll talk about in the rest of this lecture. Just to tie this back to the top level of the design: what OCC does for FaRM is that reads can use one-sided RDMA, and therefore be extremely fast, because we're going to check later whether the reads were okay.

All right. FaRM is a research prototype; it doesn't support things like SQL. It supports a fairly simple API for transactions, and here it is, just to give you a taste of what transaction code might actually look like. If you have a transaction, you have to declare the start of it, because we need to say that this particular set of reads and writes must occur as a single transaction. The code declares a new transaction by calling TxCreate. This is all laid out, by the way, in a slightly earlier paper, I think from 2014, by the same authors.
You create a new transaction, and then you explicitly call functions to read objects, supplying an object identifier, an OID, indicating which object you want to read. You get back an object, and you can modify it in local memory: you have a copy of it that you read back from the server with TxRead, so you might, say, increment some field in it. Then, when you want to update an object, you call TxWrite, again giving it the object ID and the new object contents. Finally, when you're through with all of this, you have to tell the system to commit the transaction, to actually do the validation and, if it succeeds, make the writes really take effect and become visible; for that you call the commit routine. The commit routine runs the whole pile of machinery in Figure 4, which we'll talk about, and it returns an OK value that tells the application whether the commit succeeded or was aborted; this return value has to correctly indicate whether the transaction succeeded.

There are some questions. One is: since OCC aborts when there's contention, do retries involve exponential backoff? Because otherwise, if you retried instantly and there were a lot of transactions all trying to update the same value at the same time, they'd all abort, all retry, and waste a lot of time. I don't know the answer; I don't remember the paper mentioning exponential backoff, but it would make a huge amount of sense to delay between retries and to increase the delay, to give somebody a chance of succeeding. This is much like the randomization of Raft's election timers. Another question: is the FaRM API closer in spirit to a NoSQL database? Yes, that's one way of viewing it. It doesn't have any of the fancy query machinery, like joins, that SQL has; it's a very low-level read/write interface plus transaction support, so you can view it as a NoSQL database with transactions.

All right, so that's what a transaction looks like, and these are all library calls: create, read, write, commit. Commit is a sort of complex call that actually runs the transaction-coordinator code, a variant of two-phase commit described in Figure 4. Just to repeat: while the read call goes off and actually reads from the relevant server, the write call just locally buffers the new, modified object, and it's only in commit that the objects are sent to the servers.
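To make the API concrete, here is roughly what the increment example we'll use later looks like against this interface. The Go types and method names are my own invention (the paper's interface is C++ and named a bit differently), so treat this as a sketch of the shape of a FaRM transaction rather than the real bindings.

```go
package main

import "fmt"

// Hypothetical Go-flavored rendering of the transaction API just
// described: create a transaction, read objects by OID, buffer writes
// locally, and commit.
type OID uint64

type Tx struct {
	writes map[OID][]byte // buffered writes; nothing reaches a server until Commit
}

func TxCreate() *Tx {
	return &Tx{writes: map[OID][]byte{}}
}

// Read would issue a one-sided RDMA read to the object's primary and
// remember the version number for later validation; here it's a stub.
func (t *Tx) Read(oid OID) []byte {
	return []byte{0}
}

// Write only updates the buffered local copy; the new value is shipped
// to the primaries during Commit's LOCK phase.
func (t *Tx) Write(oid OID, val []byte) {
	t.writes[oid] = val
}

// Commit would run the Figure 4 protocol (LOCK, VALIDATE,
// COMMIT-BACKUP, COMMIT-PRIMARY, truncation) and report success or abort.
func (t *Tx) Commit() bool {
	return true
}

func main() {
	tx := TxCreate()
	x := tx.Read(OID(42)) // fetch the object
	x[0]++                // modify the local copy only
	tx.Write(OID(42), x)  // buffer the update
	fmt.Println("committed:", tx.Commit())
}
```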
These object IDs are actually compound identifiers with two parts. One part identifies a region: all the memory of all the servers is split up into regions, and the configuration manager tracks which servers replicate which region number, so there's a region number in the OID, and the client can look up in a table the current primary and backups for a given region number. Then there's an address, essentially a straight memory address within that region. So the client uses the region number to pick the primary and backup to talk to, and then it hands the address to the RDMA NIC and says: please read at this address to fetch this object.

All right, another piece of detail we have to get out of the way is the server memory layout. In any one server there's a bunch of stuff in memory. First, if the server is replicating one or more regions, it has the actual regions in memory, and what a region contains is a whole lot of objects. Each object has a header, which contains a version number; these are versioned objects, but each object only has one version at a time. The high bit of the version-number word is a lock flag: so in the header of an object there's a lock flag in the high bit, a version number in the low bits, and then the actual data of the object. Every object in the server's memory has this same layout: a lock bit in the high bit, the current version number in the remaining bits, and the data. Every time the system modifies an object, it increments the version number; we'll see how the lock bits are used in a couple of minutes.
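A minimal sketch of that header encoding, with the lock flag in the high bit and the version number in the remaining bits of a single 64-bit word (the real FaRM header has more in it than this):

```go
package main

import "fmt"

// The lock flag lives in the high bit and the version number in the
// low bits, so one 64-bit word (and one atomic instruction) covers both.
const lockBit = uint64(1) << 63

func version(hdr uint64) uint64  { return hdr &^ lockBit }
func isLocked(hdr uint64) bool   { return hdr&lockBit != 0 }
func withLock(hdr uint64) uint64 { return hdr | lockBit }

func main() {
	hdr := uint64(7) // version 7, unlocked
	fmt.Println(version(hdr), isLocked(hdr)) // 7 false
	hdr = withLock(hdr)
	fmt.Println(version(hdr), isLocked(hdr)) // 7 true
	fmt.Println(version(hdr) + 1)            // 8: the version after a commit
}
```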
In addition, in the server's memory there are pairs of queues, message queues and logs, one pair for every other computer in the system. That means that if there are four machines running transactions, there will be four logs sitting in this server's memory that can be appended to with RDMA, one for each of the other computers that can run transactions. So the transaction code running on, say, computer 2, when it wants to talk to this server and append to its log, actually appends to "computer 2's log" in this server's memory. There's a total of N-squared of these queues floating around across the servers' memories. There's one set of logs, which are meant to be non-volatile, and also a separate set of message queues used for more RPC-like communication, again one incoming message queue per other server, written with RDMA writes.

All right, the next thing to talk about is Figure 4 in the paper, which lays out the OCC commit protocol that FaRM uses. I'm going to go through the steps mostly one by one, and to begin with I'm going to focus only on the concurrency-control part; it turns out these steps also do replication as well as implement serializable transactions, but we'll talk about the replication for fault tolerance a bit later.

The first thing that happens is the execute phase, and this is the TxReads and TxWrites, the reads and writes the client transaction is doing. The transaction runs on some client machine, and when it needs to read something, it uses a one-sided RDMA read to simply read the object out of the relevant primary's memory. In the figure there's a primary and a backup for each of three different shards, and we're imagining that our transaction reads one object from each of those shards using one-sided RDMA reads, which means each read is that blindingly fast five microseconds. So the client reads everything it needs to read for the transaction; also, anything it's going to write it first reads, because it needs to get the object's initial version number.

So that's the execute phase. Then, when the transaction calls TxCommit to indicate that it's done, the library on the client, inside that TxCommit call, acts as the transaction coordinator and runs this whole protocol, which is a kind of elaborate version of two-phase commit, described in terms of rounds of messages: the transaction coordinator sends a bunch of LOCK messages and waits for all the replies, then VALIDATE messages and waits for all those replies, and so on. The first phase in the commit protocol is the LOCK phase. In this phase, for each object the client has written, the client sends the updated object to the relevant primary, as a new log entry in that primary's log for this client. The client is really using RDMA to append to the primary's log, and what it appends is the object ID of the object it wants to write, the version number the client initially read when it read the object, and the new value. It appends such a record to the log on the primary of each shard in which it wrote an object; in this example, the transaction wrote two different objects, one on primary 1 and the other on primary 2. Now these new log records are sitting in the logs of the primaries, but each primary has to actively process them, because it needs to do a number of checks involved in validation, to decide whether its part of the transaction can be allowed to commit. So at this point we have to wait for each primary to poll this client's log in the primary's memory, notice the new log entry, process it, and then send back a yes-or-no vote saying whether it is or is not willing to do its part of the transaction.
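Here is a sketch of the coordinator's side of the LOCK phase just described. The record fields match what gets appended (OID, version read, new value); the appendAndAwaitVote helper is a hypothetical stand-in for the RDMA append plus the wait for the primary's vote, and FaRM issues the appends in parallel rather than one at a time as this sketch does.

```go
package main

import "fmt"

// lockRecord is what the coordinator appends to a primary's log for
// each object the transaction wrote.
type lockRecord struct {
	oid         uint64
	versionRead uint64 // version the client saw during the execute phase
	newValue    []byte // buffered write; applied only at COMMIT-PRIMARY
}

// lockPhase appends one LOCK record per written object and collects
// the primaries' votes. (Simplified: FaRM appends them all and then
// waits for the votes, rather than going one record at a time.)
func lockPhase(writes []lockRecord, appendAndAwaitVote func(lockRecord) bool) bool {
	for _, rec := range writes {
		if !appendAndAwaitVote(rec) {
			return false // any "no" vote aborts the whole transaction
		}
	}
	return true // all primaries voted yes: proceed to COMMIT-PRIMARY
}

func main() {
	writes := []lockRecord{{oid: 42, versionRead: 7, newValue: []byte{1}}}
	ok := lockPhase(writes, func(lockRecord) bool { return true }) // pretend every primary votes yes
	fmt.Println("all locks acquired:", ok)
}
```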
So what does a primary do when its polling loop sees an incoming LOCK log entry from a client? First of all, if the object with that object ID is currently locked, the primary rejects the LOCK message and sends back a message to the client, using RDMA, saying no: this transaction cannot proceed, I'm voting no in the two-phase commit. That will cause the transaction coordinator to abort the transaction. If the object is not locked, the next thing the primary does is check the version number: it checks that the version number the client sent, that is, the version the client originally read, is unchanged. If the version number has changed, that means that between when our transaction read the object and when it tried to write it, somebody else wrote the object, so again the primary responds no and forbids the transaction from continuing. But if the version number is the same and the lock is not set, the primary sets the lock and returns a positive response to the client.

Now, because the primary is multi-threaded, running on multiple CPUs, other CPUs may be reading incoming log queues from other clients at the same time on the same primary, so there can be races between LOCK-record processing from different transactions trying to modify the same object. The primary therefore uses an atomic instruction, a compare-and-swap, to check the version number and set the lock bit on that version number as a single atomic operation. This is the reason the lock bit has to be in the high bit of the version-number word: so that a single compare-and-swap instruction can cover both the version number and the lock bit. One thing to note is that if the object is already locked, there's no blocking, no waiting for the lock to be released: the primary simply sends back a no if some other transaction has it locked.
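Here is a sketch of the primary's handling of one LOCK record, with the version check and the lock-set folded into a single compare-and-swap on the header word, as just described. The types are simplified stand-ins for the real object layout.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const lockBit = uint64(1) << 63

type object struct {
	header uint64 // lock bit | version number
	data   []byte
}

// processLock returns the primary's vote for this object. The CAS
// succeeds only if the header is exactly (unlocked, versionRead): a set
// lock bit or a changed version number both make it fail, and in either
// case the primary votes no. There is no waiting for the lock.
func processLock(obj *object, versionRead uint64) bool {
	return atomic.CompareAndSwapUint64(&obj.header, versionRead, versionRead|lockBit)
}

func main() {
	obj := &object{header: 7, data: []byte{0}}
	fmt.Println(processLock(obj, 7)) // true: lock acquired, vote yes
	fmt.Println(processLock(obj, 7)) // false: already locked, vote no

	stale := &object{header: 8, data: []byte{1}} // someone already committed version 8
	fmt.Println(processLock(stale, 7))           // false: version changed, vote no
}
```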
Any questions about the LOCK phase of commit? All right. Back in the client, which is acting as transaction coordinator, it waits for responses from all the primaries, that is, from the primary of every shard holding an object the transaction modified. If any of them say no, if any of them reject the transaction, the transaction coordinator aborts the whole transaction and sends out messages to the primaries saying, in effect, I changed my mind, I don't want to commit this transaction after all. But if all the primaries answer yes, the transaction coordinator decides that the transaction can actually commit. The primaries, of course, don't know whether they all voted yes, so the transaction coordinator has to notify every primary: yes, indeed, everybody voted yes, so please actually commit this. The way the client does that is by appending another record to the logs of the primaries for each modified object; this time it's a COMMIT-PRIMARY record that it appends. (I'm skipping over VALIDATE and COMMIT-BACKUP for now; I'll talk about those later, so just ignore them for the moment.) So the transaction coordinator appends a COMMIT-PRIMARY record to each primary's log, and it only has to wait for the hardware RDMA acknowledgments; it doesn't have to wait for the primaries to actually process the log records. In fact, as soon as the transaction coordinator gets a single acknowledgment from any of the primaries, it can return OK equals true to the application, signifying that the transaction succeeded. Then there's another stage later on: once the transaction coordinator knows that every primary knows the transaction committed, it can tell all the primaries that they can discard the log entries for this transaction.

OK, now there's one last thing that has to happen. The primaries, which are polling their logs, will at some point notice a COMMIT-PRIMARY record. A primary that receives a COMMIT-PRIMARY log entry knows that it locked that object previously and that the object must still be locked, so what the primary does is update the object in its memory with the new contents that were previously sent in the LOCK message, increment the version number associated with the object, and finally clear the lock bit. What that means is that as soon as a primary receives and processes a COMMIT-PRIMARY log record, since it clears the lock bit and updates the data, it may well expose the new data to other transactions: other transactions from this point on are free to use the object, with its new value and new version number.
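And here is the matching sketch of COMMIT-PRIMARY processing at the primary: install the new value, bump the version, clear the lock bit. Again the types are simplified stand-ins.

```go
package main

import "fmt"

const lockBit = uint64(1) << 63

type object struct {
	header uint64 // lock bit | version number
	data   []byte
}

// processCommitPrimary installs the new value (sent earlier in the LOCK
// record), increments the version number, and clears the lock bit. From
// this point other transactions can see the object's new value and version.
func processCommitPrimary(obj *object, newValue []byte) {
	obj.data = newValue
	oldVersion := obj.header &^ lockBit // the object is still locked from the LOCK phase
	obj.header = oldVersion + 1         // new version number, lock bit cleared
}

func main() {
	obj := &object{header: 7 | lockBit, data: []byte{0}} // locked at version 7
	processCommitPrimary(obj, []byte{1})
	fmt.Println(obj.header, obj.data) // 8 [1]: unlocked, version 8, new value
}
```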
All right, I'm going to do an example. Any questions about the machinery before I start? Feel free to ask questions at any time.

So, an example. Suppose we have two transactions, T1 and T2, and they're both trying to do the same thing: they both just want to increment x, where x is an object sitting in some server's memory. Before we look at what actually happens, we should remind ourselves what the valid possible outcomes are, and that's all about serializability. FaRM guarantees serializability, which means that whatever FaRM actually does has to be equivalent to some one-at-a-time execution of these two transactions. So we're allowed to see the results you would get if T1 ran and then, strictly afterwards, T2 ran, or the results that could arise if T2 ran and then T1 ran; those are the only possibilities. Now, FaRM is also entitled to abort a transaction, so we additionally have to consider the possibility that one of the two transactions aborted, or indeed that both aborted. Since they're both doing the same thing, there's a certain amount of symmetry here. One possibility is that both committed, which means two increments happened, so one legal outcome is x = 2; and TxCommit's return value has to agree with whether things aborted or committed, so in that case both transactions need to see TxCommit return true. Another possibility is that only one of the transactions committed and the other aborted, in which case we want to see x = 1, with one TxCommit returning true and the other false. And a final possibility is that they both aborted; we don't think that needs to happen here, but it's legal, in which case x is unchanged and both get false back from TxCommit. So we had better not see anything other than these three options.

All right. What happens depends, of course, on the timing, so I'm going to enumerate various ways the commit protocol could interleave; for convenience I have a handy reminder of the actual commit protocol here. One possibility is that the two transactions run exactly in lockstep: they send all their messages at the same time, and they read at the same time. I'm going to assume x starts out as zero, so if they both read at the same time they both see zero. Then they both send out LOCK messages at the same time, accompanying them with the value 1, since each is adding 1 to x, and if the LOCK responses say yes, they would both commit at the same time. So if this is the scenario, what's going to happen, and why? Would anyone like to raise their hand and hazard a guess?

Well, that's right: the reads can't possibly fail, since they're one-sided reads. Both transactions are going to send essentially identical LOCK messages to whichever primary holds object x, both with the same version number they read and the same new value, so the primary is going to see two LOCK messages in two different incoming logs, assuming the clients are running on different machines. Exactly what happens now is left slightly to our imagination by the paper, but I think the two incoming log messages could be processed in parallel on different cores of the primary. The critical instruction on the primary is the atomic compare-and-swap. Exactly as somebody volunteered: one of the cores will execute the compare-and-swap first, will set the lock bit on that object's version, and will observe that the lock bit wasn't previously set; whichever core executes the atomic compare-and-swap second will observe that the lock is already set. So one of the two gets a yes and the other fails: it sees the lock already set and the primary answers no. For the sake of symmetry I'll just imagine that it's transaction 2 the primary says no to, so transaction 2's client code will abort; transaction 1 got the lock, got a yes back, and will actually commit.
65:29 Transaction 1 got the lock and got a yes back, so it will actually commit: when the primary gets the COMMIT-PRIMARY message, it installs the updated 65:40 object, clears the lock bit, increments the version number, and transaction 1's TX_commit() returns true. 65:51 Because the primary sent back a no to transaction 2, its TX_commit() is going to return false, and the final value is x = 1. 66:00 That was one of our allowed outcomes, but of course it's not the only possible interleaving. Any questions about how this played out, or why it executed the way it did? 66:19 OK, so there are other possible interleavings. How about this one: let's imagine that transaction 2 does its read first — 66:33 it doesn't really matter whether the reads are concurrent or not — then transaction 1 does its read, and transaction 1 is a little bit faster: it gets its LOCK message in, gets a reply, and gets its commit back, and only afterwards does transaction 2 get going again and send its LOCK message to see if it can commit. 66:58 So what happens this time? 67:16 Well, transaction 1's LOCK is going to succeed, because there's no reason for the lock bit to be set — the second LOCK message hasn't even been sent yet. The LOCK message sets the lock bit, and the COMMIT-PRIMARY message then clears it, so the lock bit will be clear by the time transaction 2 inserts its LOCK entry in the primary's log; 67:49 the primary won't see the lock bit set at that point. 67:54 But, as somebody volunteered, what the primary will see is the version number: the LOCK message contains the version number that transaction 2 originally read, and since COMMIT-PRIMARY increments the version number, 68:11 the primary is going to see that the version number is stale — the number on the real object is now higher — so it's going to send back a no to the coordinator, and the coordinator is going to abort this transaction. 68:26 Again we get x = 1, one of the transactions returning true and the other false, which is the same final outcome as before, and it is allowed. Any questions about how this played out? 68:44 A slightly different scenario I was going to consider is one in which the commit message is stalled and arrives after the second LOCK; that's essentially the same as the first scenario, in which one transaction got the lock set and the other observed the lock. 69:10 OK, one last scenario: let's suppose we see this, where transaction 1 runs completely — read, LOCK, COMMIT — before transaction 2 even gets going. What's going to happen this time? 69:47 Yeah, somebody has the right answer: of course the first transaction goes through, because there's no contention at all, and when the second transaction goes to read x it sees the new version number, as incremented by the COMMIT-PRIMARY processing on the primary, and the lock bit won't be set. 70:11 So when it sends its LOCK log entry to the primary, the lock-processing code on the primary sees the lock isn't set and the version is the latest, so it says it's willing to commit. 70:26 The outcome this time is x = 2, because that second read saw not only the new version number but also the new value, which was 1, so the second increment produces 2, and both calls to TX_commit() return true. That's right: both succeed, with x = 2, which is also one of the allowed outcomes.
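Pulling those interleavings together, here is a toy, single-machine sketch of the execute / LOCK / COMMIT-PRIMARY flow for the x = x + 1 transaction. The type and function names are invented, the "primary" is just a local struct, and the one-sided RDMA read and the log records are reduced to ordinary memory accesses, so treat it only as a model of the decision logic, not as FaRM's implementation.

```go
package main

import "fmt"

// Toy object header: version number, lock flag, and the stored value.
// Real FaRM packs the lock bit and version into one word and keeps the
// object in RDMA-readable memory; this is just a local model.
type object struct {
	version uint64
	locked  bool
	value   int
}

// runIncrement models one client/coordinator running "x = x + 1".
// (Here the lock check-and-set is plain code; on the real primary it is
// the atomic compare-and-swap sketched earlier.)
func runIncrement(x *object) bool {
	// Execute phase: a one-sided read records the value and the version.
	readVersion, v := x.version, x.value

	// LOCK phase: the primary says yes only if the object is unlocked
	// and its version is unchanged since the read.
	if x.locked || x.version != readVersion {
		return false // primary replies no; the coordinator aborts
	}
	x.locked = true

	// COMMIT-PRIMARY phase: install the new value, bump the version,
	// clear the lock bit.
	x.value = v + 1
	x.version++
	x.locked = false
	return true // TX_commit() returns true
}

func main() {
	x := &object{version: 1}
	// Run the two transactions one after the other, as in the last
	// scenario: no conflict, so both commit and x ends up as 2.
	fmt.Println(runIncrement(x), runIncrement(x), x.value) // true true 2
}
```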
70:51 All right. So it happened to work out in these cases, and the intuition behind why optimistic concurrency control provides serializability — 71:03 why it basically checks that the execution that did happen is the same as some one-at-a-time execution — is essentially this: 71:13 if there was no conflicting transaction, the version numbers and the lock bits won't have changed; if nobody else is messing with these objects, we'll see the same version numbers at the end of the transaction as we did when we first read the objects. 71:27 Whereas if there is a conflicting transaction between when we read an object and when we try to commit a change, and that conflicting transaction modified something and actually started to commit, we will see a new version number or a lock bit set. 71:47 So the comparison of the version numbers and lock bits, between when you first read the object and when you finally commit, tells you whether some other commit to those objects snuck in while you were using them. 72:02 And the cool thing to remember is that this optimistic scheme — in which we don't actually check or take locks when we first use the data — is what allowed us to use these extremely fast one-sided RDMA reads to read the data and get high performance. 72:24 OK. The way I've explained it so far, without VALIDATE and without COMMIT-BACKUP, is basically the way the system works; but VALIDATE is an optimization for objects that are only read and not written, and COMMIT-BACKUP is part of the scheme for fault tolerance. In the few minutes we have left I want to talk about VALIDATE. 72:52 The VALIDATE stage is an optimization for objects that were only read by the transaction and not written, and it's particularly interesting for a pure read-only transaction that modifies nothing. 73:05 The optimization is that the transaction coordinator can execute the VALIDATE with a one-sided read, which is extremely fast, rather than having to append something to a log and wait for the primary to see the log entry and think about it. 73:22 So VALIDATE's one-sided read is going to be much, much faster; it essentially replaces LOCK for objects that were only read. 73:35 Basically, what VALIDATE does is have the transaction coordinator re-fetch the object's header. The coordinator read the object in the execute phase; when it's committing, instead of sending a LOCK message it re-fetches the object header and checks whether the version number now is the same as the version number when it first read the object, and also checks that the lock bit is clear. 74:10 So that's how it works: instead of sending a LOCK message, it sends this VALIDATE — really just another one-sided read — which should be much faster for a read-only object.
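Here is a minimal sketch of that check, again with invented names and no RDMA: a VALIDATE succeeds only if the re-fetched header still carries the version recorded at the execute-phase read and the lock bit is clear.

```go
package main

import "fmt"

// Toy header: just a version number and a lock flag.
type header struct {
	version uint64
	locked  bool
}

// validate sketches the VALIDATE check for an object the transaction only
// read: re-fetch the header (in FaRM this is a one-sided RDMA read) and
// confirm nothing has changed since the execute-phase read.
func validate(current header, versionReadEarlier uint64) bool {
	return !current.locked && current.version == versionReadEarlier
}

func main() {
	h := header{version: 5}
	fmt.Println(validate(h, 5)) // true: no conflicting writer, keep going
	h.version = 6               // some other transaction committed an update
	fmt.Println(validate(h, 5)) // false: the transaction must abort
}
```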
74:17 So let me put up another transaction example and run through how it works. 74:23 Suppose x and y are initially 0, and we have two transactions: T1 says "if x is equal to zero, set y equal to 1", and T2 says "if y is zero, set x equal to 1". 74:47 This is an absolutely classic test for strong consistency. If the execution is serializable, it has to be equivalent to either T1 then T2 or T2 then T1 — any correct implementation has to get the same results as running them one at a time. 75:10 If you run T1 and then T2, you get y = 1 and x = 0, because by the time the second if statement runs, y is already 1, so it does nothing; symmetrically, T2 then T1 gives you x = 1 and y = 0. 75:31 And it turns out that if they both abort you can get x = 0, y = 0. But what you are absolutely not allowed to get is x = 1 and y = 1; that's not allowed. 75:48 OK, so I'm going to use this as a test to see what happens with VALIDATE, and again we're going to suppose these two transactions execute at absolutely the same time — the most obvious case, and the hardest one. 76:17 So we have T1's read of x and T2's read of y. T1 locks y, because it wrote y, and T2 locks x; but since we're now using this read-only validation optimization, T2 has to validate y, and T1 has to validate x — it read x but didn't write it, so it validates it, which is much quicker — and then maybe each of them is going to commit. 76:50 So the question is: if we use VALIDATE as I described it — it just checks that the version number hasn't changed and the lock bit isn't set — will we get a correct answer? 77:22 And no — actually, the validation is going to fail for both. When these LOCK messages were processed by the relevant primaries, they caused the lock bits to be set (initially, presumably, the lock bits were clear). 77:42 So when each client comes to validate — even though it's doing a one-sided read of the object header for x or y — it's going to see the lock bit that was set by the processing of the other transaction's LOCK request. 77:55 So they're both going to see the lock bit set on the object they merely read, they're both going to abort, and neither x nor y will be modified — and that was one of the legal outcomes. That's right, somebody noticed this: indeed, both validates will fail.
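Here is a deterministic toy replay of that lockstep case, with made-up types and no real concurrency or RDMA, just to show why both validates come back false: both LOCK records are processed before either VALIDATE, so each transaction sees the lock bit the other one just set.

```go
package main

import "fmt"

// Toy object header: version number plus lock bit.
type object struct {
	version uint64
	locked  bool
}

// Lockstep interleaving for
//   T1: if x == 0 { y = 1 }    T2: if y == 0 { x = 1 }
func main() {
	x := &object{}
	y := &object{}

	// Execute phase: both transactions read, recording version numbers.
	xVer, yVer := x.version, y.version

	// LOCK phase: T1 locks y (its write set), T2 locks x.
	y.locked = true // primary for y answers yes to T1
	x.locked = true // primary for x answers yes to T2

	// VALIDATE phase: each re-reads the header of the object it only read.
	t1OK := !x.locked && x.version == xVer // T1 validates x
	t2OK := !y.locked && y.version == yVer // T2 validates y

	fmt.Println(t1OK, t2OK) // false false: both abort, neither x nor y is written
}
```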
78:16 Of course, sometimes a transaction can go through, and here's a scenario in which it does work out: transaction 1 is a little faster and validates first. 78:45 So what's going to happen? Transaction 1 is a little bit faster, so this time its validate is going to 79:05 succeed, because nothing has happened to x between when transaction 1 read it and when it validated. Presumably its LOCK also went through without any trouble, because nobody has modified y either, so the primary answered yes; the one-sided validate read revealed an unchanged 79:21 version number and a clear lock bit, so transaction 1 can commit, and y becomes 1. 79:29 But by this point — if this is the order — when the primary processes transaction 2's LOCK of x, that will also go through with no problem, because nobody has modified x. 79:47 When the client running transaction 2 re-fetches the version number and lock bit for y, though, what it sees depends on whether the commit has happened yet: 80:02 if the commit hasn't happened, this validate will see that the lock bit is set, because it was set back by transaction 1's LOCK; if the commit has happened already, the lock bit will be clear, but the validate's one-sided read will see a different version number than was originally read. 80:17 Either way — as somebody just answered — transaction 1 will commit and transaction 2 will abort. 80:25 And although I don't have time to talk about it here, for a straight read-only transaction there doesn't need to be a locking phase and there doesn't need to be a commit phase: pure read-only transactions can be done with just one-sided RDMA reads for the reads and one-sided RDMA reads for the validates, so they're extremely fast and don't require any work or attention from the server. 80:54 So that's at the heart of it: everything about FaRM is very streamlined, partially due to RDMA, and it uses OCC because it's basically forced to, in order to be able to do reads without checking locks. 81:15 There are a few downsides, though. It turns out optimistic concurrency control really works best if there are relatively few conflicts; if there are conflicts all the time, transactions will have to abort. And there are a bunch of other restrictions I already mentioned, like the fact that the data must all fit in RAM and all the computers must be in the same data center. 81:35 Nevertheless, this was viewed at the time — and still is — as a very surprisingly high-speed implementation of distributed transactions, just much faster than any system in production use. 81:52 It's true that the hardware involved is a little bit exotic: it really depends on this non-volatile RAM scheme and on these special RDMA NICs, and those are not particularly pervasive now, but you can get them, and with performance like this it seems likely that both NVRAM and RDMA will eventually be pretty pervasive in data centers, so that people can play these kinds of games. 82:16 And that's all I have to say about FaRM. I'm happy to take any questions if anybody has some; if not, I'll see you next week with Spark, which — you may be happy to know — is absolutely not about transactions. All right, everyone, bye-bye.