All right everybody, let's get started. Today's paper is the Aurora paper, which is all about how to get a high-performance, reliable database going as a piece of cloud infrastructure, itself built out of infrastructure that Amazon makes available. The reason we're reading this paper is, first of all, that it's a very successful recent cloud service from Amazon; a lot of their customers use it. It also shows, in its own way, an example of a very big payoff from clever design: Table 1, which summarizes the performance, claims a thirty-five times speedup in transaction throughput relative to some other system that is not very well explained, which is extremely impressive. The paper also explores the limits of how well you can do for performance and fault tolerance using general-purpose storage, because one of the themes of the paper is that they basically abandoned general-purpose storage: they switched away from a design that used Amazon's own general-purpose storage infrastructure, decided it was not good enough, and built totally application-specific storage instead. Furthermore, the paper has a lot of little tidbits about what turned out to be important in the cloud-infrastructure world.

Before talking about Aurora I want to spend a bit of time going over the back history, or at least my impression of the story that led up to Aurora's design, because it reflects the way Amazon has in mind that its cloud customers ought to build databases on Amazon's infrastructure. In the beginning, Amazon's very first cloud offering, to support people who wanted to build websites using Amazon's hardware in Amazon's machine rooms, was something called EC2, for Elastic Compute Cloud. The idea was that Amazon had big machine rooms full of servers, ran virtual machine monitors on those servers, and rented out virtual machines to customers, who would rent a bunch of virtual machines and run web servers, databases, and whatever else they needed inside these EC2 instances. So the picture of one physical server looked like this: Amazon controls the virtual machine monitor on the hardware server, and then there's a bunch of guests, a bunch of EC2 instances, each one rented to a different cloud customer, each running a standard operating system like Linux plus a web server or maybe a database server. These were relatively cheap and relatively easy to set up, and it was a very successful service. One little detail that's extremely important for us is how you got storage initially: every one of their servers had a physical disk attached, and each instance rented to a customer got a slice of that local disk.
So this was locally attached storage: you got a bit of local disk that looked to the virtual machine guest just like a hard drive, an emulated hard drive. EC2 was just about perfect for stateless web servers: your customers with their web browsers would connect to a bunch of rented EC2 instances running a web server, and if you suddenly had more customers you could instantly rent more EC2 instances from Amazon and fire up web servers on them — an easy way to scale up your ability to handle web load. So it was good for web servers.

The other main thing people ran in EC2 instances was databases, because usually a website is constructed as a set of stateless web servers that, any time they need to get at permanent data, talk to a back-end database. So what you would have is a bunch of client browsers in the outside world, outside Amazon's infrastructure; then a number of EC2 web-server instances, as many as you need to run the logic of the website, inside Amazon; and then also, typically, one EC2 instance running a database. Your web servers would talk to your database instance and ask it to read and write records in the database.

Unfortunately, EC2 wasn't nearly as well suited to running a database as it was to running web servers, and the most immediate reason is that the main easy way to get storage for your EC2 database instance was the locally attached disk of whatever piece of hardware your database instance happened to be running on. If that hardware crashed, you also lost access to whatever was on its hard drive. If the hardware running a web server crashed, no problem at all, because a web server keeps essentially no state itself — you just fire up a new web server on a new EC2 instance. But if the hardware running your database instance crashes or becomes unavailable, you have a serious problem if the data is stored on the locally attached disk. Initially there wasn't a lot of help for this. One thing that did work out well is that Amazon provided a scheme for storing large chunks of data called S3, and you could take periodic snapshots of your database state, store them in S3, and use that for backup and disaster recovery. But that style of periodic snapshots means you're going to lose the updates that happened between the periodic backups.

So the next thing that came along that's relevant to the Aurora story is that, in order to provide customers with disks for their EC2 instances that didn't go away if there was a failure — more fault-tolerant, long-term storage — Amazon introduced a service called EBS, which stands for Elastic Block Store.
EBS is a service that looks, to an EC2 instance — to one of these guest virtual machines — just as if it were an ordinary hard drive. You can format it with a file system like ext3 or whatever Linux file system you like, on this thing that looks to the guest just like a hard drive. But the way it's actually implemented is as a replicated pair of storage servers. So instead of local storage, you could now rent an EBS volume, which looks just like an ordinary hard drive but is actually implemented as a pair of EBS servers, each with an attached hard drive, using the chain replication we talked about last week. Say you're running a database now, and your database mounts one of these EBS volumes as its storage: when the database server does a write, what that actually means is that the write is sent out over the network and, with chain replication, is first written to the first EBS server backing your volume, then to the second one, and only then do you get the reply; and similarly, when you do a read, with chain replication you read from the last server in the chain.

So now a database running on an EC2 instance had available a storage system that would actually survive the crash — the death — of the hardware it was running on. If the physical server died, you could just get another EC2 instance, fire up your database on it, and have it attach to the same old EBS volume that the previous incarnation of your database was attached to, and it would see all the old data just as the previous database had left it, as if you'd moved a hard drive from one machine to another. So EBS was really a good deal for people who needed to keep permanent state, like people running databases.

One thing that's important for us about EBS is that it's not a system for sharing: at any one time, only one EC2 instance, only one virtual machine, can mount a given EBS volume. The EBS volumes are implemented on a huge fleet of hundreds or more storage servers with disks at Amazon, and everybody's EBS volumes are stored on this big pool of servers, but each EBS volume can be used by only one EC2 instance, only one customer, at a time.

Still, EBS was a big step up, but it still had some problems. One is that if you run a database on EBS, it ends up sending large volumes of data across the network — and here we're starting to sneak up on Figure 2 in the paper, where they start complaining about just how many writes it takes if you run a database on top of a network storage system. So a database on EBS generated a lot of network traffic, and one thing the paper implies is that they're as much network-limited as they are CPU- or storage-limited.
That is, the Aurora paper pays a huge amount of attention to reducing the network traffic the database generates, and seems to worry much less about how much CPU time or disk space is being consumed; that's a hint at what they think is important. The other problem is that EBS is not very fault-tolerant. It turns out that, for performance reasons, Amazon would always put both replicas of your EBS volume in the same data center. So if a single server crashed — if one of the two EBS servers you're using crashed — that's okay, because you switch to the other one; but there was just no story at all for what happens if an entire data center went down. And apparently a lot of customers really wanted a story that would allow their data to survive an outage of an entire data center — maybe it lost its network connection, or there was a fire in the building, or a power failure to the whole building. People really wanted at least the option, if they're willing to pay more, of having their data stored in a way that they could still get at it even if one data center goes down. The way Amazon describes this is that both an instance and its two EBS replicas are in the same availability zone. In Amazon jargon, an availability zone is a particular data center, and the way they structure their data centers is that there are usually multiple independent data centers in more or less the same city, or relatively close to each other, and the two or three nearby availability zones are all connected by redundant high-speed networks. So there are always pairs or triples of nearby availability zones, and we'll see why that's important in a little bit. But at least for EBS, in order to keep the cost of chain replication down, they required the two replicas to be in the same availability zone.

All right. Before I dive further into how Aurora actually works: to understand the details of the design, we first have to know a fair amount about the design of typical databases, because what they've done is take the main machinery of a database — MySQL, as it happens — and split it up in an interesting way, so we need to know what a database does in order to understand how they split it up. So this is really a kind of database tutorial, focusing on what it takes to implement transactions — crash-recoverable transactions. What I really care about is transactions and crash recovery; there's a lot else going on in databases, but this is the part that matters for this paper.

First, what's a transaction? A transaction is just a way of wrapping multiple operations on maybe different pieces of data and declaring that that entire sequence of operations should appear atomic to anyone else who's reading or writing the data. So, for example, maybe we're running a bank and we want to do transfers between different accounts.
The code for such a transaction might look like this: you declare the beginning of the sequence of statements that you want to be atomic — the transaction — and maybe we're going to transfer money from account y to account x. Pretending x is a bank balance stored in the database, the transaction might look like: add ten dollars to x's account, deduct the same ten dollars from y's account, and that's the end of the transaction. I want the database to do both of these without allowing anybody else to sneak in and see the state between the two statements. And, with respect to crashes, if there's a crash somewhere in the middle, we want to make sure that after the crash and recovery either the entire transaction's worth of modifications is visible, or none of it is. That's the effect we want from transactions. Additionally, database users expect that the database will tell the client that submitted the transaction whether the transaction really finished and committed or not, and if a transaction has committed, clients expect it to be permanent — durable — still there even if the database should crash and reboot.

One thing that's a bit important is that the usual way these are implemented is that the transaction locks each piece of data before it uses it. So you can view there as being locks on x and y for the duration of the transaction, and they're released only after the transaction finally commits and is known to be permanent. This matters because some of the details in the paper only make sense if you realize that the database is actually locking out other access to the data during the life of a transaction.
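Just to make the running example concrete, here's a toy sketch — a made-up, in-memory model, nothing like real MySQL internals — of the transfer with the locking behavior I just described:

    package main

    import (
        "fmt"
        "sync"
    )

    // Toy model: x and y are bank balances; the mutex stands in for the
    // per-record locks the database holds for the life of the transaction.
    var (
        mu      sync.Mutex
        balance = map[string]int{"x": 500, "y": 750}
    )

    func transfer() {
        mu.Lock()         // acquire locks before touching x and y
        defer mu.Unlock() // released only after the transaction is durable
        balance["x"] += 10
        balance["y"] -= 10
        // a real database would force the write-ahead log to disk here,
        // and only then report "committed" and release the locks
    }

    func main() {
        transfer()
        fmt.Println(balance["x"], balance["y"]) // 510 740
    }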
So how is this actually implemented? The databases we're considering are typically written to run on a single server with some storage directly attached, and the game the Aurora paper is playing is to move that software, only modestly revised, onto a much more complex networked system. But the starting point is a database with a disk attached. The on-disk structure that stores the records is some kind of indexing structure, like a B-tree: there are pages — what the paper calls data pages — that hold the real data of the database; maybe this page holds x's balance and that one holds y's. These data pages typically hold lots and lots of records, whereas x and y are just a couple of bytes on some page. So on the disk there's the actual data, plus there's also a write-ahead log, or WAL, and the write-ahead log is a critical part of why the system is going to be fault-tolerant. Inside the database server there's the database software, and the database typically has a cache of pages that it has recently read from the disk.

When you execute a transaction — when you actually execute these statements — what x = x + 10 turns into at runtime is that the database reads the current page holding x from the disk and adds 10 to it; but, until the transaction commits, it only makes the modification in the local cache, not on the disk, because we don't want to write to the disk yet and possibly expose a partial transaction.

Then, because the database wants the complete transaction to be available to the software after a crash, during recovery, before the database is allowed to modify the real data pages on disk it is first required to add log entries describing the transaction: before it can commit the transaction, it needs to put a complete set of entries in the write-ahead log on disk describing all of the database's modifications. So let's suppose x starts out as, say, 500 and y starts out as 750, and we want to execute this transaction. Before committing, and before writing the pages, the database is going to add typically three log records. One says: as part of this transaction, I'm modifying x, its old value is 500, and here's the new value, 510 — so each log entry names the item being modified, the old value, and the new value. Another for y: old value 750, we're subtracting 10, so the new value is 740. And then, if the database actually manages to get to the end of the transaction before crashing, it writes a commit record. Typically these are all tagged with a transaction ID so that the recovery software will eventually know which commit record refers to which log records.

Yes? [A student asks why the old values are needed.] In a simple database it would be enough to just store the new values and say: if there's a crash, we'll just reapply all the new values. The reason most serious databases store the old value as well as the new value is to give themselves freedom, even for a long-running transaction, to write an updated page to disk before the transaction has finished — with the new value, 740 say, from an uncompleted transaction — as long as the log record is already on disk. Then, if there's a crash before the commit, the recovery software can say: aha, this transaction never finished, therefore we have to undo all of its changes, and these old values are exactly what you need to undo a transaction that's been partially written to the data pages. Aurora indeed uses undo/redo logging to be able to undo partially applied transactions.

Okay, so if the database manages to get the transaction's log records onto the disk, plus the commit record marking it finished, then it is entitled to tell the client that the transaction committed.
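Here's a rough picture — invented field names, not InnoDB's actual log format — of the three log records for this transaction, and the rule for when the database may acknowledge the commit:

    package main

    import "fmt"

    // One write-ahead-log entry. Update records carry both the old value
    // (needed to undo an uncommitted transaction) and the new value
    // (needed to redo a committed one).
    type LogRecord struct {
        TxnID int
        Kind  string // "update" or "commit"
        Key   string
        Old   int
        New   int
    }

    func main() {
        wal := []LogRecord{
            {TxnID: 7, Kind: "update", Key: "x", Old: 500, New: 510},
            {TxnID: 7, Kind: "update", Key: "y", Old: 750, New: 740},
            {TxnID: 7, Kind: "commit"},
        }
        // Only once all three records (including the commit record) are
        // safely on disk may the database tell the client "committed".
        fmt.Println(len(wal), "log records for transaction 7")
    }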
The database can reply to the client, and the client can be assured that its transaction will be visible forever. Now one of two things happens. If the database server doesn't crash, then — it has modified these x and y records in its cache to be 510 and 740 — eventually the database will write its cached, updated blocks to their real places on the disk, overwriting those B-tree nodes or whatever, and then the database can reuse that part of the log. Databases tend to be lazy about this, because there may be many updates to those cached pages and it's nice to accumulate a lot of updates before being forced to write the disk. If the database server crashes before writing those pages to the disk — so they still have their old values — then it's guaranteed that the recovery software, when the database restarts, will scan the log, see the records for the transaction, see that the transaction was committed, and apply the new values to the stored data. That's called redo: it basically redoes all the writes in the transaction. So that's how transactional databases work, in a nutshell, and this is an extremely abbreviated version of how, for example, the MySQL database works. Aurora is based on this open-source database software called MySQL, which does transactions and crash recovery in much this way.

Okay, so the next step in Amazon's development of better and better database infrastructure for its cloud customers is something called RDS, and I'm only talking about RDS because it turns out that, even though the paper doesn't quite mention it, Figure 2 in the paper is basically a description of RDS. What's going on in RDS is that it was a first attempt to get a database that was replicated in multiple availability zones, so that if an entire data center went down, you could get back your database contents without missing any writes. The deal with RDS is that there's one EC2 instance that's the database server — just one, running one database — and it stores its data pages and log not on the local disk but in EBS. So whenever the database does a log write or a page write, those writes actually go to the two EBS replicas, all in one availability zone. In addition, for every write the database software does, Amazon would transparently — without the database necessarily even realizing it was happening — also send those writes to a setup in a second availability zone, a second machine room: judging from Figure 2, a separate computer or EC2 instance whose job was just to mirror the writes that the main database did. That mirroring server would then copy those writes to a second pair of EBS servers.
So with this RDS setup — and that's what Figure 2 shows — every time the database appends to the log or writes one of its pages, the data has to be sent to the two local EBS replicas, and has to be sent over the network connection across to the other availability zone on the other side of town, to the mirroring server, which then sends it to its two separate EBS replicas; and then finally the reply comes back, and only then is the write finished — only then can the database say, aha, my write finished, I can count this log record as really having been appended to the log. This RDS arrangement gets you better fault tolerance, because now you have a complete, up-to-date copy of the database — seeing all of the very latest writes — in a separate availability zone. Even if a fire burns down this entire data center, boom, you can run the database in a new instance in the second availability zone and lose no data at all.

Yes? [A student asks why EBS itself doesn't replicate across availability zones.] I don't know how to answer that; that's just not what they do. My guess is that for most EBS customers it would be too painfully slow to forward every write across two separate data centers. I'm not really sure what's going on, but I think the main answer is that they don't do that, and this RDS scheme is a bit of a workaround for the way EBS works — a way to get cross-zone replication while using the existing EBS infrastructure unchanged.

Anyway, this turns out to be extremely expensive, or at least as expensive as you might think, because fairly large volumes of data get written. Even this transaction, which seems like it just modifies two integers — maybe eight bytes, or sixteen, only a few bytes of data are being modified — translates, as far as the database reading and writing the disk is concerned, into this: the log records are actually quite small, these two log records might themselves be only dozens of bytes long, so that's nice; but the reads and writes of the actual data pages are likely to be much, much larger than a couple dozen bytes, because each page is eight kilobytes or sixteen kilobytes or some relatively large number — the file-system or disk block size. That means that just to update these two numbers, when it comes time to update the data pages, there's a lot of data being pushed around. On a locally attached disk that's reasonably fast, but I guess what they found is that when they started sending those big eight-kilobyte writes across the network, it used up too much network capacity, and so this Figure 2 arrangement was evidently too slow.

Yes? [A student asks what the database has to wait for.] In this Figure 2 setup, unknown to the database server, every time it wrote to its EBS disk, a copy of every write went across availability zones and had to be written to both of those remote EBS servers and acknowledged; only then did the write appear to complete to the database. So it really had to wait for all four copies to be updated, and for the data to be sent on the link across to the other availability zone.
As far as Table 1 is concerned — that first performance table — the reason the mirrored-MySQL line is much, much slower than the Aurora line is basically that it sends huge amounts of data over these relatively slow network links, and that was the performance problem they were really trying to fix. So this setup is good for fault tolerance, because now we have a second copy in another availability zone, but it was bad news for performance.

All right, the next step after this is Aurora. The high-level view is that we still have a database server, although now it's running custom software that Amazon supplies. So I can rent an Aurora server from Amazon, but I'm not running my software on it; I'm renting a server running Amazon's Aurora database software. It's just one instance, and it sits in some availability zone. There are two interesting things about the way it's set up. First of all, the storage — its replacement, basically, for EBS — now involves six replicas, two in each of three availability zones, for super fault tolerance. And every time the database writes — it's more complicated than this, and we're not sure exactly how it's managed, but more or less — the writes have to get sent, one way or another, to all six of these replicas.

So this looks like more replicas. Why isn't it slower than the previous scheme, which only had four replicas? The answer is that the only thing being written over the network is the log records. That's really the key to the success: the data that goes over these links to the replicas is just the log records, the log entries. And a log entry — at least in this simple example; real ones aren't quite this small — is really not vastly more than the couple of dozen bytes needed to store the old value and the new value for the piece of data we're writing. So log entries tend to be quite small, whereas when the database thought it was writing a local disk and updating its data pages, those writes were enormous — the paper doesn't really say, I think, but eight kilobytes or more. So the RDS setup was sending, for each transaction, multiple eight-kilobyte pages across to the replicas, whereas this setup sends small log entries to more replicas — and the log entries are so very much smaller than 8K pages that it's a net performance win. So that's one of their big insights: send just the log entries over the network.
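As a rough back-of-the-envelope comparison — my numbers, not the paper's; assume 8 KB data pages, log records of around 100 bytes, and the simplified picture from the board (two dirtied pages plus three log records per transaction):

    package main

    import "fmt"

    func main() {
        // Guessed sizes, just to show the shape of the argument.
        const pageBytes = 8192 // one data page
        const logBytes = 100   // one log record

        // Mirrored-MySQL style: the dirtied pages plus the log records,
        // each pushed to 4 EBS replicas across two availability zones.
        mirrored := (2*pageBytes + 3*logBytes) * 4

        // Aurora style: only the 3 log records, sent to 6 replicas.
        aurora := 3 * logBytes * 6

        fmt.Println("mirrored ≈", mirrored, "bytes; aurora ≈", aurora, "bytes")
        // mirrored ≈ 66736 bytes, aurora ≈ 1800 bytes with these made-up sizes
    }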
Of course, a piece of fallout from this is that their storage system is now not very general-purpose: it's a storage system that understands what to do with MySQL log entries. EBS was a very general-purpose emulated disk — you read and write blocks, and it doesn't understand anything about anything except blocks — whereas this is a storage system that really understands that it's sitting underneath a database. So that's one thing they've done: ditched general-purpose storage and switched to a very application-specific storage system.

The other big thing, which I'll also go into in more detail, is that they don't require the writes to be acknowledged by all six replicas in order for the database server to continue. Instead, the database server can continue as long as a quorum — which turns out to be four — as long as any four of these servers respond. So if one availability zone is offline, or maybe the network connection to it is slow, or maybe some servers just happen to be slow doing something else at the moment we're trying to write, the database server can basically ignore the two slowest, or the two most dead, of the servers when it's doing a write. It only requires acknowledgments from any four out of six, and then it can continue. This quorum scheme is the other big trick they use, to let them have more replicas in more availability zones and yet not pay a huge performance penalty, because they never have to wait for all of them — just the four fastest of the six replicas.

So the rest of the lecture is going to explain first quorums and then this idea of sending just log entries. Table 1 basically summarizes the result: by switching from the architecture in which they send the big data pages to four places, to this Aurora scheme of sending just log entries to six replicas, they get an amazing thirty-five-times performance increase over that other system. The paper is not very good about explaining how much of the performance is due to quorums and how much is due to just sending log entries, but any way you slice it, a thirty-five-times improvement in performance is very respectable, and of course extremely valuable to their customers and to them — transformative, I'm sure, for many of Amazon's customers.

All right. The first thing I want to talk about in detail is their quorum arrangement — what they actually mean by quorums. The quorums are all about the arrangement of this fault-tolerant storage, so it's worth thinking a little bit about what their fault-tolerance goals were.
The goals: they wanted to be able to do reads and writes even if one availability zone was completely dead, and to be able to read even if there was one dead availability zone plus one other dead server. The reason for this is that an availability zone might be offline for quite a while — maybe it suffered a flood or something — and while it's down for a couple of days or a week while people repair the damage, we're now reliant on just the servers in the other two availability zones; if one of them should go down, we don't want that to be a disaster. So: write even with one dead availability zone, and furthermore read — still read and get the correct data — even with one dead availability zone plus one other dead server in the live availability zones. We have to take it for granted that they know their own business and that this really is the sweet spot for how fault-tolerant you want to be. In addition, as I already mentioned, they want to be able to ride out temporarily slow replicas. I think it's clear from a lot of sources that if you read and write EBS, for example, you don't get consistently high performance all the time; sometimes there are little glitches, because maybe some part of the network is overloaded or something is doing a software upgrade, and it's temporarily slow. So they want to be able to just keep going despite transiently slow, or briefly unavailable, storage servers.

A final requirement is that if a storage server should fail, it's a bit of a race against time before the next storage server fails. That's sort of always the case, and the statistics are not as favorable as you might hope, basically because server failures are often not independent: the fact that one server is down often means there's a much-increased probability that another of your servers will soon go down, because it's identical hardware, maybe bought from the same company, coming off the same production line one after another, and a flaw in one of them is extremely likely to be reflected in a flaw in another. So people are always nervous: if there's one failure, there could be a second failure very soon. And in these quorum systems — it's a little bit like Raft — you can only recover as long as not too many of the replicas fail. So they really needed fast re-replication: if one server seems permanently dead, we'd like to be able to generate a new replica as fast as possible from the remaining replicas.

Those are the main fault-tolerance goals the paper lays out. By the way, this discussion is only about the storage servers — their failure characteristics and how to recover. It's a completely separate topic what to do if the database server fails; Aurora has a totally different set of machinery for noticing that a database server has failed and creating a new database server running on a new instance, and that's not what I'm talking about right now — we'll get to it a little later. Right now the focus is just on the storage system.
Okay, so to make the storage fault-tolerant they use this idea called quorums. For a little while now I'm going to describe the classic quorum idea, which dates back to the late seventies — quorum replication. I'll describe the abstract quorum idea; they use a variant of what I'm going to explain. The idea behind quorum systems is to build storage that provides fault tolerance using replication, and to guarantee that even if some of the replicas fail, reads will still see the most recent writes. Typically quorum systems are simple read/write systems — put/get systems; they don't usually directly support more complex operations: you have objects, and you can read an object or overwrite an entire object.

The idea is that you have N replicas. In order to write, you have to make sure your write is acknowledged by W of the replicas, where W is at most N. And if you want to do a read, you have to get read information from at least R replicas. The key thing is that W and R have to be set relative to N so that any quorum of W servers you manage to send a write to must necessarily overlap with any quorum of R servers that any future reader might read from. What that means is that R plus W has to be greater than N, so that any W servers must overlap, in at least one server, with any R servers.

So imagine three servers, S1, S2, S3, and say we just have one object that we're updating. We send out a write; maybe we want to set the value of our object to 23. In order to do a write we need to get our new value onto at least W of the replicas; let's say for this system that R and W are both 2 and N is 3. To do a write, we need to get our new value onto a write quorum of two servers — maybe we get our write onto S1 and S2, so they both now know that the value of our data object is 23. If somebody comes along and reads, the read also requires that the reader check with at least a read quorum of the servers, which is also two in this setup. That read quorum could include a server that didn't see the write, but it has to include at least one that did, in order to get to two. So any future read must, for example, consult both the server that didn't see the write and at least one that did: the requirement that read and write quorums overlap in at least one server means any read must consult at least one server that saw any previous write.

Now, what's cool about this — well, actually, there's still one critical piece missing. The reader is going to get back R results, possibly R different results, and the question is how the reader knows which of the R results it got back is the correct one.
Something that doesn't work is voting — just voting by popularity among the different values you get back. That turns out not to work, because we're only guaranteed that the reader overlaps with the writer in at least one server, so it could be that the correct value is represented by only one of the servers the reader consulted. In a system with, say, six replicas, the read quorum might be four: you might get back four answers, and only one of them is the correct answer, from the one server where you overlap with the previous write. So you can't use voting. Instead, these quorum systems need version numbers: every time you do a write, you accompany your new value with an increasing version number, and the reader, which gets back a bunch of different values from its read quorum, just uses the one with the highest version number. So the 23 on S1 and S2 — maybe S3 had an old value of 20 — each of these is tagged with a version number: maybe S1's is version 3, S2's is also version 3 because it came from the same write, and the server that didn't see the write has version 2. The reader gets back these values with their version numbers and picks the value with the highest version number. And in Aurora this is essentially — well, never mind about Aurora for a moment.

Furthermore, if you can't actually contact a quorum for a read or a write, you really just have to keep trying — those are the rules — keep trying until the servers are brought back up or reconnected. The reason this is preferable to something like chain replication is that it can easily ride out temporarily dead, disconnected, or slow servers. In practice, if you want to write, you send the newly written value, plus its version number, to all N of the servers, but you only wait for W of them to respond; and similarly, if you want to read, you send the read to all of the servers and only wait for R of them to respond. Because you only have to wait for R (or W) out of N, you can continue after the fastest R or the fastest W have responded, and you don't have to wait for a slow server or a server that's dead. The machinery for ignoring slow or dead servers is completely implicit: there's nothing here about having to make decisions about which servers are up or down, or who the leader is; it just automatically proceeds as long as a quorum is available. So we get very smooth handling of dead or slow servers.
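Here's a minimal sketch of this classic quorum read/write logic — just the abstract scheme from the board, with made-up types, not Aurora's protocol: send to every replica, wait for the first W (or R) replies, and let version numbers pick the winner on the read side.

    package main

    import "fmt"

    type versioned struct {
        val, version int
    }

    // One toy replica: it stores a single object and keeps the highest-versioned value it has seen.
    type replica struct{ stored versioned }

    func (r *replica) write(v versioned) bool {
        if v.version > r.stored.version {
            r.stored = v
        }
        return true // acknowledgment
    }

    func (r *replica) read() versioned { return r.stored }

    const N, W, R = 3, 2, 2 // R + W > N, so read and write quorums overlap

    // Write: send to all N replicas, succeed once W have acknowledged.
    func quorumWrite(reps []*replica, v versioned) bool {
        acks := 0
        for _, r := range reps { // in reality these go out in parallel and we
            if r.write(v) { //     stop waiting after the fastest W acks
                acks++
            }
        }
        return acks >= W
    }

    // Read: ask all N, take the highest-versioned value among the first R replies.
    func quorumRead(reps []*replica) versioned {
        best := versioned{}
        for _, r := range reps[:R] { // pretend these R replied first
            if v := r.read(); v.version > best.version {
                best = v
            }
        }
        return best
    }

    func main() {
        reps := []*replica{{}, {}, {}}
        quorumWrite(reps, versioned{val: 23, version: 3})
        fmt.Println(quorumRead(reps)) // {23 3}
    }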
In addition — there's not much leeway in this simple three-server case, but even here — you can adjust R and W to favor either reads or writes. We could say the write quorum is three, so every write has to go to all three servers, and in that case the read quorum can be one. So if you wanted to favor reads, you could set R to one and W to three: now reads are much faster, they only have to wait for one server, but in return writes are slow. If you wanted to favor writes, you could say a read has to come from all of them but a writer only has to write one: only one server might have the latest value, and readers have to consult all three, but they're guaranteed that their three will overlap with that one. Of course, these particular values make writes (or, in the other case, reads) not fault-tolerant, because all the servers have to be up, so you probably wouldn't want to do this in real life; you would, as Aurora does, have a larger number of servers and intermediate read and write quorums.

Aurora, in order to achieve its goals of being able to write with one dead availability zone, and read with one dead availability zone plus one other dead server, uses a quorum system with N equal to six, W equal to four, and R equal to three. W equals four means it can do a write with one dead availability zone: if that zone can't be contacted, the other four servers are enough to complete a write. With a read quorum of three, four plus three is seven, greater than six, so reads and writes are definitely guaranteed to overlap. And a read quorum of three means that even if one availability zone is dead plus one more server, the three remaining servers are enough to serve a read. In that case, with three servers down, the system can do reads and can reconstruct the current state of the database, but it can't do writes without further work: with three dead servers they have enough of a quorum to read the data and reconstruct more replicas, but until they've created more replicas to replace the dead ones, they can't serve writes. And, as I explained before, the quorum system also allows them to ride out transient slow replicas.

Now, as it happens, writes in Aurora aren't really overwriting objects as in a classic quorum system. In fact, Aurora's writes never overwrite anything: its writes just append log entries to the current log. So the way it's using quorums is basically to say: when the database sends out a new log record, because it's executing some transaction, it needs to make sure that the log record is present on at least four of its storage servers before it's allowed to proceed with the transaction commit. That's really the meaning of Aurora's write quorum: each new log record has to be appended at at least four of the six replicas before the write can be considered to have completed.
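You can sanity-check that these numbers meet the fault-tolerance goals from earlier — this is just my arithmetic on the N, W, R values, not anything extra from the paper:

    package main

    import "fmt"

    func main() {
        const N, W, R = 6, 4, 3

        fmt.Println("quorums overlap:", R+W > N)                      // true: 3 + 4 = 7 > 6
        fmt.Println("write with one AZ (2 servers) down:", N-2 >= W)  // true
        fmt.Println("read with one AZ + 1 more (3) down:", N-3 >= R)  // true
        fmt.Println("write with one AZ + 1 more down:", N-3 >= W)     // false: must re-replicate first
    }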
And when Aurora gets to the end of a transaction, before it can reply to the client — before it can tell the client, your transaction is committed and finished and durable — Aurora has to wait for acknowledgments from a write quorum for each of the log records that made up that transaction. In fact, because after a crash and recovery you're not allowed to recover one transaction if preceding transactions aren't also recovered, in practice, before Aurora can acknowledge a transaction, it has to wait for a write quorum of storage servers to respond for all previously committed transactions as well as the transaction of interest, and only then can it respond to the client.

Okay, so these storage servers are getting incoming log records — that's what writes look like to them. So what do they actually do? They're not getting new data pages from the database server; they're just getting log records that describe changes to the data pages. Internally, one of these storage servers has copies of all the data pages, as of some point in the database's evolution. So it has, maybe in its cache or on its disk, a whole bunch of these pages: page one, page two, and so forth. When a new write comes in — a new write carrying just a log record — what has to happen some day, but not right away, is that the change in that log record, the new value, has to be applied to the relevant page. But the storage server doesn't have to do that until someone asks — until the database server or the recovery software asks to see that page. So what happens immediately to a new log record is that it's just appended to a list of log records that affect each page. For every page the storage server stores, if the page has been recently modified by a log record from a transaction, what the storage server actually stores is an old version of the page plus the sequence of log records that have come in from the database server since that page was last brought up to date. If nothing else happens, the storage server just stores these old pages plus lists of log records. If the database server later evicts the page from its cache and then needs to read it again for a future transaction, it sends a read request to one of the storage servers saying: I need an updated copy of page one. At that point the storage server applies those log records to the page — does the writes of new data described in the log records — and sends the updated page back to the database server; and presumably it can then erase the list and just store the newly updated page, although it's not quite that simple. So the storage servers store these strings of log records plus old page versions.
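Here's a rough sketch of what one storage server keeps per page and what it does when the database asks for a page — an invented structure, just to capture the "old page plus pending log records, applied lazily on read" idea:

    package main

    import "fmt"

    // A log record that says "set this page's copy of the record to New".
    // (Real records describe B-tree page modifications; this is simplified.)
    type LogRec struct {
        Page int
        New  int
    }

    // Per page: an old version plus the log records received since the page
    // was last brought up to date.
    type pageState struct {
        data    int // stand-in for an 8 KB page
        pending []LogRec
    }

    type storageServer struct {
        pages map[int]*pageState
    }

    // Writes are cheap: just append the log record to the page's pending list.
    func (s *storageServer) applyLog(rec LogRec) {
        p := s.pages[rec.Page]
        p.pending = append(p.pending, rec)
    }

    // Reads are where the work happens: fold the pending records into the
    // old page and return the up-to-date version.
    func (s *storageServer) readPage(page int) int {
        p := s.pages[page]
        for _, rec := range p.pending {
            p.data = rec.New
        }
        p.pending = nil
        return p.data
    }

    func main() {
        s := &storageServer{pages: map[int]*pageState{1: {data: 500}}}
        s.applyLog(LogRec{Page: 1, New: 510})
        fmt.Println(s.readPage(1)) // 510
    }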
Now, the database server, as I mentioned, sometimes needs to read pages. One thing to observe is that the database server writes log records but reads data pages, so this is also a different kind of quorum system in the sense that the things being read and the things being written are quite different. In addition, it turns out that in ordinary operation the database server doesn't have to send quorum reads at all, because it tracks, for each of the storage servers, how much of the prefix of the log that storage server has actually received. The database server keeps track of six numbers. Log entries are numbered — one, two, three, four, five — and the database server sends new log entries to all the storage servers; a storage server that receives them responds saying, I got log entry 79, and furthermore I have every log entry before 79 as well. The database server keeps track of these numbers — how far each server has gotten, the highest contiguous log entry number each server has received — so that when the database server needs to do a read, it just picks a storage server that's up to date and sends the read request for the page it wants to just that one storage server. So the database server does have to do quorum writes, but ordinarily it doesn't have to do quorum reads: it knows which of the storage servers are up to date and just reads from one of them. So reads are cheaper than they would otherwise be — the database server reads one copy of the page and doesn't have to go through the expense of a quorum read.
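A sketch of how the database server can skip quorum reads in normal operation — made-up numbers, but the idea is just "remember each server's highest contiguous log index and read from any server that is caught up":

    package main

    import "fmt"

    func main() {
        // Highest contiguous log entry number acknowledged by each of the
        // six storage servers (imaginary values).
        acked := []int{103, 101, 103, 102, 103, 99}

        need := 103 // this read needs a server that has everything through entry 103

        for server, n := range acked {
            if n >= need {
                fmt.Println("read the page from server", server) // server 0, the first up-to-date one
                break
            }
        }
    }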
Now, Aurora does sometimes use quorum reads: it turns out that's during crash recovery of the database server — and this is different from crash recovery of the storage service. The database server runs in an EC2 instance on some real piece of hardware; maybe that hardware suffers a failure and the database server crashes. There's monitoring infrastructure at Amazon that says, wait a minute, the Aurora database server we were running for this customer just crashed, and Amazon will automatically fire up a new EC2 instance, start the database software on it, and tell it: your data is sitting on this particular volume, this set of storage servers; please clean up any partially executed transactions that are evident in the logs stored on those storage servers, and then continue. That's the point at which Aurora uses quorum logic for reads, because when the previous database server crashed it was almost certainly partway through executing some set of transactions. The state of play at the time of the crash was: it had completed some transactions and committed them, and their log entries are on a quorum; plus it was in the middle of executing some other set of transactions, which also may have log entries on a quorum, but which can never be completed, because the database server crashed midway through them.

For those incomplete transactions there may also be gaps in the log: maybe one server's log ends at entry 101, another's at 102, and a 104 exists somewhere, but for some as-yet-uncommitted transaction no server ever got a copy of log entry 103 before the crash. So after a crash, the new database server doing the recovery does quorum reads to find the point in the log — the highest log entry number for which every preceding log entry exists somewhere in the storage service. Basically it finds the number of the first missing log entry, which here is 103, and says: we're missing a log entry, so we can't do anything with the log after this point, because we're missing an update. The database server does these quorum reads, finds that 103 is the first entry that's missing — it looks at the quorum of servers it can reach, and 103 is not there — and sends a message to all the servers saying: please just discard every log entry from 103 onwards. Those discarded entries necessarily do not include log entries from committed transactions, because we know a transaction can't commit until all of its entries are on a write quorum, so we would be guaranteed to see them; we're only discarding log entries from uncommitted transactions, of course. So we're cutting off the log at entry 102. The log entries we're preserving may still include entries from uncommitted transactions — transactions that were interrupted by the crash — and the database server has to detect those, which it can do by seeing that a transaction has update entries in the log but no commit record. The database server finds the full set of those uncompleted transactions and basically issues undo operations — new log entries that undo all of the changes those uncommitted transactions made. And that's the point at which Aurora needs those old values in the log entries: so that a server doing recovery after a crash can back out of partially completed transactions.
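Here's a sketch of that recovery-time computation — find the first missing log entry among the servers you can reach (a read quorum) and truncate from there; the representation is invented, not Aurora's actual algorithm:

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Union of log entry numbers found on a read quorum of storage
        // servers after the database server crashed (made-up example).
        entries := []int{100, 101, 102, 104, 105}
        sort.Ints(entries)

        // Find the first gap. Entries at or after the gap cannot belong to
        // committed transactions, since the database only acknowledges a
        // commit once everything up through it is on a write quorum, so
        // it is safe to discard from here on.
        next := entries[0]
        for _, e := range entries {
            if e != next {
                break
            }
            next++
        }
        fmt.Println("tell every server: discard log entries from", next, "onward") // 103
    }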
Alright, another thing I'd like to talk about is how Aurora deals with big databases. So far I've explained the storage setup as if the database just has these six replicas of its storage, each of them a computer with a disk or two attached. If that were the whole story, we couldn't have a database bigger than the amount of storage you can put on a single machine; the fact that we have six machines doesn't give us six times as much usable storage, because each one is storing a replica of the same data over again. With solid-state drives we can put terabytes of storage on a single machine, but we can't put hundreds of terabytes on a single machine. So in order to support customers who need more than, say, ten terabytes, who need vast databases, Amazon will split the database's data onto multiple sets of six replicas. The unit of sharding, the unit of splitting up the data, is I think 10 gigabytes. A database that needs 20 gigabytes of data will use two protection groups, these PG things, to store its data: half of it will sit on the six servers of protection group one, and then there'll be another six servers, possibly a different set of six storage servers, because Amazon runs a huge fleet of these storage servers that are jointly used by all of its Aurora customers. The second ten gigabytes of the database's 20 gigabytes of data will be replicated on another, typically different, set of six servers; there could be overlap between the two sets, but typically it's just a different set of six. So now we get 20 gigabytes of data, and we add more of these protection groups as a database grows bigger. One interesting piece of fallout from this is that while it's clear you can take the data pages and split them up over multiple independent protection groups, maybe odd-numbered data pages from your B-tree go on PG one and even-numbered pages go on PG two, so sharding the data pages is easy, it's not immediately obvious what to do with the log. How do you split up the log if you have two or more of these protection groups? The answer Aurora uses is that when the database server sends out a log record, it looks at the data that the log record modifies, figures out which protection group stores that data, and sends each log record just to the protection groups that store data modified by that log entry. That means each protection group stores some fraction of the data pages plus all the log records that apply to those data pages; each protection group stores the subset of the log that's relevant to its pages.
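A tiny sketch of that routing decision, with invented page and partitioning details just for illustration: the 10 gigabyte segment size is from the lecture, but the 16 KB page size and the simple range partitioning are assumptions, not Aurora's actual scheme.

    PAGE_SIZE = 16 * 1024                  # assumed 16 KB data pages
    PG_SIZE = 10 * 1024 ** 3               # 10 GB per protection group segment
    PAGES_PER_PG = PG_SIZE // PAGE_SIZE

    def protection_group_for(page_id):
        # simple range partitioning of data pages onto protection groups
        return page_id // PAGES_PER_PG

    def route_log_record(record):
        # record["pages"]: the data pages this log record modifies;
        # return the set of protection groups that must receive the record
        return {protection_group_for(p) for p in record["pages"]}

    # usage: a record touching pages in two groups is sent to both groups
    print(route_log_record({"lsn": 500, "pages": [3, PAGES_PER_PG + 7]}))  # {0, 1}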
A final requirement, maybe I erased the fault-tolerance requirements from the board, but a final requirement is that if one of these storage servers crashes, we want to be able to replace it as soon as possible, because if we wait too long we risk three or four of them crashing, and if four crash then we can't recover at all, because we no longer have a quorum. So we need to regain full replication as soon as possible. Think about any one storage server: sure, this server is storing 10 gigabytes for my database's protection group, but the physical setup of any one of these servers is that it has, say, a one or two terabyte disk storing 10 gigabyte segments from a hundred or more different Aurora instances. So what's on this physical machine is a terabyte, or ten terabytes, or whatever, of data in total. When one of these storage servers crashes, it takes with it not just the 10 gigabytes from my database but also 10 gigabytes from a hundred other people's databases, and what has to be re-replicated is not just my 10 gigabytes but the entire terabyte or more stored on this server's solid-state drive. If you think through the numbers: maybe we have 10 gigabit per second network interfaces; moving 10 terabytes across a 10 gigabit per second link from one machine to another takes something like ten thousand seconds, and that's way too long. We don't want a strategy where the way we reconstruct this data is to have one other machine that was replicating everything send 10 terabytes to a single replacement machine; we want to reconstruct the data far faster than that. So the actual setup they use is this: a particular storage server stores many, many segments, replicas of many 10 gigabyte protection groups. For one segment it stores, call it protection group A, the other replicas are on five other machines; those six machines are all storing segments of protection group A. And there's a whole bunch of other segments on this server too: maybe it also stores a replica for protection group B, but the other copies of B's data sit on a disjoint set of servers, so a different set of five servers holds the other copies of B, and so on for all the segments sitting on this storage server's drive, belonging to many different Aurora instances. So when this machine goes down, the replacement strategy is: say it was storing a hundred segments, we pick a hundred different storage servers, each of which picks up one new segment, that is, each of which now participates in one more protection group. We pick one server to re-replicate onto for each of these 10 gigabyte segments; now we have maybe a hundred different storage servers, probably storing other stuff but with a little free disk space. Then for each segment we pick one of the remaining replicas to copy the data from: maybe for A we copy from this one, for B from that one; if we have five other copies of C we pick yet another server as the source for C. So we copy A from this server to that server, B like this, C like this, and now we have a hundred different 10 gigabyte copies going on in parallel across the network. Assuming we have enough servers that these pairs can all be disjoint, and plenty of bandwidth in the switching network that connects them, we can copy our terabyte or 10 terabytes of data with a hundredfold parallelism, and the whole thing takes something like 10 seconds instead of the thousands of seconds it would take with just two machines involved. Anyway, this is the strategy they use, and it means that when one machine dies they can recover from that machine's death, in parallel, extremely quickly. If lots of machines die it doesn't work as well, but they can re-replicate after single-machine crashes extremely quickly.
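The arithmetic behind that, plus a sketch of the scheduling idea, assuming 10 Gbit/s links and made-up data structures for which servers hold which segments:

    def seconds_to_copy(bytes_total, gbit_per_sec):
        return bytes_total * 8 / (gbit_per_sec * 1e9)

    one_link = seconds_to_copy(10 * 1e12, 10)     # ~8000 s to push 10 TB over one link
    per_segment = seconds_to_copy(10 * 1e9, 10)   # ~8 s for one 10 GB segment
    print(one_link, per_segment)

    def plan_rereplication(lost_segments, surviving_replicas, spare_servers):
        # for each 10 GB segment on the dead server, pick one surviving replica
        # as the source and a distinct spare server as the destination, so all
        # the copies can run in parallel (illustrative scheduling, not Aurora's)
        plan = []
        for seg, spare in zip(lost_segments, spare_servers):
            plan.append((seg, surviving_replicas[seg][0], spare))
        return plan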
Alright, a final thing the paper mentions: if you look at figure 3 you'll see that not only do they have this main database, they also have replica databases. Many of their customers see far more read-only queries than read/write queries. Think about a web server: if you just view a page on some website, chances are the web server you connected to has to read lots of stuff to generate everything shown on the page, maybe hundreds of different items read out of some database, but the number of writes for a typical page view is usually much smaller, maybe some statistics get updated or a little bit of history recorded for you. So you might have a hundred-to-one ratio of reads to writes, that is, a very large number of pure read-only database queries. Now, with this setup the writes can only go through the one database server, because we really can only support one writer with this storage strategy. One place where the rubber really hits the road there is that the log entries have to be numbered sequentially, which is easy if all the writes go through a single server and extremely difficult if lots of different servers are all writing in an uncoordinated way to the same database. So the writes really have to go through one database server, but we can set up, and indeed Amazon does set up, read-only database replicas that can read from these storage servers. The full glory of figure 3 is that in addition to the main database server that handles the write requests, there's also a set of read-only databases, and they say they can support up to 15 of them. So if you have a read-heavy workload, most of it can be hived off to a whole bunch of these read-only databases. When a client sends a read request to a read-only database, that database figures out which data pages it needs to serve the request and sends reads directly into the storage system, without bothering the main read/write database. The read-only replica databases send page read requests directly to the storage servers, and then they cache those pages so they can respond to future read requests right out of their cache.
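A minimal sketch of that read path: a read-only replica's page cache that fetches a page from one storage server on a miss. fetch_page here is a stand-in for whatever RPC the storage servers actually expose, not a real API.

    class ReadReplicaCache:
        def __init__(self, fetch_page):
            self.fetch_page = fetch_page   # fetch_page(page_id) -> page contents
            self.cache = {}

        def read(self, page_id):
            if page_id not in self.cache:
                # miss: read one copy straight from the storage service,
                # bypassing the single read/write database server
                self.cache[page_id] = self.fetch_page(page_id)
            return self.cache[page_id]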
Of course, they need to be able to update those caches, and for that reason the main database also sends a copy of its log to each of the read-only databases; that's what the horizontal lines between the blue boxes in figure 3 are. The main database sends all the log entries to these read-only databases, which use them to update their cached copies of the pages to reflect recent transactions. It does mean the read-only databases lag a little bit behind the main database, but for a lot of read-only workloads that's okay: if you look at a web page and it's 20 milliseconds out of date, that's usually not a big problem. There are some complexities from this, though. One problem is that we don't want these read-only databases to see data from uncommitted transactions, so in this stream of log entries the main database needs to denote which transactions have committed, and the read-only databases are careful not to apply uncommitted transactions to their caches; they wait until the transactions commit. The other complexity these read-only replicas impose is that the data structures involved, a B-tree for example, are quite complex. A B-tree might need to be rebalanced periodically, and rebalancing is a complex operation in which a lot of the tree has to be modified; the tree is incorrect while it's being rebalanced, and you're only allowed to look at it after the rebalancing is done. If these read-only replicas directly read the pages out of the storage servers, there's a risk they might see the B-tree stored in those data pages in the middle of a rebalancing or some other multi-page operation, when the data is simply not legal, and they might crash or malfunction. When the paper talks about mini-transactions and the VDL versus VCL distinction, it's describing the machinery by which the database server can tell the storage servers: this complex sequence of log entries must only be revealed atomically, all or nothing, to any read-only transactions. That's what the mini-transactions and VDL are about; basically, when a read-only database asks a storage server for a data page, the storage server is careful to show it data either from just before one of these mini-transaction sequences of log entries or from just after, but not from the middle.
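Here is one way to picture what a replica has to do with that incoming log stream: buffer each transaction's updates and apply them to cached pages only at commit, so readers never observe a half-applied multi-page operation. This is my own illustration of the constraint described above; the real VDL/mini-transaction machinery lives partly in the storage servers and is more involved.

    class ReplicaLogApplier:
        def __init__(self, cache):
            self.cache = cache             # page_id -> cached page contents
            self.pending = {}              # txn id -> buffered update records

        def on_log_record(self, rec):
            if rec["type"] == "update":
                self.pending.setdefault(rec["txn"], []).append(rec)
            elif rec["type"] == "commit":
                # apply the whole transaction's updates together, in order, so the
                # cache never exposes a half-finished B-tree operation
                for upd in self.pending.pop(rec["txn"], []):
                    self.cache[upd["page"]] = upd["new_value"]
            elif rec["type"] == "abort":
                self.pending.pop(rec["txn"], None)   # never applied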
Alright, that's all the technical material I have. Just to summarize what's interesting about the paper and what can be learned from it: one thing to learn, which is just good in general and not specific to this paper, is that everybody in systems should know the basics of how transaction-processing databases work, and the impact of the interaction between transaction-processing databases and storage systems, because it comes up a lot; the performance and crash-recoverability complexity of running a real database comes up over and over again in systems design. Another thing to learn from this paper is the idea of quorums and overlap, the technique of overlapping read and write quorums so that you can always see the latest data while also getting fault tolerance; this comes up in Raft too, which has a strong quorum flavor to it. Another interesting thought from this paper is that the database and the storage system are basically co-designed; there's integration across the database layer and the storage layer. Ordinarily we try to design systems so they have good separation between consumers of services and the infrastructure services; typically storage is very general-purpose, not aimed at a particular application, both because that's a pleasant design and because lots of different uses can be made of the same infrastructure. But here the performance issues were so extreme, they were able to get something like a 35 times performance improvement by blurring that boundary, that this was a situation in which general-purpose storage was really not advantageous, and they got a big win by abandoning it. A final set of things to get out of the paper is all the interesting, sometimes implicit, information about what mattered to these Amazon engineers, who really know what they're doing, and what concerns they had about cloud infrastructure. The amount of worry they put into the possibility that an entire availability zone might fail is an important tidbit. The fact that transient slowness of individual storage servers was important is another thing that comes up a lot in practice. And finally there's the implication that the network is the main bottleneck: after all, they went to extreme lengths to send less data over the network, even though in return the storage servers have to do more work; they're willing to keep six copies of the data and have six CPUs all replicating the execution of applying these redo log entries. Apparently CPU was relatively cheap for them, whereas network capacity was extremely precious. Alright, that's all I have to say, and see you next week.